Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers
Certain answers are a principled method for coping with uncertainty that arises in many practical data management tasks. Unfortunately, this method is expensive and may exclude useful (if uncertain) answers. Thus, users frequently resort to less principled approaches to resolve the uncertainty. In this paper, we propose Uncertainty Annotated Databases (UA-DBs), which combine an under- and over-approximation of certain answers to achieve the reliability of certain answers, with the performance of a classical database system. Furthermore, in contrast to prior work on certain answers, UA-DBs achieve a higher utility by including some (explicitly marked) answers that are not certain. UA-DBs are based on incomplete K-relations, which we introduce to generalize the classical set-based notions of incomplete databases and certain answers to a much larger class of data models. Using an implementation of our approach, we demonstrate experimentally that it efficiently produces tight approximations of certain answers that are of high utility.
Data uncertainty arises naturally in applications like sensing (Letchner et al., 2009), data exchange (Fagin et al., 2011), distributed computing (Lang et al., 2014), data cleaning (Chu et al., 2015), and many others. Incomplete (Imielinski and Jr., 1984) and probabilistic databases (Suciu et al., 2011) have emerged as a principled way to deal with uncertainty. Both types of databases consist of a set of deterministic instances called possible worlds that represent possible interpretations of data available about the real world. An often cited, conservative approach to uncertainty is to consider only certain answers (Abiteboul et al., 1991; Imielinski and Jr., 1984), i.e., answers that appear in all possible worlds. However, this approach has two problems. First, computing certain answers is expensive: it is coNP-complete in data complexity (Abiteboul et al., 1991; Imielinski and Jr., 1984) for first-order queries over V-tables (Imielinski and Jr., 1984), as well as for conjunctive queries over, e.g., OR-databases (Imielinski et al., 1995). Second, requiring answers to be certain may unnecessarily exclude useful, possible answers. Thus, users instead resort to what we term best-guess query processing (BGQP): making an educated guess about which possible world to use (i.e., how to interpret available data) and then working exclusively with this world. BGQP is more efficient than computing certain answers, and generally includes more useful results. However, information about uncertainty in the data is lost, and all query results produced by BGQP are consequently suspect.
Previous work has also explored approximations of certain answers (Libkin, 2016; Geerts et al., 2017; Reiter, 1986). Under the premise that missing a certain answer is better than incorrectly reporting an answer as certain, such work focuses on under-approximating certain answers. This addresses the performance problem, but under-approximations only exacerbate the problem of excluded results. Worse, these techniques are limited to specific uncertain data models such as V-tables, and with the exception of a brief discussion in (Guagliardo and Libkin, 2017), only support set semantics.
Example 1.
Geocoders translate natural language descriptions of locations into coordinates (i.e., latitude and longitude). Consider the ADDR and LOC relations in Figure 2. Tuples 2 and 3 of ADDR each have an ambiguous geocoding. This is an x-table (Agrawal et al., 2006), a type of incomplete data model where each tuple may have multiple alternatives. Each possible world is defined by some combination of alternatives (e.g., ADDR encodes 4 possible worlds). An analyst might use a spatial join with a lookup table (LOC) to map coordinates to geographic regions. Figure 2(a) shows the result of the following query in one world.
The certain answers to this query (Figure 2(b)) are tuples that appear in the result, regardless of which world is queried. Figure 2(c) shows all possible answers that could be returned for some choice of geocodings. Note also that ambiguous answers (e.g., address 2) may not be certain, but may still be useful.
Ideally, we would like an approach that (1) generalizes to a wide range of data models, (2) is easy to use like BGQP, (3) is compatible with a wide range of probabilistic and incomplete data representations (e.g., tuple-independent databases (Suciu et al., 2011), C-tables (Imielinski and Jr., 1984), and x-DBs (Agrawal et al., 2006)) and sources of uncertainty (e.g., inconsistent databases (Koutris and Wijsen, 2018; Kolaitis et al., 2013; Kolaitis and Pema, 2012; Bertossi, 2011; Fuxman and Miller, 2005; Arenas et al., 1999), imputation of missing values, and more), and (4) is principled like certain answers. We address the generality requirement (1) by rethinking incomplete data management in terms of Green et al.’s $\mathcal{K}$-database framework (Green et al., 2007). In this framework, each tuple is annotated with a value from a semiring $\mathcal{K}$. By choosing an appropriate semiring, $\mathcal{K}$-databases can encode a wide range of query processing semantics including classical set- and bag-semantics, as well as query processing with access control, provenance, and more. Our primary contribution here is to identify a natural, backwards-compatible generalization of certain answers to a broad class of $\mathcal{K}$-databases.
Our second major contribution is to combine an under-approximation of certain answers with best-guess query processing to create an Uncertainty-Annotated Database (UA-DB). A UA-DB is built around one distinguished possible world of an incomplete $\mathcal{K}$-database, for instance the “best-guess” world that would normally be used in practice. This world serves as an over-approximation of certain answers. Tuples from this world are labeled as either certain or uncertain to encode an under-approximation of certain answers. As illustrated in Figure 1, a UA-DB sandwiches the certain answers between under- and over-approximations. A lightweight (extensional (Suciu et al., 2011)) query evaluation semantics then propagates labels while preserving these approximations.
Example 2.
Continuing with Example 1, Figure 2(d) shows the result of the same query as a set UA-DB. When the UA-DB is built, one designated possible world of ADDR is selected, for example the highest ranked option provided by the geocoder. For this example, we select the first option for each ambiguous tuple. The result is based on this one designated possible world, which serves as an over-approximation of the certain answers. A subset of these tuples (addresses 1 and 4) are explicitly labeled as certain. This is the under-approximation: A tuple might still be certain even if it is not labeled as such. We consider the remaining tuples to be “uncertain”. In Figure 2(d), tuples 1 and 4 (resp., 2) are correctly marked as certain (resp., uncertain), while tuple 3 is mis-classified as uncertain even though it appears in all worlds. We stress that even a mislabeled certain answer is still present: a UA-DB sandwiches the certain answers.
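The sandwich property can be checked by brute force on a toy instance. The following is a minimal sketch, not the paper's implementation: the relation contents, the locale names, and the helper names (`ADDR`, `LOC`, `all_worlds`, `ua_query`) are hypothetical stand-ins for Figure 2, chosen so that address 3 behaves like the misclassified tuple described above.

```python
from itertools import product

# Hypothetical x-table: address id -> alternative locales.
# The first alternative plays the role of the geocoder's top-ranked option.
ADDR = {
    1: ["Lasalle"],
    2: ["Tucson", "Greenville"],
    3: ["Kingsley", "Lasalle"],
    4: ["Tucson"],
}
# Hypothetical deterministic lookup table: locale -> state.
LOC = {"Lasalle": "NY", "Kingsley": "NY", "Tucson": "AZ", "Greenville": "IN"}


def all_worlds():
    """Enumerate every possible world (one alternative per address)."""
    ids = sorted(ADDR)
    for choice in product(*(ADDR[i] for i in ids)):
        yield dict(zip(ids, choice))


def query(world):
    """Deterministic query: map each address to the state of its locale."""
    return {(i, LOC[locale]) for i, locale in world.items()}


# Set UA-DB input: tuples of the designated world, labeled certain (True)
# iff the address has a single alternative.
ua_input = {(i, alts[0]): len(alts) == 1 for i, alts in ADDR.items()}


def ua_query(ua):
    """Same query over the UA-DB; LOC is deterministic, so labels carry over."""
    return {(i, LOC[locale]): label for (i, locale), label in ua.items()}


ua_result = ua_query(ua_input)
certain = set.intersection(*(query(w) for w in all_worlds()))
labeled_certain = {t for t, label in ua_result.items() if label}
best_guess = set(ua_result)

# Under-approximation <= certain answers <= over-approximation.
assert labeled_certain <= certain <= best_guess
```

In this toy instance both alternatives of address 3 map to the same state, so its result tuple is certain but still labeled uncertain. It remains in the result, which is the behavior the example stresses.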
Figure 4 overviews our approach. We provide labeling schemes that derive a UA-DB from common incomplete data models. The resulting UA-DB bounds the certain tuples from above and below, a property preserved through queries. UA-DBs are both efficient and precise. We demonstrate efficiency by implementing a bag UA-DB as a query-rewriting front-end on top of a classical relational DBMS: UA-DB queries have minimal performance overhead compared to the same queries on deterministic data. We demonstrate precision both analytically and experimentally. First, under specific conditions, some of which we identify in Section 8, exactly the certain answers will be marked as certain. Second, we show experimentally that even when these conditions do not hold, the fraction of misclassified certain answers is low.

Importantly, a wide range of uncertain data models can be translated into UA-DBs through simple and efficient transformations that (i) determine a best-guess world (BGW) and (ii) obtain an under-approximation of the certain answers. We define such transformations for three popular models of incomplete data in Section 4: tuple-independent databases (Suciu et al., 2011), x-DBs (Agrawal et al., 2006) and C-tables (Imielinski and Jr., 1984). In classical incomplete databases, where probabilities are not available, any possible world can serve as a BGW. In probabilistic databases (or any incomplete data model that ranks possible worlds), we preferentially use the possible world with the highest probability (if computationally feasible), or an approximation thereof. We emphasize that our approach does not require enumerating (or even knowing) the full set of possible worlds. As long as some possible world can be obtained, our approach is applicable. In the worst case, if no certainty information is available, our approach labels all tuples as uncertain and degrades to classical best-guess query processing.
Furthermore, our approach is also applicable in use cases like inconsistent query answering (Arenas et al., 1999) where possible worlds are defined declaratively (e.g., all repairs of an inconsistent database).
We significantly extend the state-of-the-art on under-approximating certain answers (Libkin, 2016; Geerts et al., 2017; Reiter, 1986): (1) we combine an under-approximation with best-guess query processing bounding certain answers from above and below; (2) we support sets, bags, and any other data model expressible as semiring annotations from a large class of semirings; (3) we support translation of a wide range of incomplete and probabilistic data models into our UA-DB model; (4) in contrast to certain answers, UA-DBs are closed under queries.
The remainder of the paper is organized as follows.
Incomplete $\mathcal{K}$-Relations. (Section 3) We introduce incomplete $\mathcal{K}$-databases, generalizing incomplete databases to $\mathcal{K}$-relations (Green et al., 2007). We then define certain annotations as a natural extension of certain answers, based on the observation that certain answers are a lower bound on the content of a world. It is thus natural to define certainty based on a greatest-lower-bound operation (GLB) for semiring annotations based on so-called l-semirings (Kostylev and Buneman, 2012) where the GLB is well behaved. We show that certain annotations correspond to the classical notion of certain answers for set (Lipski, 1979) and bag (Guagliardo and Libkin, 2017) semantics.
UA-DBs. (Section 5) We define UA-DBs as databases that annotate tuples with pairs of annotations from a semiring $\mathcal{K}$. The annotation of a tuple in a UA-DB bounds the certain annotation of the tuple from above and below. This is achieved by combining the annotations from one world (the over-approximation) with an under-approximation that we call an uncertainty labeling. Relying on results for under-approximations that we develop in the following sections, we prove that queries over UA-DBs preserve these bounds.
Under-approximating Certain Answers. (Section 6) To better understand under-approximations, we define uncertainty labelings, which are $\mathcal{K}$-relations that under-approximate the set of certain tuples for an incomplete $\mathcal{K}$-database. An uncertainty labeling is certain- or c-sound (resp., c-complete) if it is a lower (resp., upper) bound on the certain annotations of tuples in a $\mathcal{K}$-relation; and c-correct if it is both. We also extend these definitions to query semantics. A query semantics preserves c-soundness if the result of the query is a c-sound labeling for the result of evaluating the query over the input $\mathcal{K}$-database from which the labeling was derived.
Queries over Uncertainty Labelings. (Section 7) Since labelings are $\mathcal{K}$-relations, we can evaluate queries over such labelings. We demonstrate that evaluating queries in this fashion preserves under-approximations of certain answers, generalizing a previous result for V-tables due to Reiter (Reiter, 1986). That is, queries preserve c-soundness. Furthermore, under certain conditions this query semantics returns precisely the certain answers. That is, since all queries preserve c-soundness, under these conditions queries preserve c-correctness.
Implementation for Bag Semantics. (Section 9) We implement UA-DBs on top of a relational DBMS. We extend the schema of relations to label tuples as certain or uncertain (e.g., Figure 2(d)). Queries with UA-relational semantics are compiled into standard relational queries over this encoding. We prove this compilation process to be correct.
Performance. (Section 11) We demonstrate experimentally that UA-DBs outperform state-of-the-art incomplete and probabilistic query processing schemes, and are competitive with deterministic query evaluation and other certain answer under-approximations. Furthermore, for a wide range of real world datasets, comparatively few answers are misclassified by our approach. We also demonstrate that best-guess answers and, hence, also UA-DBs, can have higher utility than certain answers. Finally, we demonstrate the use of UA-DBs for uncertain access control annotations and bag semantics.
2. Notation and Background
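A database schema $\mathbf{D} = \{\mathbf{R}_1, \ldots, \mathbf{R}_n\}$ is a set of relation schemas. A relation schema $\mathbf{R}(A_1, \ldots, A_m)$ consists of a relation name and a set of attribute names $A_1, \ldots, A_m$. The arity $arity(\mathbf{R})$ of a relation schema $\mathbf{R}$ is the number of attributes in $\mathbf{R}$. An instance $D$ for database schema $\mathbf{D}$ is a set of relation instances with one relation for each relation schema in $\mathbf{D}$: $D = \{R_1, \ldots, R_n\}$. Assume a universal domain of attribute values $\mathbb{D}$. A tuple with schema $\mathbf{R}$ is an element from $\mathbb{D}^{arity(\mathbf{R})}$. In this paper, we consider both bag and set semantics. A set relation $R$ with schema $\mathbf{R}$ is a set of tuples with schema $\mathbf{R}$, i.e., $R \subseteq \mathbb{D}^{arity(\mathbf{R})}$. A bag relation $R$ with schema $\mathbf{R}$ is a bag (multiset) of tuples with schema $\mathbf{R}$. We use TupDom to denote the set of all tuples over domain $\mathbb{D}$.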
2.1. Possible Worlds Semantics
Incomplete and probabilistic databases model uncertainty and its impact on query results. An incomplete database $\mathcal{D}$ is a set of deterministic database instances $D_1, \ldots, D_n$ of schema $\mathbf{D}$, called possible worlds. We write $t \in D$ to denote that a tuple $t$ appears in a specific possible world $D$.
Example 3.
Decades of research (Suciu et al., 2011; Imielinski and Jr., 1984; Green and Tannen, 2006; Antova et al., 2007; Boulos et al., 2005; Agrawal et al., 2006) have focused on query processing over incomplete databases. These techniques commonly adopt the “possible worlds” semantics: the result of evaluating a query $Q$ over an incomplete database $\mathcal{D}$ is the set of relations resulting from evaluating $Q$ over each possible world individually using deterministic semantics, i.e., $Q(\mathcal{D}) = \{ Q(D) \mid D \in \mathcal{D} \}$.
Example 4.
2.2. Certain and Best-Guess Answers
An important goal of query processing over incomplete databases is to differentiate query results that are certain from ones that are merely possible. Formally, a tuple is certain if it appears in every possible world (Imielinski and Jr., 1984; Lipski, 1979):
\[ certain(\mathcal{D}) = \{ t \mid \forall D \in \mathcal{D} : t \in D \} \]
In contrast to (Imielinski and Jr., 1984), which studies certain answers to queries, we define certainty at the instance level. These approaches are equivalent since we can compute the certain answers of query $Q$ over incomplete instance $\mathcal{D}$ as $certain(Q(\mathcal{D}))$. Although computing certain answers is coNP-hard (Abiteboul et al., 1991) in general, there exist PTIME under-approximations (Libkin, 2016; Guagliardo and Libkin, 2016; Reiter, 1986).
Best Guess Query Processing. As mentioned in the introduction, another approach commonly used in practice is to select one possible world. Queries are evaluated solely in this world, and ambiguity is ignored or documented outside of the database. We refer to this approach as best-guess query processing (BGQP) (Yang et al., 2015) since typically one would like to select the possible world that is deemed most likely.
Our generalization of incomplete databases is based on the $\mathcal{K}$-relation (Green et al., 2007) framework. In this framework, relations are annotated with elements from the domain $K$ of a (commutative) semiring $\mathcal{K} = (K, \oplus_{\mathcal{K}}, \otimes_{\mathcal{K}}, 0_{\mathcal{K}}, 1_{\mathcal{K}})$. A commutative semiring is a structure with commutative and associative addition ($\oplus_{\mathcal{K}}$) and product ($\otimes_{\mathcal{K}}$) operations where $\otimes_{\mathcal{K}}$ distributes over $\oplus_{\mathcal{K}}$. As before, $\mathbb{D}$ denotes a universal domain. An $n$-ary $\mathcal{K}$-relation is a function that maps tuples (elements from $\mathbb{D}^n$) to elements from $K$. Tuples that are not in the relation are annotated with $0_{\mathcal{K}}$. Only finitely many tuples may be mapped to an element other than $0_{\mathcal{K}}$ (i.e., relations must be finite). Since $\mathcal{K}$-relations are functions from tuples to annotations, it is customary to denote the annotation of a tuple $t$ in relation $R$ as $R(t)$. The specific information encoded by an annotation depends on the choice of semiring.
Encoding Sets and Bags. Green et al. (Green et al., 2007) demonstrated that bag and set relations can be encoded as $\mathcal{K}$-relations over commutative semirings: the natural numbers ($\mathbb{N}$) with addition and multiplication, $(\mathbb{N}, +, \times, 0, 1)$, annotate each tuple with its multiplicity; and the boolean constants ($\mathbb{B}$) with disjunction and conjunction, $(\mathbb{B}, \vee, \wedge, \mathit{false}, \mathit{true})$, annotate each tuple with its set membership. Abusing notation, we denote by $\mathbb{N}$ and $\mathbb{B}$ both the domain and the corresponding semiring.
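Query Semantics. Operators of the positive relational algebra ($\mathcal{RA}^{+}$) over $\mathcal{K}$-relations are defined by combining input annotations using operations $\oplus_{\mathcal{K}}$ and $\otimes_{\mathcal{K}}$.
\[
\begin{aligned}
\text{Union:} \quad & (R_1 \cup R_2)(t) = R_1(t) \oplus_{\mathcal{K}} R_2(t) \\
\text{Join:} \quad & (R_1 \bowtie R_2)(t) = R_1(t[\mathbf{R}_1]) \otimes_{\mathcal{K}} R_2(t[\mathbf{R}_2]) \\
\text{Projection:} \quad & (\pi_U(R))(t) = \sum_{t' : t = t'[U]} R(t') \\
\text{Selection:} \quad & (\sigma_\theta(R))(t) = R(t) \otimes_{\mathcal{K}} \theta(t)
\end{aligned}
\]
Here $t[\mathbf{R}_i]$ and $t'[U]$ denote the restriction of a tuple to the attributes of $\mathbf{R}_i$ and $U$, respectively, and the sum in the projection is taken with $\oplus_{\mathcal{K}}$. For simplicity we assume in the definition above that tuples are of a compatible schema (e.g., $\mathbf{R}_1 = \mathbf{R}_2$ for a union $R_1 \cup R_2$). We use $\theta(t)$ to denote a function that returns $1_{\mathcal{K}}$ iff $\theta$ evaluates to true over tuple $t$ and $0_{\mathcal{K}}$ otherwise.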
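To make the operator semantics concrete, here is a small sketch of annotated relations as Python dictionaries from tuples to annotations. It follows the standard definitions of Green et al. (2007). The `Semiring` class, the function names, the positional equi-join, and the sample `ADDR`/`LOC` data are illustrative assumptions, not part of the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    zero: Any
    one: Any
    add: Callable[[Any, Any], Any]
    mul: Callable[[Any, Any], Any]


NAT = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)             # bags
BOOL = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)  # sets


def _prune(K, rel):
    """Drop tuples annotated with zero (they are 'not in the relation')."""
    return {t: k for t, k in rel.items() if k != K.zero}


def union(K, r, s):
    out = dict(r)
    for t, k in s.items():
        out[t] = K.add(out.get(t, K.zero), k)
    return _prune(K, out)


def select(K, r, theta):
    # theta(t) is mapped to one/zero and multiplied with the annotation.
    return _prune(K, {t: K.mul(k, K.one if theta(t) else K.zero)
                      for t, k in r.items()})


def project(K, r, positions):
    out = {}
    for t, k in r.items():
        key = tuple(t[i] for i in positions)
        out[key] = K.add(out.get(key, K.zero), k)  # sum over merged tuples
    return _prune(K, out)


def join(K, r, s, on):
    """Equi-join on (left position, right position) pairs; output is t1 + t2."""
    out = {}
    for t1, k1 in r.items():
        for t2, k2 in s.items():
            if all(t1[i] == t2[j] for i, j in on):
                key = t1 + t2
                out[key] = K.add(out.get(key, K.zero), K.mul(k1, k2))
    return _prune(K, out)


# Hypothetical bag instance: every tuple has multiplicity 1.
ADDR = {(1, "Lasalle"): 1, (2, "Lasalle"): 1, (3, "Tucson"): 1}
LOC = {("Lasalle", "NY"): 1, ("Tucson", "AZ"): 1}

# States of all addresses: join on locale, then project on state.
states = project(NAT, join(NAT, ADDR, LOC, on=[(1, 0)]), positions=[3])
```

Passing `BOOL` instead of `NAT` (with `True` annotations) yields set semantics from the same operator code, which is the main appeal of the framework.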
Example 5.
Figure 7 shows a bag semantics database encoded as an $\mathbb{N}$-database, with each tuple $t$ annotated with its multiplicity (the number of copies of $t$ in the relation). Annotations appear beside each tuple. Query $Q$, below, computes states.
Every input tuple appears once (is annotated with $1$). The output tuple annotation is computed by multiplying annotations of joined tuples, and summing annotations projected onto the same result tuple. For instance, 2 NY addresses are returned.
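In the following, we will make use of homomorphisms. A mapping $h: \mathcal{K} \to \mathcal{K}'$ from a semiring $\mathcal{K}$ to a semiring $\mathcal{K}'$ is called a homomorphism if it maps $0_{\mathcal{K}}$ and $1_{\mathcal{K}}$ to their counterparts in $\mathcal{K}'$ and distributes over sum and product (e.g., $h(k \oplus_{\mathcal{K}} k') = h(k) \oplus_{\mathcal{K}'} h(k')$). As observed by Green et al. (Green et al., 2007), any semiring homomorphism $h: \mathcal{K} \to \mathcal{K}'$ can be lifted to a homomorphism from $\mathcal{K}$-relations to $\mathcal{K}'$-relations by applying $h$ to the annotation of every tuple $t$: $h(R)(t) = h(R(t))$. Importantly, queries commute with semiring homomorphisms. That is, given a homomorphism $h$, query $Q$, and $\mathcal{K}$-database $D$ we have $h(Q(D)) = Q(h(D))$. We will abuse syntax and use the same function symbols (e.g., $h(\cdot)$) to denote mappings between semirings, $\mathcal{K}$-relations, as well as $\mathcal{K}$-databases.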
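Example 6.
Continuing Example 5, we can derive a set instance through a mapping $h: \mathbb{N} \to \mathbb{B}$ defined as $h(k) = \mathit{true}$ if $k > 0$ and $h(k) = \mathit{false}$ otherwise. $h$ is a semiring homomorphism, so evaluating $Q$ in $\mathbb{N}$ first and then applying $h$ (i.e., $h(Q(D))$) is equivalent to applying $h$ first, and then evaluating $Q$ (i.e., $Q(h(D))$).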
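The commutation property can be checked directly on a small instance. The sketch below hard-codes one join-project query over dictionary-encoded relations. The data and the names `q_states`, `h`, and `lift` are hypothetical, since the contents of Figure 7 are not reproduced here.

```python
def q_states(addr, loc, add, mul, zero):
    """pi_state(ADDR join LOC), parameterized by the semiring operations."""
    out = {}
    for (_addr_id, locale), a in addr.items():
        for (loc_locale, state), b in loc.items():
            if locale == loc_locale:
                out[(state,)] = add(out.get((state,), zero), mul(a, b))
    return {t: k for t, k in out.items() if k != zero}


def h(k):
    """Homomorphism from bags (naturals) to sets (booleans)."""
    return k > 0


def lift(rel):
    """Apply h to every annotation; tuples mapped to False are dropped."""
    return {t: h(k) for t, k in rel.items() if h(k)}


# Hypothetical bag relations (annotations are multiplicities).
addr = {(1, "Lasalle"): 1, (2, "Lasalle"): 1, (3, "Tucson"): 1}
loc = {("Lasalle", "NY"): 1, ("Tucson", "AZ"): 1}

bag_result = q_states(addr, loc, lambda a, b: a + b, lambda a, b: a * b, 0)

# h(Q(D)): evaluate under bag semantics, then map to sets.
query_then_map = lift(bag_result)
# Q(h(D)): map the input to sets, then evaluate under set semantics.
map_then_query = q_states(lift(addr), lift(loc),
                          lambda a, b: a or b, lambda a, b: a and b, False)

assert query_then_map == map_then_query
```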
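When defining bounds for annotations in Section 3, we make use of the so-called natural order $\preceq_{\mathcal{K}}$ for a semiring $\mathcal{K}$, defined as an element $k$ preceding $k'$ if it is possible to obtain $k'$ by adding to $k$, i.e., $k \preceq_{\mathcal{K}} k' \Leftrightarrow \exists k'' \in K : k \oplus_{\mathcal{K}} k'' = k'$. Semirings for which the natural order is a partial order are called naturally ordered (Geerts and Poggi, 2010).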
3. Incomplete K-relations
Many incomplete data models either do not support bag semantics, or distinguish it from set semantics. Our first contribution unifies both under a joint framework. Recall that an incomplete database is a set of deterministic databases (possible worlds). We now generalize this idea to $\mathcal{K}$-databases.
Definition 1 (Incomplete $\mathcal{K}$-database).
Let $\mathcal{K}$ be a semiring. An incomplete $\mathcal{K}$-database $\mathcal{D}$ is a set of possible worlds $\mathcal{D} = \{D_1, \ldots, D_n\}$ where each $D_i$ is a $\mathcal{K}$-database.
Like classical incomplete databases, queries over an incomplete $\mathcal{K}$-database use possible world semantics, i.e., the result of evaluating a query $Q$ over an incomplete $\mathcal{K}$-database $\mathcal{D}$ is the set of all possible worlds derived by evaluating $Q$ over every possible world $D \in \mathcal{D}$ (i.e., $Q(\mathcal{D}) = \{ Q(D) \mid D \in \mathcal{D} \}$).
3.1. Certain Annotations
While possible worlds semantics are directly compatible with incomplete $\mathcal{K}$-databases, the same does not hold for the concepts of certain and possible tuples, as we will show in the following. First, we have to define precisely what we mean by certain answers over possible worlds that are $\mathcal{K}$-databases.
Example 7.
Consider an $\mathbb{N}$-database $\mathcal{D}$ (bag semantics) containing a relation LOC with two attributes locale and state. Assume that $\mathcal{D}$ consists of the two possible worlds below:

LOC in $D_1$
locale       state   annotation
Lasalle      NY      3
Tucson       AZ      2

LOC in $D_2$
locale       state   annotation
Lasalle      NY      2
Tucson       AZ      1
Greenville   IN      5

Using semiring $\mathbb{N}$, each tuple in a possible world is annotated with its multiplicity (the number of copies of the tuple that exist in the possible world). All tuples not shown in the tables are assumed to be annotated with zero. Arguably, tuples (Lasalle, NY) and (Tucson, AZ) are certain since they appear (multiplicity higher than $0$) in both possible worlds while (Greenville, IN) is not since it is not present (its multiplicity is zero) in possible world $D_1$. However, the boolean interpretation of certainty of incomplete databases is not suited to $\mathbb{N}$-relations (or $\mathcal{K}$-relations in general) because it ignores the annotations of tuples. In this particular example, tuple (Lasalle, NY) appears with multiplicity $3$ in possible world $D_1$ and multiplicity $2$ in possible world $D_2$. We can state with certainty that in every possible world this tuple appears at least twice. Thus, $2$ is a lower bound (the greatest lower bound) for the annotation of (Lasalle, NY). Following this logic, we will define certainty through greatest lower bounds (GLBs) on tuple annotations.
To further justify defining certain answers as lower bounds on annotations, consider classical incomplete databases which apply set semantics. Under set semantics, a tuple is certain if it appears in all possible worlds and possible if it appears in at least one possible world. Like the bag semantics example above, certainty (possibility) is a lower (upper) bound on a tuple’s annotation across all worlds. Consider the order $\mathit{false} \prec \mathit{true}$. If a tuple exists in every possible world (is always annotated true), then intuitively, the GLB of its annotation across all worlds is true. Otherwise, the tuple is not certain (is annotated false in at least one world), and the GLB is false.
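To define a sensible lower bound for annotations, we need an order relation for semiring elements. We use the natural order $\preceq_{\mathcal{K}}$ as introduced in Section 2.3 to define the GLB and LUB of a set of $\mathcal{K}$-elements. For a well-defined GLB, we require that $\preceq_{\mathcal{K}}$ forms a lattice over $K$, a property that makes $\mathcal{K}$ an l-semiring (Kostylev and Buneman, 2012). A lattice over a set $S$ with a partial order $\preceq$ is a structure $(S, \sqcup, \sqcap)$ where $\sqcap$ (the greatest lower bound) and $\sqcup$ (the lowest upper bound) are operations over $S$ defined for all $a, b \in S$ as:
\[ a \sqcap b = \max\nolimits_{\preceq} \{ c \in S \mid c \preceq a \wedge c \preceq b \} \]
The least upper bound is defined symmetrically.
In a lattice, $\sqcup$ and $\sqcap$ are associative, commutative, and fulfill the absorption laws
\[ a \sqcup (a \sqcap b) = a \qquad a \sqcap (a \sqcup b) = a \]
We will use $\sqcap_{\mathcal{K}}$ and $\sqcup_{\mathcal{K}}$ to denote the $\sqcap$ and $\sqcup$ operation of the lattice over $\preceq_{\mathcal{K}}$ for a semiring $\mathcal{K}$. Abusing notation, we will apply the $\sqcap_{\mathcal{K}}$ and $\sqcup_{\mathcal{K}}$ operations to sets of elements from $K$ with the understanding that they will be applied iteratively to the elements in the set in some order, e.g., $\sqcap_{\mathcal{K}} \{k_1, k_2, k_3\} = (k_1 \sqcap_{\mathcal{K}} k_2) \sqcap_{\mathcal{K}} k_3$. This is well-defined for l-semirings, since in a lattice any finite, non-empty set of elements has a unique greatest lower bound and lowest upper bound based on the associativity and commutativity laws of lattices. That is, no matter in which order we apply $\sqcap_{\mathcal{K}}$ to the elements of a set, the result will be the same. From here on, we will limit our discussion to l-semirings. Many semirings, including the set semiring $\mathbb{B}$ and the bag semiring $\mathbb{N}$, are l-semirings. The natural order of $\mathbb{B}$ is $\mathit{false} \preceq_{\mathbb{B}} \mathit{true}$, with $k_1 \sqcup_{\mathbb{B}} k_2 = k_1 \vee k_2$ and $k_1 \sqcap_{\mathbb{B}} k_2 = k_1 \wedge k_2$. The natural order of $\mathbb{N}$ is the standard order of natural numbers, with $k_1 \sqcup_{\mathbb{N}} k_2 = \max(k_1, k_2)$ and $k_1 \sqcap_{\mathbb{N}} k_2 = \min(k_1, k_2)$.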
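Based on $\sqcap_{\mathcal{K}}$ and $\sqcup_{\mathcal{K}}$, we define the certain and possible annotation of a tuple $t$ in an incomplete $\mathcal{K}$-database $\mathcal{D}$ by gathering the annotations of tuple $t$ from all possible worlds of $\mathcal{D}$ and then applying $\sqcap_{\mathcal{K}}$ to compute the greatest lower bound (resp., $\sqcup_{\mathcal{K}}$ to compute the lowest upper bound):
\[ cert_{\mathcal{K}}(\mathcal{D}, t) = \sqcap_{\mathcal{K}} \{ D(t) \mid D \in \mathcal{D} \} \qquad poss_{\mathcal{K}}(\mathcal{D}, t) = \sqcup_{\mathcal{K}} \{ D(t) \mid D \in \mathcal{D} \} \]
Importantly, the GLB coincides with the standard definition of certain answers for set semantics ($\mathbb{B}$): $cert_{\mathbb{B}}$ returns true only when the tuple is present in all worlds. We also note that $cert_{\mathbb{N}}$ is analogous to the definition of certain answers for bag semantics from (Guagliardo and Libkin, 2016). For instance, consider the certain annotation of the first tuple from Example 7. The tuple’s certain multiplicity is $\min(3, 2) = 2$. Similarly, for the third tuple, $\min(0, 5) = 0$. Reinterpreted under set semantics, all tuples that exist (multiplicity $> 0$) are annotated $\mathit{true}$ and all others $\mathit{false}$. For the first tuple we get $\mathit{true} \wedge \mathit{true} = \mathit{true}$ (certain). For the third tuple we get $\mathit{false} \wedge \mathit{true} = \mathit{false}$ (not certain).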
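As a quick check of these numbers, the following sketch computes certain and possible annotations for the two worlds of Example 7, using `min`/`max` as GLB/LUB for bags and conjunction for sets. The function names (`bound`, `cert_bag`, `poss_bag`, `cert_set`) are illustrative choices, not the paper's notation.

```python
from functools import reduce

# The two possible worlds of Example 7 (tuple -> multiplicity).
WORLDS = [
    {("Lasalle", "NY"): 3, ("Tucson", "AZ"): 2},
    {("Lasalle", "NY"): 2, ("Tucson", "AZ"): 1, ("Greenville", "IN"): 5},
]


def bound(worlds, t, combine, zero):
    """Fold a lattice operation over the annotations of t in all worlds."""
    return reduce(combine, (w.get(t, zero) for w in worlds))


def cert_bag(worlds, t):
    return bound(worlds, t, min, 0)   # GLB in the naturals is min


def poss_bag(worlds, t):
    return bound(worlds, t, max, 0)   # LUB in the naturals is max


def to_set(world):
    """Reinterpret a bag world under set semantics (multiplicity > 0)."""
    return {t: True for t, k in world.items() if k > 0}


def cert_set(worlds, t):
    # GLB in the booleans is conjunction.
    return bound([to_set(w) for w in worlds], t, lambda a, b: a and b, False)


print(cert_bag(WORLDS, ("Lasalle", "NY")))     # 2
print(cert_bag(WORLDS, ("Greenville", "IN")))  # 0
print(cert_set(WORLDS, ("Greenville", "IN")))  # False
```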
For the formal exposition in the remainder of this work it will be useful to define an alternative, but equivalent, encoding of an incomplete $\mathcal{K}$-database as a single $\mathcal{K}^{W}$-database using a special class of semirings $\mathcal{K}^{W}$ whose elements encode the annotation of a tuple across a set of possible worlds. This encoding is a technical device that allows us to apply results from the theory of $\mathcal{K}$-relations directly to our problem. We assume a fixed set of possible world identifiers $W = \{1, \ldots, n\}$ for some number of possible worlds $n$. Given the domain $K$ of a semiring $\mathcal{K}$, we write $K^{W}$ to denote the set of elements from the $n$-way cross-product of $K$. We annotate tuples $t$ with elements of $K^{W}$ to store annotations of $t$ in each possible world. We use $\vec{k}$, $\vec{k}_1$, … to denote elements from $K^{W}$ to make explicit that they are vectors.
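Definition 2 (Possible World Semiring).
Let $\mathcal{K}$ be an l-semiring. We define the possible world semiring $\mathcal{K}^{W} = (K^{W}, \oplus_{\mathcal{K}^{W}}, \otimes_{\mathcal{K}^{W}}, 0_{\mathcal{K}^{W}}, 1_{\mathcal{K}^{W}})$. The operations of this semiring are defined pointwise as follows: for all $i \in W$,
\[ 0_{\mathcal{K}^{W}}[i] = 0_{\mathcal{K}} \qquad 1_{\mathcal{K}^{W}}[i] = 1_{\mathcal{K}} \]
\[ (\vec{k}_1 \oplus_{\mathcal{K}^{W}} \vec{k}_2)[i] = \vec{k}_1[i] \oplus_{\mathcal{K}} \vec{k}_2[i] \qquad (\vec{k}_1 \otimes_{\mathcal{K}^{W}} \vec{k}_2)[i] = \vec{k}_1[i] \otimes_{\mathcal{K}} \vec{k}_2[i] \]
Thus, a $\mathcal{K}^{W}$-database is simply a pivoted representation of an incomplete $\mathcal{K}$-database.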
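Example 8.
Reconsider the incomplete $\mathbb{N}$-relation from Example 7. The encoding of this database as an $\mathbb{N}^{W}$-relation is:

locale       state   annotation
Lasalle      NY      [3, 2]
Tucson       AZ      [2, 1]
Greenville   IN      [0, 5]

Translating between incomplete $\mathcal{K}$-databases and $\mathcal{K}^{W}$-databases is trivial. Given an incomplete $\mathcal{K}$-database with possible worlds $D_1, \ldots, D_n$, we create the corresponding $\mathcal{K}^{W}$-database by annotating each tuple $t$ with the vector $[D_1(t), \ldots, D_n(t)]$. In the other direction, given a $\mathcal{K}^{W}$-database $\mathcal{D}$ with vectors of length $n$, we construct the corresponding incomplete $\mathcal{K}$-database by annotating each tuple $t$ with $\mathcal{D}(t)[i]$ in possible world $D_i$. In addition, we will show below that queries over $\mathcal{K}^{W}$-databases encode possible world semantics. Thus, the following result holds and we can use incomplete $\mathcal{K}$-databases and $\mathcal{K}^{W}$-databases interchangeably.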
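The pivoting in both directions is short enough to sketch. The code below assumes bag semantics and dictionary-encoded relations. The function names (`to_vectors`, `to_worlds`, `vadd`, `vmul`) are hypothetical, and world identifiers are 0-based here rather than 1-based.

```python
# The two possible worlds of Example 7 (tuple -> multiplicity).
WORLDS = [
    {("Lasalle", "NY"): 3, ("Tucson", "AZ"): 2},
    {("Lasalle", "NY"): 2, ("Tucson", "AZ"): 1, ("Greenville", "IN"): 5},
]


def to_vectors(worlds, zero=0):
    """Pivot a list of worlds into one relation with vector annotations."""
    tuples = set().union(*worlds)  # every tuple occurring in some world
    return {t: tuple(w.get(t, zero) for w in worlds) for t in tuples}


def to_worlds(vectors, n, zero=0):
    """Unpivot: world i keeps component i of every annotation vector."""
    return [{t: v[i] for t, v in vectors.items() if v[i] != zero}
            for i in range(n)]


def vadd(u, v):
    """Pointwise addition of two annotation vectors."""
    return tuple(a + b for a, b in zip(u, v))


def vmul(u, v):
    """Pointwise multiplication of two annotation vectors."""
    return tuple(a * b for a, b in zip(u, v))


pivoted = to_vectors(WORLDS)
# Round trip: unpivoting the pivoted relation recovers the original worlds.
assert to_worlds(pivoted, n=2) == WORLDS
```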
Proposition 1.
Incomplete $\mathcal{K}$-databases and $\mathcal{K}^{W}$-databases are isomorphic wrt. possible worlds semantics for queries.
Observe that $\mathcal{K}^{W}$ is a semiring, since we define $\mathcal{K}^{W}$ using the $n$-way version of the product operation of universal algebra, and products of semirings are also semirings (Burris and Sankappanavar, 2012).
Possible Worlds. We can extract the $\mathcal{K}$-database for a possible world (e.g., the best-guess world) from a $\mathcal{K}^{W}$-database by projecting on one dimension of its annotations. This can be modeled as a mapping $pw_i : K^{W} \to K$ where $i \in W$:
\[ pw_i(\vec{k}) = \vec{k}[i] \]
Recall that under possible world semantics, the result of a query $Q$ is the set of worlds computed by evaluating $Q$ over each world of the input. As a sanity check, we would like to ensure that query processing over $\mathcal{K}^{W}$-relations matches this definition. We can state possible world semantics equivalently as follows: the content of a possible world in the query result ($pw_i(Q(\mathcal{D}))$) is the result of evaluating query $Q$ over this possible world in the input ($Q(pw_i(\mathcal{D}))$). That is, $\mathcal{K}^{W}$-relations have possible worlds semantics iff $pw_i$ commutes with queries:
\[ pw_i(Q(\mathcal{D})) = Q(pw_i(\mathcal{D})) \]
Recall from Section 2.3 that a mapping between semirings commutes with queries iff it is a semiring homomorphism. Note that $\mathcal{K}^{W}$-relations admit a trivial extension to probabilistic data by defining a distribution $P : W \to [0, 1]$. See ([Anonymized], [n. d.]) for details.
Lemma 1.
For any semiring $\mathcal{K}$ and possible world identifier $i \in W$, mapping $pw_i$ is a semiring homomorphism.
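The lemma can be sanity-checked numerically for bag semantics. The sketch below tests the homomorphism conditions on sample vectors and then confirms that extracting a world commutes with a simple projection query. The relation contents and helper names are hypothetical, and world identifiers are 0-based.

```python
def vadd(u, v):
    return tuple(a + b for a, b in zip(u, v))


def vmul(u, v):
    return tuple(a * b for a, b in zip(u, v))


def pw(vec, i):
    """Project an annotation vector onto possible world i."""
    return vec[i]


def pw_rel(rel, i):
    """Lift pw to relations (zero annotations are kept for simplicity)."""
    return {t: pw(v, i) for t, v in rel.items()}


def project_state(rel, add, zero):
    """pi_state over a (locale, state) relation for a given addition."""
    out = {}
    for (_locale, state), k in rel.items():
        out[(state,)] = add(out.get((state,), zero), k)
    return out


# Hypothetical relation over two worlds with vector annotations.
LOC_W = {
    ("Lasalle", "NY"): (3, 2),
    ("Buffalo", "NY"): (1, 0),
    ("Tucson", "AZ"): (2, 1),
}

u, v = (3, 2), (1, 0)
for i in range(2):
    # pw distributes over pointwise sum and product ...
    assert pw(vadd(u, v), i) == pw(u, i) + pw(v, i)
    assert pw(vmul(u, v), i) == pw(u, i) * pw(v, i)
    # ... maps the zero and one vectors to 0 and 1 ...
    assert pw((0, 0), i) == 0 and pw((1, 1), i) == 1
    # ... and therefore commutes with the query.
    query_then_extract = pw_rel(project_state(LOC_W, vadd, (0, 0)), i)
    extract_then_query = project_state(pw_rel(LOC_W, i), lambda a, b: a + b, 0)
    assert query_then_extract == extract_then_query
```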
Probabilistic Data. $\mathcal{K}^{W}$-relations admit a trivial extension to probabilistic data by defining a distribution $P : W \to [0, 1]$ such that $\sum_{i \in W} P(i) = 1$. In contrast to classical frameworks for possible worlds, where the collection of worlds is a set, queries preserve the same possible worlds. (Although it has no impact on our results, it is worth noting that the worlds in a query result may not be distinct.) Hence, the input distribution applies, unchanged, to the possible query outputs.
Certain and Possible Annotations. Since the annotation of a tuple $t$ in a $\mathcal{K}^{W}$-database is a vector recording $t$’s annotations in all worlds, certain annotations for incomplete $\mathcal{K}$-databases are computed by applying $\sqcap_{\mathcal{K}}$ to the set of annotations contained in the vector. Thus, the certain annotation of a tuple $t$ from a $\mathcal{K}^{W}$-DB $\mathcal{D}$ is computed as:
\[ cert_{\mathcal{K}}(\mathcal{D}, t) = \sqcap_{\mathcal{K}} \{ pw_i(\mathcal{D}(t)) \mid i \in W \} \]
4. Labeling Schemes
We define efficient (PTIME) labeling schemes for three existing incomplete data models: Tuple-Independent databases (Suciu et al., 2011), the disjoint-independent x-relation model from (Agrawal et al., 2006), and C-Tables (Imielinski and Jr., 1984). We also show how to extract a best-guess world from an -database derived from these models. Since computing certain answers is hard in general, our PTIME labeling schemes cannot be c-correct for all models.
4.1. Labeling Schemes
Tuple-Independent Databases. A tuple-independent database (TI-DB) $\mathcal{D}$ is a database where each tuple $t$ is marked as optional or not. The incomplete database represented by a TI-DB $\mathcal{D}$ is the set of instances that include all non-optional tuples and some subset of the optional tuples. That is, the existence of a tuple $t$ is independent of the existence of any other tuple $t'$. In the probabilistic version of TI-DBs each tuple is associated with its marginal probability $P(t)$. The probability of a possible world is then the product of the probability of all tuples included in the world multiplied by the product of $1 - P(t)$ for all tuples $t$ from $\mathcal{D}$ that are not part of the possible world. We define a labeling function $label_{TI\text{-}DB}$ for TI-DBs that returns a $\mathbb{B}$-labeling that annotates a tuple with $\mathit{true}$ (certain) iff it is not optional. For probabilistic TI-DBs we label tuples as certain if their marginal probability is $1$.
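Theorem 1 (The TI-DB labeling scheme is c-correct).
Given a TI-DB $\mathcal{D}$, the labeling that this scheme produces for $\mathcal{D}$ is c-correct.
Trivially holds: a tuple of an incomplete (probabilistic) TI-DB is certain iff it is not optional (iff its marginal probability is $1$). ∎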
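The labeling rule for probabilistic TI-DBs can be sketched in a few lines of Python. The sketch assumes that a TI-DB is given as a dictionary from tuples to marginal probabilities; this representation and the name `label_tidb` are illustrative assumptions.

```python
def label_tidb(tidb):
    """Label each tuple as certain (True) iff its marginal probability is 1.

    `tidb` maps tuples to marginal probabilities (assumed representation).
    """
    return {t: p == 1.0 for t, p in tidb.items()}


# Hypothetical TI-DB with one certain and one optional tuple.
tidb = {("alice", "NY"): 1.0, ("bob", "LA"): 0.7}
labeling = label_tidb(tidb)
```

For the incomplete (non-probabilistic) version, the same idea applies with the optional flag taking the place of the probability test.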
C-tables. C-tables (Imielinski and Lipski, 1984) use a set of variable symbols to define possible worlds. Tuples are annotated by a boolean expression over comparisons between variables and constants, called the local condition. Each variable assignment (valuation) that satisfies a boolean expression called the global condition defines a possible world, derived by retaining only tuples whose local conditions are satisfied under the valuation. Computing certain answers for first-order queries is coNP-complete (Vardi, 1986; Abiteboul et al., 1991), even for Codd-tables. Since the result of any first-order query over a Codd-table can be represented as a C-table, and evaluating a query in this fashion is efficient, it follows that determining whether a tuple is certain in a C-table cannot be in PTIME unless P = NP. Instead, consider the following sufficient, but not necessary, condition for a tuple to be certain: if (1) a tuple in a C-table contains only constants and (2) its local condition is a tautology, then the tuple is certain. To see why, recall that under the closed-world assumption, a C-table represents a set of possible worlds, one for each valuation of the variables appearing in the C-table to constants from the domain. A tuple is part of the possible world corresponding to such a valuation if the tuple's local condition is satisfied under the valuation. Thus, a tuple consisting of constants only, with a local condition that is a tautology, is part of every possible world represented by the C-table. If the local condition of a tuple is in conjunctive normal form (CNF), then checking whether it is a tautology is efficient (PTIME). Our labeling scheme for C-tables applies this sufficient condition only to local conditions in CNF and, thus, is c-sound. Formally, writing $\phi_{\mathcal{D}}(t)$ for the local condition of a tuple $t$ in a C-table $\mathcal{D}$, the scheme labels $t$ as follows:

$$\mathcal{L}(t) := \begin{cases} \text{true} & \text{if } t \text{ contains only constants, } \phi_{\mathcal{D}}(t) \text{ is in CNF, and } \phi_{\mathcal{D}}(t) \text{ is a tautology} \\ \text{false} & \text{otherwise} \end{cases}$$
Green and Tannen (Green and Tannen, 2006) introduced PC-tables, a probabilistic version of C-tables where each variable is associated with a probability distribution over its possible values. Variables are considered independent of each other, i.e., the probability of a possible world is computed as the product of the probabilities of the individual variable assignments from which the world was created. Our labeling scheme works for both the incomplete and the probabilistic version of C-tables.
Theorem 2 (The C-table labeling scheme is c-sound).
Given an incomplete database encoded as C-tables, the labeling produced by this scheme is c-sound.
Note that this labeling scheme is not guaranteed to be c-correct. For instance, a tuple consisting only of constants whose local condition is a tautology is guaranteed to be certain, but the scheme labels it as uncertain if the local condition is not in CNF.
Example 9.
Consider a C-table consisting of two tuples $t_1 = (1)$ with local condition $x = 1$ and $t_2 = (1)$ with local condition $x \neq 1$. The labeling scheme would mark $(1)$ as uncertain because, even though this tuple exists in the C-table and its local condition is in CNF, the local condition is not a tautology. However, tuple $(1)$ is certain: either $x = 1$, and then the local condition of the first tuple evaluates to true, or $x \neq 1$, and the second tuple is included in the possible world.
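The sufficient condition used by the C-table labeling scheme can be sketched as follows. This is a simplified illustration under several assumptions: local conditions are given in CNF as lists of clauses, each clause is a list of `(atom, is_positive)` literals, and atoms are treated as opaque propositional symbols, so a clause counts as a tautology only if it contains some atom both positively and negatively. The `Var` class, the function names, and the example conditions are hypothetical.

```python
class Var:
    """Marker for a C-table variable (hypothetical helper)."""

    def __init__(self, name):
        self.name = name


def clause_is_tautology(clause):
    """A clause is a tautology if some atom occurs with both polarities.

    Assumption: atoms are opaque symbols; tautologies that depend on the
    meaning of comparisons are missed, which keeps the check sound.
    """
    positive = {atom for atom, is_positive in clause if is_positive}
    negative = {atom for atom, is_positive in clause if not is_positive}
    return bool(positive & negative)


def label_ctable_tuple(values, cnf):
    """Return True (certain) iff the tuple has only constants and its CNF
    local condition is a tautology, i.e., every clause is a tautology."""
    only_constants = not any(isinstance(v, Var) for v in values)
    return only_constants and all(clause_is_tautology(c) for c in cnf)


# Two copies of the tuple (1) with complementary conditions, mirroring the
# structure of the example above (the concrete atoms are assumptions).
t1 = ((1,), [[("x=1", True)]])
t2 = ((1,), [[("x=1", False)]])
labels = [label_ctable_tuple(values, cnf) for values, cnf in (t1, t2)]

# A single copy whose condition is the tautology (x=1 OR NOT x=1).
t3 = ((1,), [[("x=1", True), ("x=1", False)]])
label_t3 = label_ctable_tuple(*t3)
```

Both copies in the first case are labeled uncertain although the tuple is certain, which illustrates why the scheme is c-sound but not c-correct. The single copy with the tautological condition is labeled certain.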
x-DBs. An x-DB (Agrawal et al., 2006) is a set of x-relations, which are sets of x-tuples. An x-tuple $\tau$ is a set of tuples (its alternatives) with a label indicating whether the x-tuple is optional. Each x-tuple is assumed to be independent of the others, and its alternatives are assumed to be disjoint. Thus, a possible world of an x-relation $R$ is constructed by selecting, for every x-tuple $\tau$ from $R$, at most one alternative if $\tau$ is optional, or exactly one if it is not optional. The probabilistic version of x-DBs (also called a Block-Independent database or BI-DB), as introduced in (Agrawal et al., 2006), assigns each alternative $t$ a probability $P(t)$, and we require that $\sum_{t \in \tau} P(t) \leq 1$. Thus, an x-tuple $\tau$ is optional if $\sum_{t \in \tau} P(t) < 1$, and there is no need to use labels to mark optional x-tuples. We use $|\tau|$ to denote the number of alternatives of x-tuple $\tau$. We define a labeling scheme for x-relations that annotates a tuple $t$ from an x-DB with true (certain) if $t$ is the single alternative of a non-optional x-tuple, and with false otherwise. In probabilistic x-DBs, we check whether $P(t) = 1$.
Theorem 3 (The x-DB labeling scheme is c-correct).
Given an x-DB $\mathcal{D}$, the labeling produced by this scheme for $\mathcal{D}$ is a c-correct labeling.
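The x-DB labeling rule can be sketched as follows for the incomplete (non-probabilistic) case. The sketch assumes set semantics and represents an x-relation as a list of `(alternatives, optional)` pairs; this encoding and the name `label_xdb` are illustrative assumptions.

```python
def label_xdb(x_relation):
    """Label a tuple as certain iff it is the single alternative of some
    non-optional x-tuple.

    `x_relation` is a list of (alternatives, optional) pairs, where
    `alternatives` is a list of tuples (assumed representation).
    """
    labeling = {}
    for alternatives, optional in x_relation:
        certain = len(alternatives) == 1 and not optional
        for t in alternatives:
            # A tuple stays certain if any x-tuple already made it certain.
            labeling[t] = labeling.get(t, False) or certain
    return labeling


# Hypothetical x-relation with three x-tuples.
x_relation = [
    ([("a", 1)], False),            # single non-optional alternative
    ([("b", 1), ("b", 2)], False),  # two alternatives
    ([("c", 3)], True),             # single but optional alternative
]
labeling = label_xdb(x_relation)
```

Only the first x-tuple yields a certain tuple. The second has competing alternatives, and the third may be absent from a possible world.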
4.2. Extracting best-guess worlds
Computing some possible world is trivial for most incomplete and probabilistic data models. However, for probabilistic data models we are particularly interested in the highest-probability world, the best-guess world (BGW). We now discuss in more detail how we choose the BGW for the data models for which we have introduced labeling schemes above.
TI-DB. For a TI-DB $\mathcal{D}$, the best-guess world consists of all tuples $t$ such that $P(t) \geq 0.5$. To understand why this is the case, recall that the probability of a world from a TI-DB is the product of the probabilities of included tuples and of one minus the probability of excluded tuples. This probability is maximized by including only tuples where $P(t) \geq 1 - P(t)$, i.e., $P(t) \geq 0.5$. For the incomplete version of TI-DBs, we have to include all non-optional tuples and can choose arbitrarily which optional tuples to include in the BGW.
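A minimal sketch of best-guess world extraction for probabilistic TI-DBs follows. It assumes the TI-DB is a dictionary from tuples to marginal probabilities; the function names and the example data are hypothetical.

```python
from math import prod


def best_guess_world_tidb(tidb):
    """Keep exactly the tuples whose marginal probability is at least 0.5."""
    return {t for t, p in tidb.items() if p >= 0.5}


def world_probability(tidb, world):
    """Probability of a world: P(t) for included tuples, 1 - P(t) otherwise."""
    return prod(p if t in world else 1.0 - p for t, p in tidb.items())


# Hypothetical TI-DB.
tidb = {"t1": 1.0, "t2": 0.7, "t3": 0.2}
bgw = best_guess_world_tidb(tidb)
p_bgw = world_probability(tidb, bgw)
p_other = world_probability(tidb, {"t1"})
```

Dropping `t2` from the world replaces the factor 0.7 by 0.3, so the alternative world `{"t1"}` is less likely than the best-guess world.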
PC-tables. For a PC-table, computing the most likely possible world reduces to answering a query over the database, which is known to be #P-hard in general (Suciu et al., 2011). Specific tables (e.g., those generated by “safe” queries (Suciu et al., 2011)) admit PTIME solutions. Alternatively, there exists a wide range of algorithms (Gatterbauer and Suciu, 2017; Fink et al., 2011, 2013; Li et al., 2011) that can be used to compute an arbitrarily close approximation of the most likely world.
Disjoint-independent databases. Since the x-tuples in an x-DB are independent of each other, the probability of a possible world from an x-DB is maximized by including, for every x-tuple $\tau$, its alternative with the highest probability, or no alternative if $1 - \sum_{t \in \tau} P(t) > \max_{t \in \tau} P(t)$, i.e., if the probability of not including any alternative for the x-tuple is higher than the highest probability of an alternative for the x-tuple.
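The per-x-tuple choice described above can be sketched as follows. The sketch assumes that each x-tuple is a dictionary from alternatives to probabilities that sum to at most one; the representation, the function name, and the example data are illustrative assumptions.

```python
def best_guess_world_xdb(x_relation):
    """Pick, for each x-tuple, its most likely alternative, unless choosing
    no alternative at all is strictly more likely."""
    world = []
    for alternatives in x_relation:
        best_tuple, best_p = max(alternatives.items(), key=lambda kv: kv[1])
        p_none = 1.0 - sum(alternatives.values())
        if best_p >= p_none:
            world.append(best_tuple)
    return world


# Hypothetical probabilistic x-relation with four x-tuples.
x_relation = [
    {("a", 1): 0.6, ("a", 2): 0.4},  # not optional: take the 0.6 alternative
    {("b", 1): 0.3},                 # absent with probability 0.7
    {("c", 1): 0.2, ("c", 2): 0.1},  # absent with probability 0.7
    {("d", 1): 0.5, ("d", 2): 0.2},  # absent with probability 0.3
]
bgw = best_guess_world_xdb(x_relation)
```

Because x-tuples are independent, making the locally most likely choice for each x-tuple also maximizes the probability of the resulting world.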
We now introduce UA-DBs (uncertainty-annotated databases), which encode both under- and over-approximations of the certain annotations of an incomplete $\mathcal{K}_W$-database $\mathcal{D}$. This is achieved by annotating every tuple $t$ with a pair $[c, d]$, where $d$ records the tuple's annotation in the BGW (i.e., $d = \mathcal{D}(t)[i]$ for some world $i \in W$) and $c$ stores the under-approximation of the tuple's certain annotation (i.e., $c \preceq_{\mathcal{K}} \mathrm{cert}_{\mathcal{K}}(\mathcal{D}, t)$). Both under- and over-approximations of certain annotations assign tuples annotations from $\mathcal{K}$, making them $\mathcal{K}$-databases. This will be important for proving that these bounds are preserved under queries. Every possible world is by definition a superset of the certain tuples, so a UA-DB contains all certain answers, even though the certainty of some answers may be underestimated. We start by formally defining the annotation domains of UA-DBs and mappings that extract the two components of an annotation. Afterwards, we state the main result of this section: queries over UA-DBs preserve the under- and over-approximation of certain annotations.
We define a UA-semiring as the semiring $\mathcal{K}^2$, i.e., the direct product of a semiring $\mathcal{K}$ with itself (see Section 5.1). In the following, we omit the underlying semiring from the notation if it is clear from the context. Recall that operations in $\mathcal{K}^2$ are defined pointwise, e.g., $[k_1, k_1'] +_{\mathcal{K}^2} [k_2, k_2'] = [k_1 +_{\mathcal{K}} k_2, k_1' +_{\mathcal{K}} k_2']$.
Definition 3 (UA-semiring).
Let $\mathcal{K}$ be a semiring. We define the corresponding UA-semiring as $\mathcal{K}_{UA} := \mathcal{K}^2$.
Note that for any semiring $\mathcal{K}$, $\mathcal{K}_{UA}$ is a semiring because, as mentioned earlier, products of semirings are semirings.
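To make the product construction concrete, the following Python sketch builds a UA-semiring from a base semiring by applying all operations pointwise to pairs. The `Semiring` record, the function `ua_semiring`, and the choice of the natural numbers as the base semiring are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    """Minimal semiring record (hypothetical helper type)."""
    zero: Any
    one: Any
    add: Callable[[Any, Any], Any]
    mul: Callable[[Any, Any], Any]


def ua_semiring(k: Semiring) -> Semiring:
    """Direct product of `k` with itself; operations act pointwise on pairs."""
    return Semiring(
        zero=(k.zero, k.zero),
        one=(k.one, k.one),
        add=lambda a, b: (k.add(a[0], b[0]), k.add(a[1], b[1])),
        mul=lambda a, b: (k.mul(a[0], b[0]), k.mul(a[1], b[1])),
    )


# Bag semantics: the natural numbers with ordinary addition and multiplication.
NAT = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
NAT_UA = ua_semiring(NAT)

pair_sum = NAT_UA.add((1, 2), (0, 3))
pair_product = NAT_UA.mul((1, 2), (0, 3))
```

Since both components are combined independently, the neutral elements of the product are simply pairs of the base semiring's neutral elements.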
5.2. Creating UA-DBs
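We now discuss how to derive UA-relations from a $\mathcal{K}_W$-database or a compact encoding of a $\mathcal{K}_W$-database using some uncertain data model like C-tables. Consider a $\mathcal{K}_W$-database $\mathcal{D}$, let $D$ be one of its worlds and $\mathcal{L}$ a $\mathcal{K}$-database under-approximating the certain annotations of $\mathcal{D}$. We refer to $\mathcal{L}$ as a labeling and will study such labelings in depth in Sections 6 and 7. We cover in Section 4 how to generate a UA-DB from common uncertain data models by extracting a (best-guess) world $D$ and a labeling $\mathcal{L}$. We construct a UA-DB $\mathcal{D}_{UA}$ as an encoding of $D$ and $\mathcal{L}$ by setting for every tuple $t$:

$$\mathcal{D}_{UA}(t) := [\mathcal{L}(t), D(t)]$$

For a UA-DB constructed in this fashion, we say that $\mathcal{D}_{UA}$ approximates $\mathcal{D}$ by encoding $(D, \mathcal{L})$. Given a UA-DB $\mathcal{D}_{UA}$, we would like to be able to restore $\mathcal{L}$ and $D$ from $\mathcal{D}_{UA}$. For that we define two morphisms $h_{cert}, h_{det} : \mathcal{K}_{UA} \to \mathcal{K}$:

$$h_{cert}([c, d]) := c \qquad h_{det}([c, d]) := d$$

Note that by construction, if a UA-DB $\mathcal{D}_{UA}$ is an encoding of a possible world $D$ and a labeling $\mathcal{L}$ of a $\mathcal{K}_W$-database $\mathcal{D}$, then $h_{cert}(\mathcal{D}_{UA}) = \mathcal{L}$ and $h_{det}(\mathcal{D}_{UA}) = D$.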
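The encoding of a world and a labeling into a UA-DB, together with the two projections that restore them, can be sketched as follows. The sketch assumes bag semantics, with relations stored as dictionaries from tuples to multiplicities and absent tuples having multiplicity 0. The pair order (labeling component first) and the names `encode_ua`, `h_cert`, and `h_det` are illustrative choices.

```python
def encode_ua(world, labeling):
    """Annotate every tuple with the pair (label, multiplicity in the world)."""
    tuples = set(world) | set(labeling)
    return {t: (labeling.get(t, 0), world.get(t, 0)) for t in tuples}


def h_cert(ua_db):
    """Project each annotation on its labeling (under-approximation) part."""
    return {t: c for t, (c, d) in ua_db.items() if c != 0}


def h_det(ua_db):
    """Project each annotation on its possible-world part."""
    return {t: d for t, (c, d) in ua_db.items() if d != 0}


# Hypothetical best-guess world and c-sound labeling.
world = {("a",): 2, ("b",): 1}
labeling = {("a",): 1}
ua_db = encode_ua(world, labeling)
```

Applying the two projections to `ua_db` returns the original labeling and world, which is the round-trip property stated in the text.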
5.3. Querying UA-DBs
We now state the main result of this section: query evaluation over UA-DBs preserves the under-approximation and over-approximation of certain annotations. To prove the main result, we first show that the two morphisms that extract the components of a UA-annotation are homomorphisms, because this implies that queries over UA-DBs are evaluated over the labeling component and the possible-world component of an annotation independently. Thus, we can prove the result for under- and over-approximations separately. For the over-approximation, we can trivially show an even stronger result: by definition (Section 3.2), the possible world used as an over-approximation is preserved exactly. Hence, the over-approximation property is preserved, and UA-DBs are also backwards compatible with BGQP. For the under-approximation, we have to show that query evaluation preserves under-approximations. This part is more involved, and we prove this result in Section 7.
Theorem 4 (Queries Preserve Bounds).
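Let $\mathcal{D}$ be a $\mathcal{K}_W$-database, $\mathcal{L}$ a labeling for $\mathcal{D}$, $D$ one of its possible worlds, and $\mathcal{D}_{UA}$ be the UA-DB encoding the pair $(D, \mathcal{L})$. Clearly, $\mathcal{D}_{UA}$ approximates $\mathcal{D}$. Then, for a query $Q$, $Q(\mathcal{D}_{UA})$ is an approximation for $Q(\mathcal{D})$ encoding the pair $(Q(D), Q(\mathcal{L}))$.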
6. Uncertainty Labelings
We now define uncertainty labelings, which are $\mathcal{K}$-databases whose annotations over- or under-approximate the certain annotations of tuples in a $\mathcal{K}_W$-database $\mathcal{D}$ with respect to the natural order of semiring $\mathcal{K}$. A labeling scheme is a mapping from incomplete databases to labelings.
Definition 4 (Uncertainty Labeling Scheme).
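Let $\mathcal{DB}_{\mathcal{K}}$ be the set of all $\mathcal{K}$-databases, $M$ an incomplete/probabilistic data model, and $\mathcal{DB}_M$ the set of all possible instances of this model. An uncertainty labeling scheme is a function $label : \mathcal{DB}_M \to \mathcal{DB}_{\mathcal{K}}$ such that, for every instance $\mathcal{D} \in \mathcal{DB}_M$, the labeling $label(\mathcal{D})$ has the same schema as $\mathcal{D}$.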
Ideally, we would like the label (annotation) of a tuple $t$ from an uncertainty labeling $\mathcal{L}$ to be exactly its certain annotation $\mathrm{cert}_{\mathcal{K}}(\mathcal{D}, t)$. Observe that an exact labeling can always be computed in time proportional to the number of worlds $|W|$ if all worlds of the incomplete database can be enumerated. However, the number of possible worlds is frequently exponential in the data size. Thus, most incomplete data models rely on factorized encodings, with size typically logarithmic in $|W|$. Ideally, we would like labeling schemes to be PTIME in the size of the encoding (rather than in $|W|$). As mentioned in the introduction, computing certain answers is coNP-complete, so for tractable query semantics we must accept that $\mathcal{L}(t)$ may either over- or under-approximate $\mathrm{cert}_{\mathcal{K}}(\mathcal{D}, t)$ (with respect to $\preceq_{\mathcal{K}}$). For instance, under bag semantics (semiring $\mathbb{N}$), a label may be smaller or larger than the certain multiplicity of a tuple. We call a labeling c-sound (no false positives) if it consistently under-approximates the certain annotation of tuples, c-complete (no false negatives) if it consistently over-approximates certainty, and c-correct if it annotates every tuple with its certain annotation. We also apply this terminology to labeling schemes, e.g., a c-sound labeling scheme only produces c-sound labelings. For UA-DBs, we are mainly interested in c-sound labeling schemes to provide an under-approximation of certain annotations.
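Definition 5.
Let $\mathcal{L}$ be an uncertainty labeling for a $\mathcal{K}_W$-database $\mathcal{D}$.
| We call $\mathcal{L}$ … | … iff for all tuples $t$ … |
| c-sound | $\mathcal{L}(t) \preceq_{\mathcal{K}} \mathrm{cert}_{\mathcal{K}}(\mathcal{D}, t)$ |
| c-complete | $\mathrm{cert}_{\mathcal{K}}(\mathcal{D}, t) \preceq_{\mathcal{K}} \mathcal{L}(t)$ |
| c-correct | $\mathcal{L}(t) = \mathrm{cert}_{\mathcal{K}}(\mathcal{D}, t)$ |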
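The three properties can be checked by brute force when all possible worlds are available explicitly. The Python sketch below assumes bag semantics, with each world and each labeling given as a dictionary from tuples to multiplicities (absent tuples count as 0) and certain multiplicities computed as the minimum over all worlds. The function names and the example data are hypothetical.

```python
def certain_annotations(worlds):
    """Certain multiplicity of every tuple: the minimum over all worlds."""
    tuples = set().union(*worlds)
    return {t: min(w.get(t, 0) for w in worlds) for t in tuples}


def classify(labeling, worlds):
    """Report whether a labeling is c-sound, c-complete, and c-correct."""
    cert = certain_annotations(worlds)
    tuples = set(cert) | set(labeling)
    sound = all(labeling.get(t, 0) <= cert.get(t, 0) for t in tuples)
    complete = all(cert.get(t, 0) <= labeling.get(t, 0) for t in tuples)
    return {"c-sound": sound, "c-complete": complete,
            "c-correct": sound and complete}


# Hypothetical incomplete database with two possible worlds.
worlds = [{"a": 2, "b": 1}, {"a": 3}]

under = classify({"a": 1}, worlds)          # underestimates tuple a
exact = classify({"a": 2}, worlds)          # matches the certain annotations
over = classify({"a": 2, "b": 1}, worlds)   # overestimates tuple b
```

Enumerating worlds in this way is only feasible for tiny examples, which is why the labeling schemes operate on the compact encoding instead.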
A labeling is both c-sound and c-complete iff it is c-correct. Ideally, queries over labelings would preserve these bounds.
Definition 6 (Preservation of Bounds).
A query semantics for uncertainty labelings preserves a property $X$ (c-soundness, c-completeness, or c-correctness) wrt. a class of queries $\mathcal{C}$ if, for any incomplete database $\mathcal{D}$, labeling $\mathcal{L}$ for $\mathcal{D}$ that has property $X$, and query $Q \in \mathcal{C}$, we have: $Q(\mathcal{L})$ is an uncertainty labeling for $Q(\mathcal{D})$ with property $X$.
7. Querying Labelings
We now study whether queries over labelings produced by labeling schemes such as the ones described in Section 4 preserve c-soundness. Specifically, we demonstrate that standard $\mathcal{K}$-relational query evaluation preserves c-soundness for any c-sound labeling scheme. Recall that a query semantics for labelings preserves c-soundness if a query $Q$ evaluated on a c-sound labeling of an incomplete database $\mathcal{D}$ is a c-sound labeling for $Q(\mathcal{D})$. Our result generalizes a previous result of Reiter (Reiter, 1986) to any type of incomplete $\mathcal{K}_W$-database for which we can define an efficient c-sound labeling scheme. We need the following lemma, which shows that the natural order of a semiring factors through addition and multiplication. This is a known result that we state only for completeness.
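Lemma 2.
Let $\mathcal{K}$ be a naturally ordered semiring. For all $k_1, k_2, k_3, k_4 \in \mathcal{K}$ we have:

$$k_1 \preceq_{\mathcal{K}} k_3 \wedge k_2 \preceq_{\mathcal{K}} k_4 \;\Rightarrow\; k_1 +_{\mathcal{K}} k_2 \preceq_{\mathcal{K}} k_3 +_{\mathcal{K}} k_4 \quad\text{and}\quad k_1 \cdot_{\mathcal{K}} k_2 \preceq_{\mathcal{K}} k_3 \cdot_{\mathcal{K}} k_4$$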
7.1. Preservation of C-Soundness
We now prove that $\mathcal{RA}^+$ over labelings preserves c-soundness. Since queries over both $\mathcal{K}_W$-databases and labelings have $\mathcal{K}$-relational query semantics, we can make use of the fact that $\mathcal{RA}^+$ over $\mathcal{K}$-relations is defined using $+_{\mathcal{K}}$ and $\cdot_{\mathcal{K}}$. At a high level, the argument is as follows: (1) we show that $\mathrm{cert}_{\mathcal{K}}$ applied to the result of an addition (or multiplication) of two $\mathcal{K}_W$-elements $k_1$ and $k_2$ yields a result that is at least as large (wrt. $\preceq_{\mathcal{K}}$) as adding (or multiplying) the results of applying $\mathrm{cert}_{\mathcal{K}}$ to $k_1$ and $k_2$; (2) since c-sound labelings for an input $\mathcal{D}$ provide a lower bound on $\mathrm{cert}_{\mathcal{K}}(\mathcal{D})$, we can apply Lemma 2 to show that the query result over a c-sound (or c-correct) labeling is a lower bound for $\mathrm{cert}_{\mathcal{K}}$ of the result of the query. Combining both arguments, we get preservation of c-soundness.
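Functions that have the property mentioned in (1) are called superadditive and supermultiplicative. Formally, a function $f : A \to B$, where $A$ and $B$ are closed under addition and multiplication and $B$ is ordered (order $\preceq_B$), is superadditive (supermultiplicative) iff for all $a_1, a_2 \in A$:

$$f(a_1 + a_2) \succeq_B f(a_1) + f(a_2) \qquad \left( f(a_1 \cdot a_2) \succeq_B f(a_1) \cdot f(a_2) \right)$$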
In a nutshell, if we are given a c-sound $\mathcal{K}$-labeling, then evaluating any $\mathcal{RA}^+$-query over the labeling using $\mathcal{K}$-relational query semantics preserves c-soundness if we can prove that $\mathrm{cert}_{\mathcal{K}}$ is superadditive and supermultiplicative.
Lemma 3.
Let $\mathcal{K}$ be a semiring. $\mathrm{cert}_{\mathcal{K}}$ is superadditive and supermultiplicative wrt. the natural order $\preceq_{\mathcal{K}}$.
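For intuition, superadditivity and supermultiplicativity can be checked by brute force on small instances. The sketch below assumes bag semantics, where an annotation is a vector of natural numbers (one entry per world), operations are pointwise, and the certain annotation is the minimum entry. The function names and the bounds on the search space are arbitrary choices for this illustration, and the check is a sanity test rather than a proof.

```python
from itertools import product


def cert(vec):
    """Certain annotation of a vector of multiplicities (assumed to be min)."""
    return min(vec)


def cert_is_super(max_value=3, num_worlds=2):
    """Check cert(k1 + k2) >= cert(k1) + cert(k2) and
    cert(k1 * k2) >= cert(k1) * cert(k2) for all small vectors."""
    vectors = list(product(range(max_value + 1), repeat=num_worlds))
    for k1, k2 in product(vectors, repeat=2):
        added = tuple(a + b for a, b in zip(k1, k2))
        multiplied = tuple(a * b for a, b in zip(k1, k2))
        if cert(added) < cert(k1) + cert(k2):
            return False
        if cert(multiplied) < cert(k1) * cert(k2):
            return False
    return True


holds = cert_is_super()

# The inequality can be strict: each input is uncertain, but their sum is not.
k1, k2 = (1, 0), (0, 1)
strict_gap = cert((1, 1)) - (cert(k1) + cert(k2))
```

The last two lines show why a query result over a c-sound labeling may underestimate certainty: two tuples that are each missing from some world can still combine into a result that is present in every world.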
Using the superadditivity and supermultiplicativity of $\mathrm{cert}_{\mathcal{K}}$, we now prove preservation of c-soundness. We first prove a restricted version of this result.
Lemma 4.
Let $\mathcal{D}$ be a $\mathcal{K}_W$-database and $\mathcal{L}$ be a c-correct $\mathcal{K}$-labeling for $\mathcal{D}$. $\mathcal{RA}^+$ queries over $\mathcal{L}$ preserve c-soundness.
The major drawback of Lemma 4 is that it is limited to c-correct input labelings. Next, we show that c-soundness is still preserved even if the input labeling is only c-sound.
Theorem 5.
Let $\mathcal{D}$ be a $\mathcal{K}_W$-database and $\mathcal{L}$ a c-sound labeling for $\mathcal{D}$. $\mathcal{RA}^+$ queries over $\mathcal{L}$ preserve c-soundness.
In Appendix 8 we demonstrate that under certain circumstances, queries also preserve c-completeness.
8. Preservation of C-Completeness
TI-DBs. We now demonstrate that positive queries preserve c-completeness if the input is a labeling produced by the c-complete labeling scheme for TI-DBs (Section 4). To show this, we observe that if there exists a world in which two $\mathcal{K}_W$-elements $k_1$ and $k_2$ are both minimal, then $\mathrm{cert}_{\mathcal{K}}$ commutes with addition and multiplication of $k_1$ and $k_2$, and standard $\mathcal{K}$-relational semantics preserve c-completeness.
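Lemma 5.
Let $k_1, k_2 \in \mathcal{K}_W$ for some possible world semiring $\mathcal{K}_W$. If there exists $i \in W$ such that $k_1[i] = \mathrm{cert}_{\mathcal{K}}(k_1)$ and $k_2[i] = \mathrm{cert}_{\mathcal{K}}(k_2)$, then the following holds:

$$\mathrm{cert}_{\mathcal{K}}(k_1 +_{\mathcal{K}_W} k_2) = \mathrm{cert}_{\mathcal{K}}(k_1) +_{\mathcal{K}} \mathrm{cert}_{\mathcal{K}}(k_2) \qquad \mathrm{cert}_{\mathcal{K}}(k_1 \cdot_{\mathcal{K}_W} k_2) = \mathrm{cert}_{\mathcal{K}}(k_1) \cdot_{\mathcal{K}} \mathrm{cert}_{\mathcal{K}}(k_2)$$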
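The commutation property can be sanity-checked by brute force under bag semantics. The sketch assumes that annotations are vectors of natural numbers with pointwise operations and that the certain annotation is the minimum entry. The function names and search bounds are illustrative assumptions, and the check does not replace the proof.

```python
from itertools import product


def cert(vec):
    """Certain annotation of a vector of multiplicities (assumed to be min)."""
    return min(vec)


def share_minimal_world(k1, k2):
    """True iff some world holds the minimum of both vectors at once."""
    return any(a == cert(k1) and b == cert(k2) for a, b in zip(k1, k2))


def cert_commutes_when_shared(max_value=3, num_worlds=3):
    """Check that cert commutes with + and * whenever the precondition holds."""
    vectors = list(product(range(max_value + 1), repeat=num_worlds))
    for k1, k2 in product(vectors, repeat=2):
        if not share_minimal_world(k1, k2):
            continue
        added = tuple(a + b for a, b in zip(k1, k2))
        multiplied = tuple(a * b for a, b in zip(k1, k2))
        if cert(added) != cert(k1) + cert(k2):
            return False
        if cert(multiplied) != cert(k1) * cert(k2):
            return False
    return True


commutes = cert_commutes_when_shared()
```

Without the shared minimal world the equality can fail, e.g., for the vectors `(1, 0)` and `(0, 1)`, whose pointwise sum has certain annotation 1 although both inputs have certain annotation 0.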
To demonstrate c-completeness preservation for TI-DBs, we have to show that the encoding of a TI-DB as a $\mathcal{K}_W$-database fulfills the precondition of Lemma 5.
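Lemma 6.
Let $\mathcal{D}$ be a $\mathcal{K}_W$-database that represents a TI-DB. Then there exists $i \in W$ such that for any tuple $t$: $\mathcal{D}(t)[i] = \mathrm{cert}_{\mathcal{K}}(\mathcal{D}, t)$.
Consider the possible world $D$ defined as follows: $D(t) := 1_{\mathcal{K}}$ if $P(t) = 1$, and $D(t) := 0_{\mathcal{K}}$ otherwise.
This world exists because in a TI-DB all tuples with probability $1$ have annotation $1_{\mathcal{K}}$ in all worlds. Furthermore, since the tuples are independent events, there must exist one world containing no tuples with probability less than $1$. Let $i$ denote the identifier of this world, and denote $\mathcal{D}(t)$ by $\vec{k}$. (Case 1) $P(t) = 1$, and so $\vec{k}[i] = 1_{\mathcal{K}} = \mathrm{cert}_{\mathcal{K}}(\vec{k})$. (Case 2) $P(t) < 1$ and $\vec{k}[i] = 0_{\mathcal{K}}$. Because $0_{\mathcal{K}} \preceq_{\mathcal{K}} k$ for every $k \in \mathcal{K}$, it follows that $\mathrm{cert}_{\mathcal{K}}(\vec{k}) = 0_{\mathcal{K}}$. As a result, $\vec{k}[i] = \mathrm{cert}_{\mathcal{K}}(\vec{k})$ in both cases. ∎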
Corollary 1.
Let $\mathcal{L}$ be a labeling for a TI-DB $\mathcal{D}$ computed by the TI-DB labeling scheme from Section 4. Then $\mathcal{RA}^+$ over $\mathcal{L}$ preserves c-completeness.
x-DBs. In general, queries over labelings derived from x-DBs using our labeling scheme from Section 4 do not preserve c-completeness. We present a sufficient condition for a query to preserve c-completeness over such a labeling. To this end, we define x-keys, constraints that ensure that the alternatives within the scope of an x-tuple are not all identical if projected on a set of attributes $A$. Since our labeling scheme for x-DBs is c-complete, queries preserve c-completeness unless a result tuple that is certain is derived from multiple correlated uncertain input tuples. Since x-tuples from an x-DB are independent of each other, this can only be the case if, in every possible world, a result tuple is derived from some alternative of an x-tuple $\tau$ (i.e., where $\tau$ is not optional). Such a situation can be avoided if it is guaranteed that it is impossible for a result tuple to be derived from all alternatives of an x-tuple.
Definition 7 (x-key).
Let be an x-relation with schema . A set of attributes is called an x-key for iff