Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers
(extended version)
Abstract.
Certain answers are a principled method for coping with uncertainty that arises in many practical data management tasks. Unfortunately, this method is expensive and may exclude useful (if uncertain) answers. Thus, users frequently resort to less principled approaches to resolve the uncertainty. In this paper, we propose Uncertainty Annotated Databases (UADBs), which combine an under- and over-approximation of certain answers to achieve the reliability of certain answers with the performance of a classical database system. Furthermore, in contrast to prior work on certain answers, UADBs achieve a higher utility by including some (explicitly marked) answers that are not certain. UADBs are based on incomplete $\mathcal{K}$-relations, which we introduce to generalize the classical set-based notions of incomplete databases and certain answers to a much larger class of data models. Using an implementation of our approach, we demonstrate experimentally that it efficiently produces tight approximations of certain answers that are of high utility.
1. Introduction
Data uncertainty arises naturally in applications like sensing (Letchner et al., 2009), data exchange (Fagin et al., 2011), distributed computing (Lang et al., 2014), data cleaning (Chu et al., 2015), and many others. Incomplete (Imielinski and Jr., 1984) and probabilistic databases (Suciu et al., 2011) have emerged as a principled way to deal with uncertainty. Both types of databases consist of a set of deterministic instances called possible worlds that represent possible interpretations of data available about the real world. An often-cited, conservative approach to uncertainty is to consider only certain answers (Abiteboul et al., 1991; Imielinski and Jr., 1984), i.e., answers that appear in all possible worlds. However, this approach has two problems. First, computing certain answers is expensive: it is coNP-complete in data complexity (Abiteboul et al., 1991; Imielinski and Jr., 1984) for first-order queries over V-tables (Imielinski and Jr., 1984), as well as for conjunctive queries over, e.g., OR-databases (Imielinski et al., 1995). Second, requiring answers to be certain may unnecessarily exclude useful, possible answers. Thus, users instead resort to what we term best-guess query processing (BGQP): making an educated guess about which possible world to use (i.e., how to interpret available data) and then working exclusively with this world. BGQP is more efficient than computing certain answers, and generally includes more useful results. However, information about uncertainty in the data is lost, and all query results produced by BGQP are consequently suspect.
Previous work has also explored approximations of certain answers (Libkin, 2016; Geerts et al., 2017; Reiter, 1986). Under the premise that missing a certain answer is better than incorrectly reporting an answer as certain, such work focuses on under-approximating certain answers. This addresses the performance problem, but under-approximations only exacerbate the problem of excluded results. Worse, these techniques are limited to specific uncertain data models such as V-tables, and with the exception of a brief discussion in (Guagliardo and Libkin, 2017), only support set semantics.




Example 1.
Geocoders translate natural language descriptions of locations into coordinates (i.e., latitude and longitude). Consider the ADDR and LOC relations in Figure 2. Tuples 2 and 3 of ADDR each have an ambiguous geocoding. This is an x-table (Agrawal et al., 2006), a type of incomplete data model where each tuple may have multiple alternatives. Each possible world is defined by some combination of alternatives (e.g., ADDR encodes 4 possible worlds). An analyst might use a spatial join with a lookup table (LOC) to map coordinates to geographic regions. Figure 2(a) shows the result of the following query in one world.
The certain answers to this query (Figure 2(b)) are tuples that appear in the result, regardless of which world is queried. Figure 2(c) shows all possible answers that could be returned for some choice of geocodings. Note also that ambiguous answers (e.g., address 2) may not be certain, but may still be useful.
Ideally, we would like an approach that (1) generalizes to a wide range of data models, (2) is easy to use like BGQP, (3) is compatible with a wide range of probabilistic and incomplete data representations (e.g., tuple-independent databases (Suciu et al., 2011), C-tables (Imielinski and Jr., 1984), and x-DBs (Agrawal et al., 2006)) and sources of uncertainty (e.g., inconsistent databases (Koutris and Wijsen, 2018; Kolaitis et al., 2013; Kolaitis and Pema, 2012; Bertossi, 2011; Fuxman and Miller, 2005; Arenas et al., 1999), imputation of missing values, and more), and (4) is principled like certain answers. We address the generality requirement (1) by rethinking incomplete data management in terms of Green et al.'s $\mathcal{K}$-database framework (Green et al., 2007). In this framework, each tuple is annotated with a value from a semiring $\mathcal{K}$. By choosing an appropriate semiring, $\mathcal{K}$-databases can encode a wide range of query processing semantics including classical set and bag semantics, as well as query processing with access control, provenance, and more. Our primary contribution here is to identify a natural, backwards-compatible generalization of certain answers to a broad class of $\mathcal{K}$-databases.
Our second major contribution is to combine an under-approximation of certain answers with best-guess query processing to create an Uncertainty-Annotated Database (UADB). A UADB is built around one distinguished possible world of an incomplete database, for instance the “best-guess” world that would normally be used in practice. This world serves as an over-approximation of certain answers. Tuples from this world are labeled as either certain or uncertain to encode an under-approximation of certain answers. As illustrated in Figure 1, a UADB sandwiches the certain answers between under- and over-approximations. A lightweight (extensional (Suciu et al., 2011)) query evaluation semantics then propagates labels while preserving these approximations.
Example 2.
Continuing with Example 1, Figure 2(d) shows the result of the same query as a set UADB. When the UADB is built, one designated possible world of ADDR is selected, for example the highest-ranked option provided by the geocoder. For this example, we select the first option for each ambiguous tuple. The result is based on this one designated possible world, which serves as an over-approximation of the certain answers. A subset of these tuples (addresses 1 and 4) is explicitly labeled as certain. This is the under-approximation: a tuple might still be certain even if it is not labeled as such. We consider the remaining tuples to be “uncertain”. In Figure 2(d), tuples 1 and 4 (resp., 2) are correctly marked as certain (resp., uncertain), while tuple 3 is misclassified as uncertain even though it appears in all worlds. We stress that even a mislabeled certain answer is still present: a UADB sandwiches the certain answers.
Figure 4 overviews our approach. We provide labeling schemes that derive a UADB from common incomplete data models. The resulting UADB bounds the certain tuples from above and below, a property preserved through queries. UADBs are both efficient and precise. We demonstrate efficiency by implementing a bag UADB as a query-rewriting front-end on top of a classical relational DBMS: UADB queries have minimal performance overhead compared to the same queries on deterministic data. We demonstrate precision both analytically and experimentally. First, under specific conditions, some of which we identify in Section 8, exactly the certain answers will be marked as certain. Second, we show experimentally that even when these conditions do not hold, the fraction of misclassified certain answers is low. Importantly, a wide range of uncertain data models can be translated into UADBs through simple and efficient transformations that (i) determine a best-guess world (BGW) and (ii) obtain an under-approximation of the certain answers. We define such transformations for three popular models of incomplete data in Section 4: tuple-independent databases (Suciu et al., 2011), x-DBs (Agrawal et al., 2006), and C-tables (Imielinski and Jr., 1984). In classical incomplete databases, where probabilities are not available, any possible world can serve as a BGW. In probabilistic databases (or any incomplete data model that ranks possible worlds), we preferentially use the possible world with the highest probability (if computationally feasible), or an approximation thereof. We emphasize that our approach does not require enumerating (or even knowing) the full set of possible worlds. As long as some possible world can be obtained, our approach is applicable. In the worst case, if no certainty information is available, our approach labels all tuples as uncertain and degrades to classical best-guess query processing.
Furthermore, our approach is also applicable in use cases like inconsistent query answering (Arenas et al., 1999) where possible worlds are defined declaratively (e.g., all repairs of an inconsistent database).
We significantly extend the state of the art on under-approximating certain answers (Libkin, 2016; Geerts et al., 2017; Reiter, 1986): (1) we combine an under-approximation with best-guess query processing, bounding certain answers from above and below; (2) we support sets, bags, and any other data model expressible as semiring annotations from a large class of semirings; (3) we support translation of a wide range of incomplete and probabilistic data models into our UADB model; (4) in contrast to certain answers, UADBs are closed under queries.
The remainder of the paper is organized as follows.
Incomplete $\mathcal{K}$-Relations. (Section 3) We introduce incomplete $\mathcal{K}$-databases, generalizing incomplete databases to $\mathcal{K}$-relations (Green et al., 2007). We then define certain annotations as a natural extension of certain answers, based on the observation that certain answers are a lower bound on the content of a world. It is thus natural to define certainty based on a greatest-lower-bound operation (GLB) for semiring annotations based on so-called l-semirings (Kostylev and Buneman, 2012), where the GLB is well behaved. We show that certain annotations correspond to the classical notion of certain answers for set (Lipski, 1979) and bag (Guagliardo and Libkin, 2017) semantics.
UADBs. (Section 5) We define UADBs as databases that annotate tuples with pairs of annotations from a semiring $\mathcal{K}$. The annotation of a tuple in a UADB bounds the certain annotation of the tuple from above and below. This is achieved by combining the annotations from one world (the over-approximation) with an under-approximation that we call an uncertainty labeling. Relying on results for under-approximations that we develop in the following sections, we prove that queries over UADBs preserve these bounds.
Under-approximating Certain Answers. (Section 6) To better understand under-approximations, we define uncertainty labelings, which are $\mathcal{K}$-relations that under-approximate the set of certain tuples for an incomplete $\mathcal{K}$-database. An uncertainty labeling is certain or c-sound (resp., c-complete) if it is a lower (resp., upper) bound on the certain annotations of tuples in a relation, and c-correct if it is both. We also extend these definitions to query semantics. A query semantics preserves c-soundness if the result of the query is a c-sound labeling for the result of evaluating the query over the input database from which the labeling was derived.
Queries over Uncertainty Labelings. (Section 7) Since labelings are $\mathcal{K}$-relations, we can evaluate queries over such labelings. We demonstrate that evaluating queries in this fashion preserves under-approximations of certain answers, generalizing a previous result for V-tables due to Reiter (Reiter, 1986). That is, queries preserve c-soundness. Furthermore, under certain conditions this query semantics returns precisely the certain answers. That is, since all queries preserve c-soundness, under these conditions queries preserve c-correctness.
Implementation for Bag Semantics. (Section 9) We implement UADBs on top of a relational DBMS. We extend the schema of relations to label tuples as certain or uncertain (e.g., Figure 2(d)). Queries with UA-relational semantics are compiled into standard relational queries over this encoding. We prove this compilation process to be correct.
Performance. (Section 11) We demonstrate experimentally that UADBs outperform state-of-the-art incomplete and probabilistic query processing schemes, and are competitive with deterministic query evaluation and other certain answer under-approximations. Furthermore, for a wide range of real-world datasets, comparatively few answers are misclassified by our approach. We also demonstrate that best-guess answers and, hence, also UADBs, can have higher utility than certain answers. Finally, we demonstrate the use of UADBs for uncertain access control annotations and bag semantics.
2. Notation and Background
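A database schema $\mathbf{D} = \{\mathbf{R}_1, \ldots, \mathbf{R}_n\}$ is a set of relation schemas. A relation schema $\mathbf{R}(A_1, \ldots, A_n)$ consists of a relation name and a set of attribute names $A_1, \ldots, A_n$. The arity $\mathrm{arity}(\mathbf{R})$ of a relation schema $\mathbf{R}$ is the number of attributes in $\mathbf{R}$. An instance $D$ for database schema $\mathbf{D}$ is a set of relation instances with one relation for each relation schema in $\mathbf{D}$: $D = \{R_1, \ldots, R_n\}$. Assume a universal domain of attribute values $\mathbb{D}$. A tuple with schema $\mathbf{R}$ is an element from $\mathbb{D}^{\mathrm{arity}(\mathbf{R})}$. In this paper, we consider both bag and set semantics. A set relation $R$ with schema $\mathbf{R}$ is a set of tuples with schema $\mathbf{R}$, i.e., $R \subseteq \mathbb{D}^{\mathrm{arity}(\mathbf{R})}$. A bag relation $R$ with schema $\mathbf{R}$ is a bag (multiset) of tuples with schema $\mathbf{R}$. We use TupDom to denote the set of all tuples over domain $\mathbb{D}$.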
2.1. Possible Worlds Semantics
Incomplete and probabilistic databases model uncertainty and its impact on query results. An incomplete database $\mathcal{D}$ is a set of deterministic database instances $D_1, \ldots, D_n$ of schema $\mathbf{D}$, called possible worlds. We write $t \in D$ to denote that a tuple $t$ appears in a specific possible world $D$.
Example 3.


Decades of research (Suciu et al., 2011; Imielinski and Jr., 1984; Green and Tannen, 2006; Antova et al., 2007; Boulos et al., 2005; Agrawal et al., 2006) have focused on query processing over incomplete databases. These techniques commonly adopt the “possible worlds” semantics: the result of evaluating a query $Q$ over an incomplete database $\mathcal{D}$ is the set of relations resulting from evaluating $Q$ over each possible world individually using deterministic semantics.
(1)  $Q(\mathcal{D}) = \{\, Q(D) \mid D \in \mathcal{D} \,\}$
Example 4.


2.2. Certain and Best-Guess Answers
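An important goal of query processing over incomplete databases is to differentiate query results that are certain from ones that are merely possible. Formally, a tuple is certain if it appears in every possible world, and possible if it appears in at least one possible world (Imielinski and Jr., 1984; Lipski, 1979):
(2)  $\mathrm{certain}(\mathcal{D}) = \{\, t \mid \forall D \in \mathcal{D}: t \in D \,\}$
(3)  $\mathrm{possible}(\mathcal{D}) = \{\, t \mid \exists D \in \mathcal{D}: t \in D \,\}$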
In contrast to (Imielinski and Jr., 1984), which studies certain answers to queries, we define certainty at the instance level. These approaches are equivalent since we can compute the certain answers of query $Q$ over incomplete instance $\mathcal{D}$ as $\mathrm{certain}(Q(\mathcal{D}))$. Although computing certain answers is coNP-hard (Abiteboul et al., 1991) in general, there exist PTIME under-approximations (Libkin, 2016; Guagliardo and Libkin, 2016; Reiter, 1986).
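As an illustration of the instance-level definitions, the following minimal Python sketch evaluates a query in every possible world and then intersects (resp., unions) the results. The two worlds, the `states` query, and all function names are hypothetical and chosen only for this example; enumerating worlds explicitly is exactly the expensive step that the remainder of the paper avoids.

```python
from functools import reduce

def eval_query(query, worlds):
    """Possible-worlds semantics: evaluate the query in each world separately."""
    return [query(world) for world in worlds]

def certain(worlds):
    """Tuples that appear in every possible world."""
    return reduce(set.intersection, worlds)

def possible(worlds):
    """Tuples that appear in at least one possible world."""
    return reduce(set.union, worlds)

# Two hypothetical possible worlds of a relation with schema (id, state).
worlds = [
    {(1, "NY"), (2, "AZ")},
    {(1, "NY"), (2, "IN")},
]

# Hypothetical query: project onto the state attribute.
def states(world):
    return {(state,) for (_, state) in world}

results = eval_query(states, worlds)
certain_answers = certain(results)    # certain(Q(D))
possible_answers = possible(results)  # possible(Q(D))
print(certain_answers, possible_answers)
```

Only the state NY is returned in both worlds, so it is the only certain answer, while AZ and IN are merely possible.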
Best-Guess Query Processing. As mentioned in the introduction, another approach commonly used in practice is to select one possible world. Queries are evaluated solely in this world, and ambiguity is ignored or documented outside of the database. We refer to this approach as best-guess query processing (BGQP) (Yang et al., 2015) since typically one would like to select the possible world that is deemed most likely.
2.3. $\mathcal{K}$-relations
Our generalization of incomplete databases is based on the $\mathcal{K}$-relation (Green et al., 2007) framework. In this framework, relations are annotated with elements from the domain $K$ of a (commutative) semiring $\mathcal{K} = (K, \oplus_{\mathcal{K}}, \otimes_{\mathcal{K}}, 0_{\mathcal{K}}, 1_{\mathcal{K}})$. A commutative semiring is a structure with commutative and associative addition ($\oplus_{\mathcal{K}}$) and product ($\otimes_{\mathcal{K}}$) operations where $\otimes_{\mathcal{K}}$ distributes over $\oplus_{\mathcal{K}}$. As before, $\mathbb{D}$ denotes a universal domain. An $n$-ary $\mathcal{K}$-relation is a function that maps tuples (elements from $\mathbb{D}^n$) to elements from $K$. Tuples that are not in the relation are annotated with $0_{\mathcal{K}}$. Only finitely many tuples may be mapped to an element other than $0_{\mathcal{K}}$ (i.e., relations must be finite). Since $\mathcal{K}$-relations are functions from tuples to annotations, it is customary to denote the annotation of a tuple $t$ in relation $R$ as $R(t)$. The specific information encoded by an annotation depends on the choice of semiring.
Encoding Sets and Bags. Green et al. (Green et al., 2007) demonstrated that bag and set relations can be encoded as commutative semirings: the natural numbers ($\mathbb{N}$) with addition and multiplication, $(\mathbb{N}, +, \times, 0, 1)$, annotates each tuple with its multiplicity; and boolean constants $\mathbb{B} = \{T, F\}$ with disjunction and conjunction, $(\mathbb{B}, \vee, \wedge, F, T)$, annotates each tuple with its set membership. Abusing notation, we denote by $\mathbb{N}$ and $\mathbb{B}$ both the domain and the corresponding semiring.



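Query Semantics. Operators of the positive relational algebra ($\mathcal{RA}^{+}$) over $\mathcal{K}$-relations are defined by combining input annotations using operations $\oplus_{\mathcal{K}}$ and $\otimes_{\mathcal{K}}$.
Union:  $(R_1 \cup R_2)(t) = R_1(t) \oplus_{\mathcal{K}} R_2(t)$
Join:  $(R_1 \bowtie R_2)(t) = R_1(t[\mathbf{R}_1]) \otimes_{\mathcal{K}} R_2(t[\mathbf{R}_2])$
Projection:  $(\pi_U(R))(t) = \bigoplus_{t' : t'[U] = t} R(t')$
Selection:  $(\sigma_\theta(R))(t) = R(t) \otimes_{\mathcal{K}} \theta(t)$
For simplicity we assume in the definition above that tuples are of a compatible schema (e.g., $\mathbf{R}_1 = \mathbf{R}_2$ for a union $R_1 \cup R_2$). Here $t[\mathbf{R}]$ and $t[U]$ denote the restriction of tuple $t$ to the attributes of $\mathbf{R}$ and $U$, respectively, and the sum in the projection is taken with respect to $\oplus_{\mathcal{K}}$. We use $\theta(t)$ to denote a function that returns $1_{\mathcal{K}}$ iff $\theta$ evaluates to true over tuple $t$ and $0_{\mathcal{K}}$ otherwise.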
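The operator definitions can be made concrete with a small Python sketch. It is a simplified illustration under stated assumptions, not an implementation from this paper: relations are dictionaries from tuples to non-zero annotations, attributes are addressed by position, the join is an equi-join that concatenates tuples (rather than a natural join), and the `addr`/`loc` data as well as all function names are hypothetical.

```python
from collections import namedtuple

# A semiring is given by its zero, its one, and its two operations.
Semiring = namedtuple("Semiring", "zero one plus times")
NAT = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)               # bags
BOOL = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)    # sets

def union(K, r, s):
    """(R1 u R2)(t) = R1(t) + R2(t)."""
    out = dict(r)
    for t, k in s.items():
        out[t] = K.plus(out.get(t, K.zero), k)
    return out

def select(K, r, pred):
    """(sigma(R))(t) = R(t) * theta(t), with theta(t) in {zero, one}."""
    out = {}
    for t, k in r.items():
        a = K.times(k, K.one if pred(t) else K.zero)
        if a != K.zero:
            out[t] = a
    return out

def project(K, r, cols):
    """Sum the annotations of all tuples projected onto the same result."""
    out = {}
    for t, k in r.items():
        u = tuple(t[c] for c in cols)
        out[u] = K.plus(out.get(u, K.zero), k)
    return out

def join(K, r, s, on):
    """Equi-join on index pairs (i, j); annotations are multiplied."""
    out = {}
    for t1, k1 in r.items():
        for t2, k2 in s.items():
            if all(t1[i] == t2[j] for i, j in on):
                k = K.times(k1, k2)
                if k != K.zero:
                    out[t1 + t2] = K.plus(out.get(t1 + t2, K.zero), k)
    return out

# Hypothetical bag relations: ADDR(id, locale) and LOC(locale, state).
addr = {(1, "Lasalle"): 1, (2, "Lasalle"): 1, (3, "Tucson"): 1}
loc = {("Lasalle", "NY"): 1, ("Tucson", "AZ"): 1}

joined = join(NAT, addr, loc, on=[(1, 0)])
per_state = project(NAT, joined, cols=[3])
print(per_state)
```

Swapping `NAT` for `BOOL` (and using boolean annotations in the inputs) evaluates the same operators under set semantics, which is the point of parameterizing the algebra by a semiring.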
Example 5.
Figure 7 shows a bag semantics database encoded as an $\mathbb{N}$-database, with each tuple annotated with its multiplicity (the number of copies of the tuple in the relation). Annotations appear beside each tuple. Query $Q$, below, computes states.
Every input tuple appears once (is annotated with $1$). The output tuple annotation is computed by multiplying annotations of joined tuples, and summing annotations projected onto the same result tuple. For instance, 2 NY addresses are returned.
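In the following, we will make use of homomorphisms. A mapping $h: \mathcal{K} \to \mathcal{K}'$ from a semiring $\mathcal{K}$ to a semiring $\mathcal{K}'$ is called a homomorphism if it maps $0_{\mathcal{K}}$ and $1_{\mathcal{K}}$ to their counterparts in $\mathcal{K}'$ and distributes over sum and product (e.g., $h(k \oplus_{\mathcal{K}} k') = h(k) \oplus_{\mathcal{K}'} h(k')$). As observed by Green et al. (Green et al., 2007), any semiring homomorphism $h: \mathcal{K} \to \mathcal{K}'$ can be lifted to a homomorphism from $\mathcal{K}$-relations to $\mathcal{K}'$-relations by applying $h$ to the annotation of every tuple $t$: $h(R)(t) = h(R(t))$. Importantly, queries commute with semiring homomorphisms. That is, given a homomorphism $h$, query $Q$, and $\mathcal{K}$-database $D$, we have $h(Q(D)) = Q(h(D))$. We will abuse syntax and use the same function symbols (e.g., $h$) to denote mappings between semirings, $\mathcal{K}$-relations, as well as $\mathcal{K}$-databases.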
Example 6.
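Continuing Example 5, we can derive a set instance through a mapping $h: \mathbb{N} \to \mathbb{B}$ defined as $h(k) = T$ if $k > 0$ and $h(k) = F$ otherwise. $h$ is a semiring homomorphism, so evaluating $Q$ in $\mathbb{N}$ first and then applying $h$ (i.e., $h(Q(D))$) is equivalent to applying $h$ first and then evaluating $Q$ (i.e., $Q(h(D))$).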
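The commutation property can be checked on a small instance. The sketch below is a hedged illustration: the relation contents and the helper names (`project`, `lift`, `h`) are assumptions made for this example, and only projection is exercised, since it is the operator that actually sums annotations.

```python
def project(plus, zero, rel, cols):
    """Projection over an annotated relation (dict: tuple -> annotation)."""
    out = {}
    for t, k in rel.items():
        u = tuple(t[c] for c in cols)
        out[u] = plus(out.get(u, zero), k)
    return out

def h(k):
    """Maps a multiplicity to set membership: True iff k > 0."""
    return k > 0

def lift(h, rel):
    """Apply h to every annotation; tuples mapped to False (zero) are dropped."""
    return {t: h(k) for t, k in rel.items() if h(k)}

# Hypothetical bag relation LOC(locale, state) with multiplicities.
loc = {("Lasalle", "NY"): 3, ("Tucson", "AZ"): 2, ("Albany", "NY"): 1}

def nat_plus(a, b):
    return a + b

def bool_plus(a, b):
    return a or b

# h(Q(D)): evaluate under bag semantics, then map to set semantics.
bag_then_set = lift(h, project(nat_plus, 0, loc, cols=[1]))
# Q(h(D)): map to set semantics first, then evaluate.
set_then_bag = project(bool_plus, False, lift(h, loc), cols=[1])

print(bag_then_set == set_then_bag)
```

Both evaluation orders yield the same set relation containing the states NY and AZ.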
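When defining bounds for annotations in Section 3, we make use of the so-called natural order $\preceq_{\mathcal{K}}$ for a semiring $\mathcal{K}$, defined as an element $k$ preceding $k'$ if it is possible to obtain $k'$ by adding to $k$. Semirings for which the natural order is a partial order are called naturally ordered (Geerts and Poggi, 2010).
(4)  $\forall k, k' \in K: \; k \preceq_{\mathcal{K}} k' \;\Leftrightarrow\; \exists k'' \in K: k \oplus_{\mathcal{K}} k'' = k'$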
3. Incomplete $\mathcal{K}$-relations
Many incomplete data models either do not support bag semantics, or distinguish it from set semantics. Our first contribution unifies both under a joint framework. Recall that an incomplete database is a set of deterministic databases (possible worlds). We now generalize this idea to $\mathcal{K}$-databases.
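Definition 1 (Incomplete $\mathcal{K}$-database).
Let $\mathcal{K}$ be a semiring. An incomplete $\mathcal{K}$-database $\mathcal{D}$ is a set of possible worlds $\mathcal{D} = \{D_1, \ldots, D_n\}$ where each $D_i$ is a $\mathcal{K}$-database.
Like classical incomplete databases, queries over an incomplete $\mathcal{K}$-database use possible world semantics, i.e., the result of evaluating a query $Q$ over an incomplete $\mathcal{K}$-database $\mathcal{D}$ is the set of all possible worlds derived by evaluating $Q$ over every possible world $D \in \mathcal{D}$ (i.e., $Q(\mathcal{D}) = \{\, Q(D) \mid D \in \mathcal{D} \,\}$).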
3.1. Certain Annotations
While possible worlds semantics are directly compatible with incomplete $\mathcal{K}$-databases, the same does not hold for the concepts of certain and possible tuples, as we will show in the following. First, we have to define precisely what we mean by certain answers over possible worlds that are $\mathcal{K}$-databases.
Example 7.
Consider an $\mathbb{N}$-database $\mathcal{D}$ (bag semantics) containing a relation LOC with two attributes locale and state. Assume that $\mathcal{D}$ consists of the two possible worlds $D_1$ and $D_2$ below:

LOC in $D_1$
locale  state  $\mathbb{N}$
Lasalle  NY  3
Tucson  AZ  2

LOC in $D_2$
locale  state  $\mathbb{N}$
Lasalle  NY  2
Tucson  AZ  1
Greenville  IN  5
Using semiring $\mathbb{N}$, each tuple in a possible world is annotated with its multiplicity (the number of copies of the tuple that exist in the possible world). Arguably, tuples (Lasalle, NY) and (Tucson, AZ) are certain since they appear (multiplicity higher than $0$) in both possible worlds, while (Greenville, IN) is not since it is not present (its multiplicity is zero) in possible world $D_1$. (All tuples not shown in the tables are assumed to be annotated with zero.) However, the boolean interpretation of certainty of incomplete databases is not suited to $\mathbb{N}$-relations (or $\mathcal{K}$-relations in general) because it ignores the annotations of tuples. In this particular example, tuple (Lasalle, NY) appears with multiplicity $3$ in possible world $D_1$ and multiplicity $2$ in possible world $D_2$. We can state with certainty that in every possible world this tuple appears at least twice. Thus, $2$ is a lower bound (the greatest lower bound) for the annotation of (Lasalle, NY). Following this logic, we will define certainty through greatest lower bounds (GLBs) on tuple annotations.
To further justify defining certain answers as lower bounds on annotations, consider classical incomplete databases, which apply set semantics. Under set semantics, a tuple is certain if it appears in all possible worlds and possible if it appears in at least one possible world. Like the bag semantics example above, certainty (resp., possibility) is a lower (resp., upper) bound on a tuple’s annotation across all worlds. Consider the order $F < T$ (false precedes true). If a tuple exists in every possible world (is always annotated true), then intuitively, the GLB of its annotation across all worlds is true. Otherwise, the tuple is not certain (is annotated false in at least one world), and the GLB is $F$.
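To define a sensible lower bound for annotations, we need an order relation for semiring elements. We use the natural order $\preceq_{\mathcal{K}}$ as introduced in Section 2.3 to define the GLB and LUB of a set of $\mathcal{K}$-elements. For a well-defined GLB, we require that $\preceq_{\mathcal{K}}$ forms a lattice over $K$, a property that makes $\mathcal{K}$ an l-semiring (Kostylev and Buneman, 2012). A lattice over a set $S$ with a partial order $\leq$ is a structure $(S, \sqcap, \sqcup)$ where $\sqcap$ (the greatest lower bound) and $\sqcup$ (the least upper bound) are operations over $S$ defined for all $a, b \in S$ as:
$a \sqcap b = \max_{\leq} \{\, c \in S \mid c \leq a \wedge c \leq b \,\}$
The least upper bound is defined symmetrically.
In a lattice, $\sqcap$ and $\sqcup$ are associative, commutative, and fulfill the absorption laws $a \sqcup (a \sqcap b) = a$ and $a \sqcap (a \sqcup b) = a$.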
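We will use $\sqcap_{\mathcal{K}}$ and $\sqcup_{\mathcal{K}}$ to denote the $\sqcap$ and $\sqcup$ operations of the lattice over $\preceq_{\mathcal{K}}$ for a semiring $\mathcal{K}$. Abusing notation, we will apply the $\sqcap_{\mathcal{K}}$ and $\sqcup_{\mathcal{K}}$ operations to sets of elements from $K$ with the understanding that they will be applied iteratively to the elements in the set in some order, e.g., $\sqcap_{\mathcal{K}} \{k_1, k_2, k_3\} = (k_1 \sqcap_{\mathcal{K}} k_2) \sqcap_{\mathcal{K}} k_3$. This is well-defined for l-semirings, since in a lattice any finite, non-empty set of elements has a unique greatest lower bound and least upper bound based on the associativity and commutativity laws of lattices. That is, no matter in which order we apply $\sqcap_{\mathcal{K}}$ to the elements of a set, the result will be the same. From here on, we will limit our discussion to l-semirings. Many semirings, including the set semiring $\mathbb{B}$ and the bag semiring $\mathbb{N}$, are l-semirings. The natural order of $\mathbb{B}$ is $F \preceq_{\mathbb{B}} T$, with $k_1 \sqcup_{\mathbb{B}} k_2 = k_1 \vee k_2$ and $k_1 \sqcap_{\mathbb{B}} k_2 = k_1 \wedge k_2$. The natural order of $\mathbb{N}$ is the standard order $\leq$ of natural numbers, with $k_1 \sqcup_{\mathbb{N}} k_2 = \max(k_1, k_2)$ and $k_1 \sqcap_{\mathbb{N}} k_2 = \min(k_1, k_2)$.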
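Based on $\sqcap_{\mathcal{K}}$ and $\sqcup_{\mathcal{K}}$, we define the certain and possible annotation of a tuple $t$ in an incomplete $\mathcal{K}$-database $\mathcal{D}$ by gathering the annotations of tuple $t$ from all possible worlds of $\mathcal{D}$ and then applying $\sqcap_{\mathcal{K}}$ to compute the greatest lower bound (resp., $\sqcup_{\mathcal{K}}$ to compute the least upper bound):
$\mathrm{cert}_{\mathcal{K}}(\mathcal{D}, t) = \sqcap_{\mathcal{K}} (\{\, D(t) \mid D \in \mathcal{D} \,\})$
$\mathrm{poss}_{\mathcal{K}}(\mathcal{D}, t) = \sqcup_{\mathcal{K}} (\{\, D(t) \mid D \in \mathcal{D} \,\})$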
Importantly, the GLB coincides with the standard definition of certain answers for set semantics ($\mathbb{B}$): $\mathrm{cert}_{\mathbb{B}}$ returns true only when the tuple is present in all worlds. We also note that $\mathrm{cert}_{\mathbb{N}}$ is analogous to the definition of certain answers for bag semantics from (Guagliardo and Libkin, 2016). For instance, consider the certain annotation of the first tuple from Example 7. The tuple’s certain multiplicity is $\mathrm{cert}_{\mathbb{N}}(\mathcal{D}, (\text{Lasalle}, \text{NY})) = \min(3, 2) = 2$. Similarly, for the third tuple, $\mathrm{cert}_{\mathbb{N}}(\mathcal{D}, (\text{Greenville}, \text{IN})) = \min(0, 5) = 0$. Reinterpreted under set semantics, all tuples that exist (multiplicity $> 0$) are annotated $T$ and all others $F$. For the first tuple we get $T \wedge T = T$ (certain). For the third tuple we get $F \wedge T = F$ (not certain).
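The computation for Example 7 can be replayed with a short Python sketch. The data are the two LOC worlds of Example 7; the function names `cert` and `poss` and the dictionary encoding of worlds are assumptions made for illustration. For $\mathbb{N}$ the GLB and LUB are `min` and `max`, and for $\mathbb{B}$ they are conjunction and disjunction.

```python
from functools import reduce

def cert(glb, zero, worlds, t):
    """Greatest lower bound of t's annotation across all worlds."""
    return reduce(glb, (w.get(t, zero) for w in worlds))

def poss(lub, zero, worlds, t):
    """Least upper bound of t's annotation across all worlds."""
    return reduce(lub, (w.get(t, zero) for w in worlds))

# The two possible worlds of LOC from Example 7 (tuple -> multiplicity).
d1 = {("Lasalle", "NY"): 3, ("Tucson", "AZ"): 2}
d2 = {("Lasalle", "NY"): 2, ("Tucson", "AZ"): 1, ("Greenville", "IN"): 5}
worlds = [d1, d2]

lasalle = cert(min, 0, worlds, ("Lasalle", "NY"))        # min(3, 2)
greenville = cert(min, 0, worlds, ("Greenville", "IN"))  # min(0, 5)
greenville_poss = poss(max, 0, worlds, ("Greenville", "IN"))

# Set semantics: reinterpret multiplicities as membership, GLB is conjunction.
set_worlds = [{t: k > 0 for t, k in w.items()} for w in worlds]

def conj(a, b):
    return a and b

lasalle_set = cert(conj, False, set_worlds, ("Lasalle", "NY"))
greenville_set = cert(conj, False, set_worlds, ("Greenville", "IN"))
print(lasalle, greenville, lasalle_set, greenville_set)
```

A tuple missing from a world is treated as annotated with the semiring's zero, which is why (Greenville, IN) has certain multiplicity 0 even though it occurs five times in the second world.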
3.2. $\mathcal{K}_W$-relations
For the formal exposition in the remainder of this work it will be useful to define an alternative, but equivalent, encoding of an incomplete $\mathcal{K}$-database as a single database using a special class of semirings whose elements encode the annotation of a tuple across a set of possible worlds. This encoding is a technical device that allows us to adopt results from the theory of $\mathcal{K}$-relations directly to our problem. We assume a fixed set of possible world identifiers $W = \{1, \ldots, n\}$ for some number of possible worlds $n$. Given the domain $K$ of a semiring $\mathcal{K}$, we write $K^W$ to denote the set of elements from the $n$-way cross-product of $K$. We annotate tuples $t$ with elements of $K^W$ to store the annotations of $t$ in each possible world. We use $\vec{k}$, $\vec{k}_1$, … to denote elements from $K^W$ to make explicit that they are vectors.
Definition 2 (Possible World Semiring).
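Let $\mathcal{K}$ be an l-semiring. We define the possible world semiring $\mathcal{K}_W = (K^W, \oplus_{\mathcal{K}_W}, \otimes_{\mathcal{K}_W}, 0_{\mathcal{K}_W}, 1_{\mathcal{K}_W})$. The operations of this semiring are defined pointwise as follows, for all $i \in W$: $(\vec{k}_1 \oplus_{\mathcal{K}_W} \vec{k}_2)[i] = \vec{k}_1[i] \oplus_{\mathcal{K}} \vec{k}_2[i]$, $(\vec{k}_1 \otimes_{\mathcal{K}_W} \vec{k}_2)[i] = \vec{k}_1[i] \otimes_{\mathcal{K}} \vec{k}_2[i]$, $0_{\mathcal{K}_W}[i] = 0_{\mathcal{K}}$, and $1_{\mathcal{K}_W}[i] = 1_{\mathcal{K}}$.
Thus, a $\mathcal{K}_W$-database is simply a pivoted representation of an incomplete $\mathcal{K}$-database.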
Example 8.
Reconsider the incomplete $\mathbb{N}$-relation from Example 7. The encoding of this database as an $\mathbb{N}_W$-relation (with $W = \{1, 2\}$) is:
locale  state  $\mathbb{N}^W$
Lasalle  NY  [3,2] 
Tucson  AZ  [2,1] 
Greenville  IN  [0,5] 
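Translating between incomplete $\mathcal{K}$-databases and $\mathcal{K}_W$-databases is trivial. Given an incomplete $\mathcal{K}$-database $\mathcal{D}$ with possible worlds $D_1, \ldots, D_n$, we create the corresponding $\mathcal{K}_W$-database by annotating each tuple $t$ with the vector $[D_1(t), \ldots, D_n(t)]$. In the other direction, given a $\mathcal{K}_W$-database with vectors of length $n$, we construct the corresponding incomplete $\mathcal{K}$-database by annotating each tuple $t$ with $\vec{k}[i]$ in possible world $D_i$, where $\vec{k}$ is the annotation of $t$. In addition, we will show below that queries over $\mathcal{K}_W$-databases encode possible world semantics. Thus, the following result holds and we can use incomplete $\mathcal{K}$-databases and $\mathcal{K}_W$-databases interchangeably.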
Proposition 1.
Incomplete $\mathcal{K}$-databases and $\mathcal{K}_W$-databases are isomorphic wrt. possible worlds semantics for $\mathcal{RA}^{+}$ queries.
Observe that $\mathcal{K}_W$ is a semiring, since we define $\mathcal{K}_W$ using the $|W|$-way version of the product operation of universal algebra, and products of semirings are also semirings (Burris and Sankappanavar, 2012).
Possible Worlds. We can extract the $\mathcal{K}$-database for a possible world (e.g., the best-guess world) from a $\mathcal{K}_W$-database by projecting on one dimension of its annotations. This can be modeled as a mapping $pw_i: K^W \to K$ where $i \in W$:
(5)  $pw_i(\vec{k}) = \vec{k}[i]$
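The translation between the two encodings is mechanical, as the following Python sketch suggests. It uses the LOC worlds of Example 7; the function names `pivot` and `pw`, the list encoding of vectors, and the 0-based world indices are assumptions of this illustration (the text numbers worlds from 1), and the zero element is fixed to 0 because the example uses $\mathbb{N}$.

```python
def pivot(worlds):
    """Encode a list of worlds (dict: tuple -> multiplicity) as one relation
    whose annotations are vectors with one entry per world."""
    tuples = set().union(*worlds)
    return {t: [w.get(t, 0) for w in worlds] for t in tuples}

def pw(kw_rel, i):
    """Extract world i by projecting every annotation vector on position i."""
    return {t: vec[i] for t, vec in kw_rel.items() if vec[i] != 0}

# The two possible worlds of LOC from Example 7.
d1 = {("Lasalle", "NY"): 3, ("Tucson", "AZ"): 2}
d2 = {("Lasalle", "NY"): 2, ("Tucson", "AZ"): 1, ("Greenville", "IN"): 5}

kw = pivot([d1, d2])
print(kw[("Greenville", "IN")])          # annotation vector of the third tuple
print(pw(kw, 0) == d1, pw(kw, 1) == d2)  # round trip recovers both worlds
```

Tuples whose entry for a world is zero are dropped when that world is extracted, matching the convention that absent tuples are annotated with zero.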
Recall that under possible world semantics, the result of a query $Q$ is the set of worlds computed by evaluating $Q$ over each world of the input. As a sanity check, we would like to ensure that query processing over $\mathcal{K}_W$-relations matches this definition. We can state possible world semantics equivalently as follows: the content of a possible world in the query result ($pw_i(Q(\mathcal{D}))$) is the result of evaluating query $Q$ over this possible world in the input ($Q(pw_i(\mathcal{D}))$). That is, $\mathcal{K}_W$-relations have possible worlds semantics iff $pw_i$ commutes with queries: $pw_i(Q(\mathcal{D})) = Q(pw_i(\mathcal{D}))$.
Recall from Section 2.3 that a mapping between semirings commutes with queries iff it is a semiring homomorphism. Note that $\mathcal{K}_W$-relations admit a trivial extension to probabilistic data by defining a distribution $P: W \to [0, 1]$. See ([Anonymized], [n. d.]) for details.
Lemma 1.
For any semiring $\mathcal{K}$ and possible world $i \in W$, mapping $pw_i$ is a semiring homomorphism.
Proof.
See Appendix A ∎
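To see the commutation property at work, the following Python sketch evaluates a projection once over vector annotations (using pointwise addition) and once per world, and compares the results. It is an illustration under assumptions: the first three tuples are the vectors of Example 8, the (Albany, NY) tuple is hypothetical and added so that the projection merges tuples, world indices are 0-based, and all function names are invented for this example.

```python
def project_kw(kw_rel, cols):
    """Projection over vector annotations: vectors are added pointwise."""
    out = {}
    for t, vec in kw_rel.items():
        u = tuple(t[c] for c in cols)
        old = out.get(u)
        out[u] = list(vec) if old is None else [a + b for a, b in zip(old, vec)]
    return out

def project_n(rel, cols):
    """Projection over a single bag world (dict: tuple -> multiplicity)."""
    out = {}
    for t, k in rel.items():
        u = tuple(t[c] for c in cols)
        out[u] = out.get(u, 0) + k
    return out

def pw(kw_rel, i):
    """Extract world i; tuples with multiplicity zero are dropped."""
    return {t: vec[i] for t, vec in kw_rel.items() if vec[i] != 0}

kw = {
    ("Lasalle", "NY"): [3, 2],
    ("Tucson", "AZ"): [2, 1],
    ("Greenville", "IN"): [0, 5],
    ("Albany", "NY"): [1, 0],   # hypothetical extra tuple
}

for i in range(2):
    # pw_i(Q(D)) on the left, Q(pw_i(D)) on the right.
    assert pw(project_kw(kw, [1]), i) == project_n(pw(kw, i), [1])

print(project_kw(kw, [1]))
```

A single pass over the vector-annotated relation therefore answers the query in all worlds at once, which is what makes the encoding convenient as a formal device.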
Probabilistic Data. $\mathcal{K}_W$-relations admit a trivial extension to probabilistic data by defining a distribution $P: W \to [0, 1]$ such that $\sum_{i \in W} P(i) = 1$. In contrast to classical frameworks for possible worlds, where the collection of worlds is a set, queries preserve the same possible worlds (although it has no impact on our results, it is worth noting that the worlds in a query result may not be distinct). Hence, the input distribution applies, unchanged, to the possible query outputs.
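Certain and Possible Annotations. Since the annotation of a tuple $t$ in a $\mathcal{K}_W$-database is a vector recording $t$’s annotations in all worlds, certain annotations for incomplete databases are computed by applying $\sqcap_{\mathcal{K}}$ to the set of annotations contained in the vector. Thus, the certain annotation of a tuple $t$ from a $\mathcal{K}_W$-DB $\mathcal{D}$ is computed as: $\mathrm{cert}_{\mathcal{K}}(\mathcal{D}, t) = \sqcap_{\mathcal{K}} (\{\, \mathcal{D}(t)[i] \mid i \in W \,\})$. The possible annotation is obtained symmetrically using $\sqcup_{\mathcal{K}}$.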
4. Labeling Schemes
We define efficient (PTIME) labeling schemes for three existing incomplete data models: tuple-independent databases (Suciu et al., 2011), the disjoint-independent x-relation model from (Agrawal et al., 2006), and C-tables (Imielinski and Jr., 1984). We also show how to extract a best-guess world from a database derived from these models. Since computing certain answers is hard in general, our PTIME labeling schemes cannot be c-correct for all models.
4.1. Labeling Schemes
Tuple-Independent Databases. A tuple-independent database (TI-DB) is a database where each tuple is marked as optional or not. The incomplete database represented by a TI-DB is the set of instances that include all non-optional tuples and some subset of the optional tuples. That is, the existence of a tuple $t$ is independent of the existence of any other tuple $t'$. In the probabilistic version of TI-DBs each tuple $t$ is associated with its marginal probability $P(t)$. The probability of a possible world is then the product of $P(t)$ for all tuples $t$ included in the world multiplied by the product of $1 - P(t)$ for all tuples $t$ that are not part of the possible world. We define a labeling function for TI-DBs that returns a labeling that annotates a tuple as certain iff it is not optional. For probabilistic TI-DBs we label tuples as certain if their marginal probability is 1.
Theorem 1 (The TI-DB labeling scheme is c-correct).
Given a TI-DB, the labeling produced by the TI-DB labeling scheme is c-correct.
Proof.
Trivially holds. A tuple of an incomplete (probabilistic) TI-DB is certain iff it is not optional (iff its marginal probability is 1). ∎
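The labeling rule and the world probability for probabilistic TI-DBs can be sketched as follows. This is a minimal illustration, assuming a TI-DB is represented as a dictionary from tuples to marginal probabilities; the relation contents and the function names are hypothetical.

```python
from math import prod

# Hypothetical probabilistic TI-DB: tuple -> marginal probability.
tidb = {("alice", 30): 1.0, ("bob", 25): 0.7, ("carol", 41): 0.2}

def label_tidb(db):
    """Label a tuple as certain (True) iff its marginal probability is 1."""
    return {t: p == 1.0 for t, p in db.items()}

def world_probability(db, world):
    """Multiply p for included tuples and (1 - p) for excluded tuples."""
    return prod(p if t in world else 1.0 - p for t, p in db.items())

labels = label_tidb(tidb)
# Probability of the world containing alice and bob but not carol.
p_world = world_probability(tidb, {("alice", 30), ("bob", 25)})
```

Only the tuple with probability 1 is labeled certain, and the example world has probability 1.0 * 0.7 * 0.8.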
C-tables. C-tables (Imielinski and Jr., 1984) use a set of variable symbols to define possible worlds. Tuples are annotated by a boolean expression over comparisons of variables and constants, called the local condition. Each variable assignment satisfying a boolean expression called the global condition defines a possible world, derived by retaining only tuples with local conditions satisfied under this assignment. Computing certain answers for first order queries is coNP-complete (Vardi, 1986; Abiteboul et al., 1991) even for Codd-tables. Since the result of any first order query over a Codd-table can be represented as a C-table and evaluating a query in this fashion is efficient, it follows that determining whether a tuple is certain in a C-table cannot be in PTIME (unless P = NP). Instead, consider the following sufficient, but not necessary, condition for a tuple to be certain: if (1) a tuple in a C-table contains only constants and (2) its local condition is a tautology, then the tuple is certain. To see why this is the case, recall that under the closed-world assumption, a C-table represents a set of possible worlds, one for each valuation of the variables appearing in the C-table (to constants from the domain). A tuple is part of a possible world corresponding to such a valuation if the tuple's local condition is satisfied under the valuation. Thus, a tuple consisting of constants only, with a local condition that is a tautology, is part of every possible world represented by the C-table. If the local condition of a tuple is in conjunctive normal form (CNF) then checking whether it is a tautology is efficient (PTIME). Our labeling scheme for C-tables applies this sufficient condition and, thus, is c-sound. Formally, for a C-table and any tuple, the scheme labels the tuple as certain iff the tuple consists of constants only and its local condition is a tautology in CNF; all other tuples are labeled as uncertain.
Green et al. (Green and Tannen, 2006) introduced PC-tables, a probabilistic version of C-tables where each variable is associated with a probability distribution over its possible values. Variables are considered independent of each other, i.e., the probability of a possible world is computed as the product of the probabilities of the individual variable assignments based on which the world was created. Our labeling scheme works for both the incomplete and the probabilistic version of C-tables.
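Theorem 2 (The C-table labeling scheme is c-sound).
Given an incomplete database encoded as C-tables, the labeling produced by the C-table labeling scheme is c-sound.
Note that this labeling scheme is not guaranteed to be c-correct. For instance, a tuple consisting only of constants whose local condition is a tautology is guaranteed to be certain, but the scheme labels it as uncertain if the local condition is not in CNF.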
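Example 9.
Consider a C-table consisting of two tuples: the first contains a variable and has a local condition requiring the variable to equal a particular constant, and the second is the ground tuple that the first tuple evaluates to under this assignment, with the negation of that condition as its local condition. The labeling scheme would mark the ground tuple as uncertain, because even though this tuple exists in the C-table and its local condition is in CNF, the local condition is not a tautology. However, the ground tuple is certain, since either the condition holds and the first tuple evaluates to the ground tuple, or the condition does not hold and the second tuple is included in the possible world.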
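The sufficient condition used by the C-table labeling scheme can be sketched as follows. This is a simplified illustration under stated assumptions: local conditions are given in CNF as a list of clauses, each clause is a list of `(atom, is_positive)` literals, and atoms are treated as opaque strings, so a clause is recognized as a tautology only when it contains an atom together with its negation. The `Var` class, the function names, and the two example tuples are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    """A variable symbol appearing as a value in a C-table tuple."""
    name: str

def clause_is_tautology(clause):
    """A disjunction is trivially true if it contains an atom and its negation."""
    positives = {atom for atom, positive in clause if positive}
    negatives = {atom for atom, positive in clause if not positive}
    return bool(positives & negatives)

def cnf_is_tautology(cnf):
    """A CNF formula is a tautology iff every clause is (the empty CNF is true)."""
    return all(clause_is_tautology(clause) for clause in cnf)

def label_ctable_tuple(values, local_condition_cnf):
    """Certain (True) only if the tuple is ground and its condition is a tautology."""
    ground = not any(isinstance(v, Var) for v in values)
    return ground and cnf_is_tautology(local_condition_cnf)

# Two tuples guarded by complementary conditions on the same variable.
t1 = ((1, Var("x")), [[("x = 1", True)]])   # (1, x) if x = 1
t2 = ((1, 1), [[("x = 1", False)]])         # (1, 1) if not (x = 1)

print(label_ctable_tuple(*t1))  # False: the tuple contains a variable
print(label_ctable_tuple(*t2))  # False: the condition is not a tautology
```

Both tuples are labeled uncertain although the ground tuple (1, 1) appears in every world, which illustrates why the scheme is c-sound but not c-correct.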
x-DBs. An x-DB (Agrawal et al., 2006) is a set of x-relations, which are sets of x-tuples. An x-tuple is a set of tuples (its alternatives) with a label indicating whether the x-tuple is optional. Each x-tuple is assumed to be independent of the others, and its alternatives are assumed to be disjoint. Thus, a possible world of an x-relation is constructed by selecting at most one alternative for every x-tuple if it is optional, or exactly one if it is not optional. The probabilistic version of x-DBs (also called a block-independent database or BI-DB) as introduced in (Agrawal et al., 2006) assigns each alternative a probability, and we require that the probabilities of the alternatives of an x-tuple sum to at most 1. Thus, an x-tuple is optional if the probabilities of its alternatives sum to less than 1, and there is no need to use labels to mark optional tuples. We use $|\tau|$ to denote the number of alternatives of x-tuple $\tau$. We define a labeling scheme for x-relations that annotates a tuple from an x-DB as certain if it is the single, non-optional alternative of an x-tuple, and as uncertain otherwise. In probabilistic x-DBs we check whether the probability of this single alternative is 1.
Theorem 3 (The x-DB labeling scheme is c-correct).
Given an x-DB, the labeling produced by the x-DB labeling scheme is c-correct.
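The labeling rule for probabilistic x-DBs can be sketched in a few lines. This is a minimal illustration, assuming an x-relation is a list of x-tuples and each x-tuple is a list of `(tuple, probability)` alternatives; the data and the function name are hypothetical.

```python
# Hypothetical x-relation: each x-tuple is a list of (tuple, probability) alternatives.
x_relation = [
    [(("sensor1", 20), 1.0)],                          # single, non-optional alternative
    [(("sensor2", 18), 0.6), (("sensor2", 19), 0.4)],  # two disjoint alternatives
    [(("sensor3", 25), 0.9)],                          # single but optional alternative
]

def label_xdb(relation):
    """Label a tuple certain iff it is the only alternative of a non-optional x-tuple."""
    labels = {}
    for alternatives in relation:
        certain = len(alternatives) == 1 and alternatives[0][1] == 1.0
        for tup, _ in alternatives:
            # A tuple stays certain if any x-tuple makes it certain.
            labels[tup] = labels.get(tup, False) or certain
    return labels

xdb_labels = label_xdb(x_relation)
```

Only the first x-tuple yields a certain tuple: the second has competing alternatives and the third may be absent with probability 0.1.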
4.2. Extracting best-guess worlds
Computing some possible world is trivial for most incomplete and probabilistic data models. However, for the case of probabilistic data models we are particularly interested in the highest-probability world (the best-guess world, BGW). We now discuss in more detail how we choose the BGW for the data models for which we have introduced labeling schemes above.
TI-DB. For a TI-DB, the best-guess world consists of all tuples whose marginal probability is at least 0.5. To understand why this is the case, recall that the probability of a world from a TI-DB is the product of the probabilities of included tuples with one minus the probability of excluded tuples. This probability is maximized by including only tuples whose probability is at least 0.5. For the incomplete version of TI-DBs we have to include all non-optional tuples and can choose arbitrarily which optional tuples to include in the BGW.
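A minimal sketch of this selection, assuming a probabilistic TI-DB is given as a dictionary from tuples to marginal probabilities (the data and the function name are hypothetical). Tuples with probability exactly 0.5 can be included or excluded without changing the probability of the resulting world; this sketch includes them.

```python
# Hypothetical probabilistic TI-DB: tuple -> marginal probability.
tidb_example = {("alice", 30): 1.0, ("bob", 25): 0.7, ("carol", 41): 0.2}

def best_guess_tidb(db, threshold=0.5):
    """Keep exactly the tuples that are at least as likely to exist as not."""
    return {t for t, p in db.items() if p >= threshold}

bgw_tidb = best_guess_tidb(tidb_example)
print(sorted(bgw_tidb))  # [('alice', 30), ('bob', 25)]
```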
PC-tables. For a PC-table, computing the most likely possible world reduces to answering a query over the database, which is known to be #P-hard in general (Suciu et al., 2011). Specific tables (e.g., those generated by “safe” queries (Suciu et al., 2011)) admit PTIME solutions. Alternatively, there exists a wide range of algorithms (Gatterbauer and Suciu, 2017; Fink et al., 2011, 2013; Li et al., 2011) that can be used to compute an arbitrarily close approximation of the most likely world.
Disjoint-independent databases. Since the x-tuples in an x-DB are independent of each other, the probability of a possible world from an x-DB is maximized by including, for every x-tuple, its alternative with the highest probability, or no alternative if the probability of not including any alternative for the x-tuple is higher than the highest probability of an alternative for the x-tuple.
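The per-x-tuple choice can be sketched as follows, assuming the same hypothetical representation of an x-relation as a list of x-tuples, each a list of `(tuple, probability)` alternatives. Ties between the best alternative and the "absent" option are resolved in favor of the alternative, which is an arbitrary choice made for this sketch.

```python
def best_guess_xdb(relation):
    """Pick, per x-tuple, the most likely alternative unless absence is more likely."""
    world = set()
    for alternatives in relation:
        best_tuple, best_p = max(alternatives, key=lambda alt: alt[1])
        p_absent = 1.0 - sum(p for _, p in alternatives)
        if best_p >= p_absent:
            world.add(best_tuple)
    return world

# Hypothetical x-relation with three x-tuples.
x_rel = [
    [(("a", 1), 0.6), (("a", 2), 0.4)],  # non-optional: keep the 0.6 alternative
    [(("b", 1), 0.3), (("b", 2), 0.2)],  # absent with probability 0.5: keep nothing
    [(("c", 1), 1.0)],                   # certain tuple
]

bgw_xdb = best_guess_xdb(x_rel)
print(sorted(bgw_xdb))  # [('a', 1), ('c', 1)]
```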
5. UA-Databases
We now introduce UA-DBs (uncertainty-annotated databases), which encode both under- and over-approximations of the certain annotations of an incomplete database $\mathcal{D}$. This is achieved by annotating every tuple with a pair $[c, d]$, where $d$ records the tuple's annotation in the BGW (a possible world of $\mathcal{D}$) and $c$ stores the under-approximation of the tuple's certain annotation (i.e., $c$ is no larger than the certain annotation with respect to the natural order). Both under- and over-approximations of certain annotations assign tuples annotations from $\mathcal{K}$, making them $\mathcal{K}$-databases. This will be important for proving that these bounds are preserved under queries. Every possible world is by definition a superset of the certain tuples, so a UA-DB contains all certain answers, even though the certainty of some answers may be underestimated. We start by formally defining the annotation domains of UA-DBs and mappings that extract the two components of an annotation. Afterwards, we state the main result of this section: queries over UA-DBs preserve the under- and over-approximation of certain annotations.
5.1. UA-semirings
We define a UA-semiring as the direct product of a semiring $\mathcal{K}$ with itself, i.e., $\mathcal{K}^2$ (see Section 5.1). In the following we will omit the underlying semiring from the notation if it is clear from the context. Recall that operations in a product semiring are defined pointwise, e.g., $[k_1, k_1'] \cdot [k_2, k_2'] = [k_1 \cdot k_2, k_1' \cdot k_2']$.
Definition 3 (UA-semiring).
Let $\mathcal{K}$ be a semiring. We define the corresponding UA-semiring as $\mathcal{K}_{UA} := \mathcal{K}^2$.
Note that for any semiring $\mathcal{K}$, $\mathcal{K}_{UA}$ is a semiring, because, as mentioned earlier, products of semirings are semirings.
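The product construction can be made concrete with a small sketch. It assumes a semiring is represented by its zero, its one, and its two binary operations; the `Semiring` class, the helper `product_with_itself`, and the instance names are hypothetical and only illustrate the pointwise definition.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    zero: Any
    one: Any
    add: Callable[[Any, Any], Any]
    mul: Callable[[Any, Any], Any]

def product_with_itself(k: Semiring) -> Semiring:
    """Build the direct product of k with itself; all operations act pointwise."""
    return Semiring(
        zero=(k.zero, k.zero),
        one=(k.one, k.one),
        add=lambda a, b: (k.add(a[0], b[0]), k.add(a[1], b[1])),
        mul=lambda a, b: (k.mul(a[0], b[0]), k.mul(a[1], b[1])),
    )

# Bag semantics: natural numbers with ordinary addition and multiplication.
NAT = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
NAT_UA = product_with_itself(NAT)

print(NAT_UA.add((1, 2), (0, 3)))  # (1, 5)
print(NAT_UA.mul((1, 2), (0, 3)))  # (0, 6)
```

Because both components are combined independently, the semiring laws of the product follow directly from those of the underlying semiring.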
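5.2. Creating UA-DBs
We now discuss how to derive UA-relations from an incomplete database or a compact encoding of such a database using some uncertain data model like C-tables. Consider an incomplete database $\mathcal{D}$, let $D$ be one of its worlds and $\mathcal{L}$ a $\mathcal{K}$-database under-approximating the certain annotations of $\mathcal{D}$. We refer to $\mathcal{L}$ as a labeling and will study such labelings in depth in Sections 6 and 7. We cover in Section 4 how to generate a UA-DB from common uncertain data models by extracting a (best-guess) world $D$ and a labeling $\mathcal{L}$. We construct a UA-DB $D_{UA}$ as an encoding of $D$ and $\mathcal{L}$ by setting, for every tuple $t$, $D_{UA}(t) := [\mathcal{L}(t), D(t)]$.
For a UA-DB $D_{UA}$ constructed in this fashion we say that $D_{UA}$ approximates $\mathcal{D}$ by encoding $(D, \mathcal{L})$. Given a UA-DB $D_{UA}$, we would like to be able to restore $D$ and $\mathcal{L}$ from $D_{UA}$. For that we define two morphisms from the UA-semiring to the underlying semiring: one projects an annotation pair $[c, d]$ onto its certain component $c$, and the other projects it onto its best-guess component $d$.
Note that by construction, if a UA-DB $D_{UA}$ is an encoding of a possible world $D$ and a labeling $\mathcal{L}$ of an incomplete database $\mathcal{D}$, then applying the first morphism to $D_{UA}$ yields $\mathcal{L}$ and applying the second yields $D$.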
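A minimal sketch of this encoding and of the two projections, under bag semantics. It assumes that relations are dictionaries from tuples to multiplicities, that tuples missing from a dictionary have annotation 0, and that the pair stores the certain component first and the best-guess component second. The names `encode_uadb`, `h_cert`, and `h_det` are hypothetical labels chosen for this illustration.

```python
def encode_uadb(world, labeling):
    """Annotate every tuple with the pair (label, annotation in the chosen world)."""
    tuples = set(world) | set(labeling)
    return {t: (labeling.get(t, 0), world.get(t, 0)) for t in tuples}

def h_cert(uadb):
    """Project onto the certain component, dropping zero-annotated tuples."""
    return {t: c for t, (c, d) in uadb.items() if c != 0}

def h_det(uadb):
    """Project onto the best-guess component, dropping zero-annotated tuples."""
    return {t: d for t, (c, d) in uadb.items() if d != 0}

# Hypothetical best-guess world and c-sound labeling.
world = {("a",): 2, ("b",): 1}
labeling = {("a",): 1}

uadb = encode_uadb(world, labeling)
print(uadb[("a",)], uadb[("b",)])  # (1, 2) (0, 1)
```

Applying `h_det` and `h_cert` to the encoded database returns the original world and labeling, which is the round trip described above.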
5.3. Querying UA-DBs
We now state the main result of this section: query evaluation over UA-DBs preserves the under-approximation and over-approximation of certain annotations. To prove the main result, we first show that the two projection morphisms are homomorphisms, because this implies that queries over UA-DBs are evaluated over the certain and the best-guess component of an annotation independently. Thus, we can prove the result for under- and over-approximations separately. For over-approximation we can trivially show an even better result: by definition (Section 3.2) the possible world used as an over-approximation is preserved exactly. Hence, the over-approximation property is preserved and UA-DBs are also backwards compatible with BGQP. For under-approximations we have to show that query evaluation preserves under-approximations. This part is more involved and we will prove this result in Section 7.
Theorem 4 (Queries Preserve Bounds).
Let $\mathcal{D}$ be an incomplete database, $\mathcal{L}$ a labeling for $\mathcal{D}$, $D$ one of its possible worlds, and $D_{UA}$ the UA-DB encoding the pair $(D, \mathcal{L})$. Clearly, $D_{UA}$ approximates $\mathcal{D}$. Then, for a query $Q$, $Q(D_{UA})$ is an approximation for $Q(\mathcal{D})$ encoding the pair $(Q(D), Q(\mathcal{L}))$.
Proof.
See Appendix A ∎
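To illustrate how a query is evaluated over pair annotations, the following sketch implements a join and a projection under bag semantics. It assumes relations are dictionaries from tuples to `(certain, best_guess)` pairs of natural numbers; joins multiply annotations and projections add them, in both cases componentwise. The function names and the input relations are hypothetical.

```python
from collections import defaultdict

def join(r, s, r_idx, s_idx):
    """Equi-join on r[r_idx] = s[s_idx]; annotations are multiplied pointwise."""
    out = {}
    for t, (c1, d1) in r.items():
        for u, (c2, d2) in s.items():
            if t[r_idx] == u[s_idx]:
                out[t + u] = (c1 * c2, d1 * d2)
    return out

def project(rel, idxs):
    """Project onto the given positions; annotations of merged tuples are added."""
    out = defaultdict(lambda: (0, 0))
    for t, (c, d) in rel.items():
        key = tuple(t[i] for i in idxs)
        oc, od = out[key]
        out[key] = (oc + c, od + d)
    return dict(out)

# Hypothetical UA-relations: ("a", 1) is certain, ("b", 1) is only in the best guess.
R = {("a", 1): (1, 1), ("b", 1): (0, 2)}
S = {(1, "x"): (1, 1)}

result = project(join(R, S, 1, 0), [3])
print(result)  # {('x',): (1, 3)}
```

The result tuple has multiplicity 3 in the best-guess world and is reported as certain with multiplicity at least 1, the same values obtained by running the query on each component separately.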
6. Uncertainty Labelings
We now define uncertainty labelings, which are $\mathcal{K}$-databases whose annotations over- or under-approximate the certain annotations of tuples in an incomplete database with respect to the natural order of semiring $\mathcal{K}$. A labeling scheme is a mapping from incomplete databases to labelings.
Definition 4 (Uncertainty Labeling Scheme).
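Consider the set of all $\mathcal{K}$-databases, an incomplete/probabilistic data model, and the set of all possible instances of this model. An uncertainty labeling scheme is a function that maps every instance of the model to a $\mathcal{K}$-database (its labeling) such that the labeling has the same schema as the instance.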
Ideally, we would like the label (annotation) of a tuple from an uncertainty labeling to be exactly the tuple's certain annotation. Observe that an exact labeling can always be computed in time polynomial in the number of possible worlds if all worlds of the incomplete database can be enumerated. However, the number of possible worlds is frequently exponential in the data size. Thus, most incomplete data models rely on factorized encodings, with size typically logarithmic in the number of worlds. Ideally, we would like labeling schemes to be PTIME in the size of the encoding (rather than in the number of worlds). As mentioned in the introduction, computing certain answers is coNP-complete, so for tractable query semantics we must accept that a labeling may either over- or under-approximate the certain annotation (with respect to the natural order). For instance, under bag semantics (semiring $\mathbb{N}$), a label may be smaller or larger than the certain multiplicity of a tuple. We call a labeling c-sound (no false positives) if it consistently under-approximates the certain annotation of tuples, c-complete (no false negatives) if it consistently over-approximates certainty, and c-correct if it annotates every tuple with its certain annotation. We also apply this terminology to labeling schemes, e.g., a c-sound labeling scheme only produces c-sound labelings. For UA-DBs we are mainly interested in c-sound labeling schemes to provide an under-approximation of certain annotations.
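Definition 5 (c-soundness, c-completeness, c-correctness).
Let $\mathcal{L}$ be an uncertainty labeling for an incomplete database $\mathcal{D}$, and let $\text{cert}_{\mathcal{K}}(\mathcal{D}, t)$ denote the certain annotation of tuple $t$. We call $\mathcal{L}$ ... iff for all tuples $t$ ...
c-sound: $\mathcal{L}(t) \preceq_{\mathcal{K}} \text{cert}_{\mathcal{K}}(\mathcal{D}, t)$
c-complete: $\text{cert}_{\mathcal{K}}(\mathcal{D}, t) \preceq_{\mathcal{K}} \mathcal{L}(t)$
c-correct: $\mathcal{L}(t) = \text{cert}_{\mathcal{K}}(\mathcal{D}, t)$
A labeling is both c-sound and c-complete iff it is c-correct. Ideally, queries over labelings would preserve these bounds.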
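When the possible worlds are small enough to enumerate, these three properties can be checked directly. The sketch below assumes bag semantics, where worlds and labelings are dictionaries from tuples to multiplicities, missing tuples have multiplicity 0, and the certain annotation is the minimum across worlds. The function names and the example worlds are hypothetical.

```python
def certain(worlds, t):
    """Certain multiplicity of tuple t: the minimum over all possible worlds."""
    return min(w.get(t, 0) for w in worlds)

def classify(labeling, worlds):
    """Report whether a labeling is c-sound, c-complete, and c-correct."""
    tuples = set(labeling).union(*worlds)
    sound = all(labeling.get(t, 0) <= certain(worlds, t) for t in tuples)
    complete = all(labeling.get(t, 0) >= certain(worlds, t) for t in tuples)
    return {"c-sound": sound, "c-complete": complete, "c-correct": sound and complete}

# Two hypothetical worlds: ("a",) is certain with multiplicity 1, ("b",) is not certain.
worlds = [{("a",): 2, ("b",): 1}, {("a",): 1}]

print(classify({("a",): 1}, worlds))  # exact labeling
print(classify({}, worlds))           # under-approximation
print(classify({("a",): 2}, worlds))  # over-approximation
```

The empty labeling is always c-sound but rarely useful, which is why tight c-sound schemes such as the ones in Section 4 matter in practice.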
Definition 6 (Preservation of Bounds).
A query semantics for uncertainty labelings preserves a property $X$ (c-soundness, c-completeness, or c-correctness) wrt. a class of queries $\mathcal{C}$, if for any incomplete database $\mathcal{D}$, labeling $\mathcal{L}$ for $\mathcal{D}$ that has property $X$, and query $Q \in \mathcal{C}$ we have: $Q(\mathcal{L})$ is an uncertainty labeling for $Q(\mathcal{D})$ with property $X$.
7. Querying Labelings
We now study whether queries over labelings produced by labeling schemes such as the ones described in Section 4 preserve c-soundness. Specifically, we demonstrate that standard $\mathcal{K}$-relational query evaluation preserves c-soundness for any c-sound labeling scheme. Recall that a query semantics for labelings preserves c-soundness if a query $Q$ evaluated on a c-sound labeling of an incomplete database $\mathcal{D}$ is a c-sound labeling for $Q(\mathcal{D})$. Our result generalizes a previous result of Reiter (Reiter, 1986) to any type of incomplete database for which we can define an efficient c-sound labeling scheme. We need the following lemma to show that the natural order of a semiring factors through addition and multiplication. This is a known result that we only state for completeness.
Lemma 2.
Let $\mathcal{K}$ be a naturally ordered semiring. For all $k_1, k_2, k_3, k_4 \in \mathcal{K}$ we have: if $k_1 \preceq_{\mathcal{K}} k_3$ and $k_2 \preceq_{\mathcal{K}} k_4$, then $k_1 + k_2 \preceq_{\mathcal{K}} k_3 + k_4$ and $k_1 \cdot k_2 \preceq_{\mathcal{K}} k_3 \cdot k_4$.
Proof.
See Appendix A ∎
7.1. Preservation of C-Soundness
We now prove that $\mathcal{RA}^{+}$ (positive relational algebra) over labelings preserves c-soundness. Since queries over both incomplete databases and labelings have $\mathcal{K}$-relational query semantics, we can make use of the fact that $\mathcal{RA}^{+}$ over $\mathcal{K}$-relations is defined using the addition and multiplication of the semiring. At a high level, the argument is as follows: (1) we show that the function computing certain annotations, applied to the result of an addition (or multiplication) of two elements $k_1$ and $k_2$, yields a result that is at least as large (wrt. the natural order) as adding (or multiplying) the results of applying it to $k_1$ and $k_2$; (2) since c-sound labelings for an input provide a lower bound on the certain annotations, we can apply Lemma 2 to show that the query result over a c-sound (or c-correct) labeling is a lower bound for the certain annotations of the result of the query. Combining both arguments, we get preservation of c-soundness.
Functions that have the property mentioned in (1) are called superadditive and supermultiplicative. Formally, a function $f: A \to B$, where $A$ and $B$ are closed under addition and multiplication and $B$ is ordered (order $\leq_B$), is superadditive (supermultiplicative) iff for all $a_1, a_2 \in A$:
$f(a_1 + a_2) \geq_B f(a_1) + f(a_2)$ (superadditive)
$f(a_1 \cdot a_2) \geq_B f(a_1) \cdot f(a_2)$ (supermultiplicative)
In a nutshell, if we are given a c-sound labeling, then evaluating any query over the labeling using $\mathcal{K}$-relational query semantics preserves c-soundness if we can prove that the function computing certain annotations is superadditive and supermultiplicative.
Lemma 3.
Let $\mathcal{K}$ be a semiring. The function computing certain annotations over $\mathcal{K}$ is superadditive and supermultiplicative wrt. the natural order $\preceq_{\mathcal{K}}$.
Proof.
See Appendix A ∎
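For bag semantics, the two inequalities can be checked by brute force on small annotation vectors. The sketch below assumes that the certain annotation of a vector of per-world multiplicities is its minimum; it is a sanity check on a finite range, not a proof, and the function names are hypothetical.

```python
from itertools import product

def cert(vec):
    """Certain multiplicity of a per-world annotation vector under bag semantics."""
    return min(vec)

def check_super_properties(n_worlds=2, max_mult=3):
    """Test superadditivity and supermultiplicativity on all small vectors."""
    vectors = list(product(range(max_mult + 1), repeat=n_worlds))
    for k1 in vectors:
        for k2 in vectors:
            added = tuple(a + b for a, b in zip(k1, k2))
            multiplied = tuple(a * b for a, b in zip(k1, k2))
            if cert(added) < cert(k1) + cert(k2):
                return False
            if cert(multiplied) < cert(k1) * cert(k2):
                return False
    return True

print(check_super_properties())  # True

# The inequality can be strict: each vector alone has certain multiplicity 0,
# but their sum is certain with multiplicity 1.
k1, k2 = (1, 0), (0, 1)
print(cert(tuple(a + b for a, b in zip(k1, k2))), cert(k1) + cert(k2))  # 1 0
```

The strict case is exactly the situation in which a query result is certain although none of the inputs it was derived from is certain, so a c-sound labeling may under-approximate it.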
Using the superadditivity and supermultiplicativity of the function computing certain annotations, we now prove preservation of c-soundness. We first prove a restricted version of this result.
Lemma 4.
Let $\mathcal{D}$ be an incomplete database and $\mathcal{L}$ be a c-correct labeling for $\mathcal{D}$. $\mathcal{RA}^{+}$ queries over $\mathcal{L}$ preserve c-soundness.
The major drawback of Lemma 4 is that it is limited to c-correct input labelings. Next, we show that c-soundness is still preserved even if the input labeling is only c-sound.
Theorem 5.
Let $\mathcal{D}$ be an incomplete database and $\mathcal{L}$ a c-sound labeling for $\mathcal{D}$. $\mathcal{RA}^{+}$ queries over $\mathcal{L}$ preserve c-soundness.
Proof.
See Appendix A ∎
In Section 8 we demonstrate that under certain circumstances, queries also preserve c-completeness.
8. Preservation of C-Completeness
TI-DBs. We now demonstrate that positive queries preserve c-completeness if the input is a labeling produced by the c-complete labeling scheme for TI-DBs (Section 4). To show this, we observe that if there exists a world in which two elements $k_1$ and $k_2$ are both minimal, then the function computing certain annotations commutes with addition and multiplication, and standard $\mathcal{K}$-relational semantics preserve c-completeness.
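Lemma 5.
Let $k_1$ and $k_2$ be elements of some possible world semiring. If there exists a possible world in which both $k_1$ and $k_2$ take their certain annotation, then the following holds: the certain annotation of $k_1 + k_2$ equals the sum of the certain annotations of $k_1$ and $k_2$, and the certain annotation of $k_1 \cdot k_2$ equals the product of the certain annotations of $k_1$ and $k_2$.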
Proof.
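To demonstrate c-completeness preservation for TI-DBs we have to demonstrate that the encoding of a TI-DB as an incomplete $\mathcal{K}$-database fulfills the precondition of Lemma 5.
Lemma 6.
Let $\mathcal{D}$ be an incomplete $\mathcal{K}$-database that represents a TI-DB. Then there exists a possible world of $\mathcal{D}$ such that, for any tuple $t$, the annotation of $t$ in this world is equal to the certain annotation of $t$.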
Proof.
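Consider the possible world that contains exactly the tuples with probability 1.
This world exists, because in a TI-DB all tuples with probability 1 have annotation $1_{\mathcal{K}}$ in all worlds. Furthermore, since the tuples are independent events, there must exist one world containing no tuples with probability less than 1. Consider the annotation of an arbitrary tuple $t$ in this world. (Case 1) $t$ has probability 1, and so its annotation is $1_{\mathcal{K}}$ in this world and in every other world, which is therefore also its certain annotation. (Case 2) $t$ has probability less than 1, and its annotation in this world is $0_{\mathcal{K}}$. Because $0_{\mathcal{K}}$ is the smallest element wrt. the natural order, it follows that the certain annotation of $t$ is $0_{\mathcal{K}}$. As a result, the annotation of every tuple in this world is equal to its certain annotation. ∎
Lemmas 5 and 6 together imply that our labeling approach preserves c-completeness if the input is a TI-DB.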
Corollary 1.
Let $\mathcal{L}$ be a labeling for a TI-DB computed by the TI-DB labeling scheme. Then $\mathcal{RA}^{+}$ over $\mathcal{L}$ preserves c-completeness.
x-DBs. In general, queries over labelings derived from x-DBs using our labeling scheme from Section 4 do not preserve c-completeness. We present a sufficient condition for a query to preserve c-completeness over such a labeling. To this end, we define x-keys, constraints that ensure that alternatives within the scope of an x-tuple are not all identical if projected on a set of attributes $A$. Since our labeling scheme for x-DBs is c-complete, queries preserve c-completeness unless a result tuple that is certain is derived from multiple correlated uncertain input tuples. Since x-tuples from an x-DB are independent of each other, this can only be the case if a result tuple is derived from alternatives of an x-tuple from every possible world (i.e., where the x-tuple is not optional). Such a situation can be avoided if it is guaranteed that it is impossible for a result tuple to be derived from all alternatives of an x-tuple.
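Definition 7 (x-key).
Let $R$ be an x-relation. A set of attributes $A$ from the schema of $R$ is called an x-key for $R$ iff, for every x-tuple of $R$ with more than one alternative, at least two of its alternatives differ when projected on $A$.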