Mimir: Bringing CTables into Practice

Arindam Nandi, Ying Yang, Oliver Kennedy
Boris Glavic, Ronny Fehling, Zhen Hua Liu, Dieter Gawlick
University at Buffalo       Illinois Institute of Technology       Airbus       Oracle
{arindamn, yyang25, okennedy}@buffalo.edu       bglavic@iit.edu
ronny.fehling@airbus.com       {zhen.liu,dieter.gawlick}@oracle.com
Abstract

The present state of the art in analytics requires high upfront investment of human effort and computational resources to curate datasets, even before the first query is posed. So-called pay-as-you-go data curation techniques allow these high costs to be spread out, first by enabling queries over uncertain and incomplete data, and then by assessing the quality of the query results. We describe the design of a system, called Mimir, around a recently introduced class of probabilistic pay-as-you-go data cleaning operators called Lenses. Mimir wraps around any deterministic database engine using JDBC, extending it with support for probabilistic query processing. Queries processed through Mimir produce uncertainty-annotated result cursors that allow client applications to quickly assess result quality and provenance. We also present a GUI that provides analysts with an interactive tool for exploring the uncertainty exposed by the system. Finally, we present optimizations that make Lenses scalable, and validate this claim through experimental evidence.

  • Uncertain Data, Provenance, ETL, Data Cleaning

    Data curation is presently performed independently of an analyst’s needs. To trust query results, an analyst first needs to establish trust in her data, and this process typically requires high upfront investment of human and computational effort. However, the level of cleaning effort is often not commensurate with the specific analysis to be performed. A class of so-called pay-as-you-go [?], or on-demand [?], data cleaning systems has arisen to flatten out this upfront cost. In on-demand cleaning settings, an analyst quickly applies data cleaning heuristics without needing to tune the process or supervise the output. As the analyst poses queries, the on-demand system continually provides feedback about the quality and precision of the query results. If the analyst wants higher-quality, more precise results, the system can also provide guidance to focus the analyst’s data cleaning efforts on curating inputs that are relevant to the analysis.

    In this paper we describe Mimir, a system that extends existing relational database engines with support for on-demand curation. Mimir is based on lenses [?], a powerful and flexible new primitive for on-demand curation. Lenses promise to enable a new kind of uncertainty-aware data analysis that requires minimal up-front effort from analysts, without sacrificing trust in the results of that analysis. Mimir is presently compatible with SQLite and a popular commercial enterprise database management system.

    In the work that first introduced Lenses [?] we demonstrated how curation tasks including domain constraint repair, schema matching, and data archival can be expressed as lenses. Lenses, in general, are a family of unary operators that (1) apply a data curation heuristic to clean or validate their input, and (2) annotate their output with all assumptions or guesses made by the heuristic. Critically, lenses require little to no upfront configuration: the lens’ output represents a best-effort guess. Previous efforts on uncertain data management [?] focus on producing exclusively correct, or certain, results. By comparison, lenses may include incorrect results. Annotations on the lens output persist through queries and provide a form of provenance that helps analysts understand potential sources of error and their impact on query results. This in turn allows an analyst to decide whether or not to trust query results, and how to best allocate limited resources to data curation efforts.

    Product
    id      name                brand    cat     ROWID
    P123    Apple 6s, White     ?        phone   R1
    P124    Apple 5s, Black     ?        phone   R2
    P125    Samsung Note2       Samsung  phone   R3
    P2345   Sony to inches      ?        ?       R4
    P34234  Dell, Intel 4 core  Dell     laptop  R5
    P34235  HP, AMD 2 core      HP       laptop  R6

    Ratings1
    pid     rating  review_ct  ROWID
    P123    4.5     50         R7
    P2345   ?       245        R8
    P124    4       100        R9

    Ratings2
    pid     evaluation  num_ratings  ROWID
    P125    3           121          R10
    P34234  5           5            R11
    P34235  4.5         4            R12

    Figure 1: Incomplete, error-filled example relations, including an implicit unique identifier attribute ROWID.
    Example 1

    Alice is an analyst at a retail store and is developing a promotional strategy based on public opinion ratings gathered by two data collection companies. A thorough analysis of the data requires substantial data curation effort from Alice: As shown in Figure 1, the rating companies’ schemas are incompatible, and the store’s own product data is incomplete. However, Alice’s preliminary analysis is purely exploratory, and she is hesitant to invest the effort required to fully curate this data. She creates a lens to fix missing values in the Product table:

    CREATE LENS SaneProduct AS SELECT * FROM Product
    USING DOMAIN_REPAIR(cat string NOT NULL,
                        brand string NOT NULL);

    From Alice’s perspective, the lens SaneProduct behaves as a standard database view. However, the content of the lens is guaranteed to satisfy the domain constraints on category and brand. NULL values in these columns are replaced according to a classifier built over the Product table. Under the hood, the Mimir system maintains a probabilistic version of the view as a so-called Virtual C-Table (VC-Table). A VC-Table cleanly separates the existence of uncertainty (e.g., the category value of a tuple is unknown), the explanation for how the uncertainty affects a query result (this is a specific type of provenance), and the model for this uncertainty as a probability distribution (e.g., a classifier for category values that is built when the lens is created).

    Uncertainty is encoded in a VC-Table through attribute values that are symbolic expressions over variables representing unknowns. A probabilistic model for these variables is maintained separately. Queries over a VC-Table can be translated into deterministic SQL over the lens’ deterministic inputs. This is achieved by evaluating deterministic expressions as usual and by manipulating symbolic expressions for computations that involve variables. The result is again a relational encoding of a VC-Table. The probabilistic model (or an approximation thereof) can be “plugged” into the expressions in a post-processing step to get a deterministic result. This approach has several advantages: (1) the probabilistic model can be created in a pay-as-you-go fashion, focusing efforts on the part that is relevant for an analyst’s query; (2) the symbolic expressions of a VC-Table serve as a type of provenance that explains how the uncertainty affects the query result; (3) changes to the probabilistic model or switching between different approximations for a model only require repetition of the post-processing step over the already computed symbolic query result; (4) large parts of the computation can be outsourced to a classical relational database; and (5) queries over a mixture of VC-Tables and deterministic tables are supported out of the box. A limitation of our preliminary work with VC-Tables is the scalability of the expressions outsourced to the deterministic database. Our initial approach sometimes creates outsourced expressions that cannot be evaluated efficiently. In this paper, we address this limitation, and in doing so demonstrate that VC-Tables are a scalable and practical tool for managing uncertain data. Concretely, in this paper:


    Figure 2: Grammars for boolean expressions and numerical expressions including VG-Functions.

    Possible Worlds Semantics.  An uncertain database over a schema is defined as a set of possible worlds: deterministic database instances over that schema. Possible worlds semantics defines queries over uncertain databases in terms of deterministic query semantics. A deterministic query applied to an uncertain database defines a set of possible results, one for each possible world. Note that these semantics are agnostic to the data representation, query language, and number of possible worlds. A probabilistic database is an uncertain database annotated with a probability distribution over its possible worlds. This distribution induces a distribution over all possible result relations: the probability of a result relation is the total probability of the possible worlds in which the query produces it.
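    The induced distribution can also be written out explicitly. The following is a sketch in notation assumed here for illustration only: $\mathcal{D}$ for the set of possible worlds, $p$ for the probability distribution over them, $Q$ for the deterministic query, and $R$ for a candidate result relation.

```latex
Q(\mathcal{D}) = \{\, Q(D) \mid D \in \mathcal{D} \,\}
\qquad
P\big[\, Q(\mathcal{D}) = R \,\big] \;=\; \sum_{D \in \mathcal{D} \,:\, Q(D) = R} p(D)
```

    Under these assumptions, the marginal probability of a single result tuple is obtained the same way, by summing $p(D)$ over the worlds whose result contains that tuple.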

    A probabilistic query processing (PQP) system is supposed to answer a deterministic query by listing all its possible answers and annotating each tuple with its marginal probability. These tasks are often #P-hard in practice, necessitating the use of approximation techniques.

    C-Tables and PC-Tables.  One way to make probabilistic query processing efficient is to encode the set of possible worlds and its probability distribution through a compact, factorized representation. In this paper we adopt a generalized form of C-Tables [?, ?] to represent the possible worlds, and PC-Tables [?, ?] to represent the possible worlds together with their probability distribution. A C-Table [?] is a relation instance where each tuple is annotated with a propositional formula over an alphabet of variable symbols. The formula is often called a local condition and the symbols in the alphabet are referred to as labeled nulls, or just variables. Intuitively, for each assignment to the variables we obtain a possible relation containing all the tuples whose formula is satisfied. For example:

    Product
    pid   name      brand      category
    P123  Apple 6s  Apple      phone
    P123  Apple 6s  Cupertino  phone
    P125  Note2     Samsung    phone

    The above C-Table defines a set of two possible worlds, i.e., one world for each possible assignment to the variable in the one-symbol alphabet. Notice that no possible world can contain both P123 tuples at the same time. Adding a probabilistic model for the variables, we get a PC-Table. For instance, in this example the model determines the probability that the brand of product P123 is Apple. C-Tables are closed w.r.t. positive relational algebra [?]: if an uncertain database is representable by a C-Table, then the result of any positive query over it is representable by another C-Table.
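    To make the relationship between local conditions and possible worlds concrete, the following minimal Python sketch enumerates the worlds of the example C-Table. The variable name x1, its two-value domain, the local conditions, and the probabilities are all assumptions made for illustration; the sketch also assumes that variables are independent.

```python
from itertools import product

# Each tuple is paired with a local condition over a variable assignment.
# The conditions on x1 and the probabilities below are assumed for illustration.
C_TABLE = [
    (("P123", "Apple 6s", "Apple", "phone"), lambda v: v["x1"] == 1),
    (("P123", "Apple 6s", "Cupertino", "phone"), lambda v: v["x1"] == 2),
    (("P125", "Note2", "Samsung", "phone"), lambda v: True),
]
DOMAINS = {"x1": [1, 2]}
PROBABILITY = {"x1": {1: 0.25, 2: 0.75}}


def possible_worlds(c_table, domains):
    """Yield (assignment, world) for every assignment to the variables."""
    names = sorted(domains)
    for values in product(*(domains[name] for name in names)):
        assignment = dict(zip(names, values))
        world = [row for row, condition in c_table if condition(assignment)]
        yield assignment, world


def marginal(c_table, domains, probability, target):
    """Probability that `target` appears, assuming independent variables."""
    total = 0.0
    for assignment, world in possible_worlds(c_table, domains):
        if target in world:
            weight = 1.0
            for name, value in assignment.items():
                weight *= probability[name][value]
            total += weight
    return total


worlds = list(possible_worlds(C_TABLE, DOMAINS))
apple = ("P123", "Apple 6s", "Apple", "phone")
apple_probability = marginal(C_TABLE, DOMAINS, PROBABILITY, apple)
```

    Enumerating every assignment is exponential in the number of variables, which is why practical systems keep the factorized representation rather than materializing worlds.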

    VG-Relational Algebra.  VG-RA (variable-generating relational algebra) [?] is a generalization of positive bag-relational algebra with extended projection that uses a simplified form of VG-functions [?]. In VG-RA, VG-functions (i) dynamically introduce new Skolem symbols into the alphabet that are guaranteed to be unique and deterministically derived from the function’s parameters, and (ii) associate the new symbols with probability distributions. Hence, VG-RA can be used to define new PC-Tables. Primitive-valued expressions in VG-RA (i.e., projection expressions and selection predicates) use the grammar summarized in Figure 2. The primary addition of this grammar is the VG-Function term, Var, which represents unknown values.

    VG-RA’s expression language enables a generalized form of C-Tables, where attribute-level uncertainty is encoded by replacing missing values with VG-RA expressions (not just variables) that act as freshly defined Skolem terms. For example, the previous PC-Table is equivalent to the generalized PC-Table:

    Product
    pid   name      brand    category
    P123  Apple 6s           phone
    P125  Note2     Samsung  phone

    It has been shown that generalized C-Tables are closed w.r.t. VG-RA [?, ?]. Evaluation rules for VG-RA use a lazy evaluation operator, which uses a partial binding of column or Var atoms to corresponding expressions. Lazy evaluation applies the partial binding and then reduces every sub-tree in the expression that can be deterministically evaluated. Non-deterministic sub-trees are left intact.
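    Lazy evaluation as described above can be sketched in a few lines of Python. This is an illustration rather than Mimir’s implementation: the tuple-based expression encoding, the operator table, and the function name lazy_eval are all assumptions.

```python
import operator

# Hypothetical encoding: ("const", v), ("col", name), ("var", name, *args),
# ("op", symbol, left, right), ("if", condition, then_expr, else_expr).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       "=": operator.eq, "<": operator.lt, ">": operator.gt}


def lazy_eval(expr, binding):
    """Apply a partial binding, then reduce every deterministic sub-tree."""
    kind = expr[0]
    if kind == "const":
        return expr
    if kind == "col":
        # Unbound columns stay symbolic; bound ones are replaced.
        return binding.get(expr[1], expr)
    if kind == "var":
        # Arguments are reduced, but the Var term itself is left intact.
        return ("var", expr[1]) + tuple(lazy_eval(a, binding) for a in expr[2:])
    if kind == "op":
        _, symbol, left, right = expr
        l, r = lazy_eval(left, binding), lazy_eval(right, binding)
        if l[0] == "const" and r[0] == "const":
            return ("const", OPS[symbol](l[1], r[1]))
        return ("op", symbol, l, r)
    if kind == "if":
        _, condition, then_expr, else_expr = expr
        c = lazy_eval(condition, binding)
        if c[0] == "const":
            return lazy_eval(then_expr if c[1] else else_expr, binding)
        return ("if", c, lazy_eval(then_expr, binding), lazy_eval(else_expr, binding))
    raise ValueError(f"unknown expression kind: {kind!r}")


predicate = ("op", "=", ("col", "brand"), ("const", "Apple"))
symbolic = lazy_eval(predicate, {"brand": ("var", "X", ("const", "R1"))})
reduced = lazy_eval(predicate, {"brand": ("const", "Samsung")})
```

    In this sketch, a row whose brand is bound to a Var term yields a symbolic predicate, while a row with a known brand reduces to a constant truth value.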

    Any tuple attribute appearing in a C-Table can be encoded as an abstract syntax tree for a partially evaluated expression that assigns it a value. This is the basis for evaluating projection operators, where every expression in the projection’s target list is lazily evaluated. Column bindings are given by each tuple in the source relation. The local condition is preserved intact through the projection. Selection is evaluated by combining the selection predicate with each tuple’s existing local condition. For example, consider a query over the example PC-Table. The result of this query is shown below. The second tuple of the input table does not fulfil the selection condition and is thus guaranteed to not be in the result. Note the symbolic expressions in the local condition and attribute values. Furthermore, note that the probabilistic model for the single variable is not influenced by the query at all.

    Query Result brand category phone

    From now on, we will implicitly assume this generalized form of C-Tables.

    Lenses.  Lenses use VG-RA queries to define new C-Tables as views: A lens defines an uncertain view relation through a VG-RA query that consists of a non-deterministic component and a deterministic component. Independently, the lens constructs a joint probability distribution over every variable introduced by the non-deterministic component, by defining a sampling process in the style of classical VG-functions [?], or supplementing it with additional meta-data to create a PIP-style grey-box [?]. These semantics are closed over PC-Tables. If the lens’ input query is itself non-deterministic (that is, the lens’ input is defined by a PC-Table), the lens’ semantics are virtually unchanged due to the closure of VG-RA over C-Tables.

    Example 2

    Recall the lens definition from Example 1. This lens defines a new C-Table using a VG-RA query that projects every attribute of Product. The brand and cat attributes are passed through a check for domain compliance and are replaced with a non-deterministic value if the check fails.

    The models for the replacement variables are defined by classifiers trained on the contents of Product.
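    The effect of this lens on individual rows can be illustrated with a short Python sketch. The ("Var", column, ROWID) tuple used to stand in for a Var term and the function name domain_repair are hypothetical, chosen only to show a null check followed by replacement with a symbolic value.

```python
# Three rows of the Product relation from the example data; None marks a NULL.
PRODUCT = [
    {"id": "P123", "name": "Apple 6s, White", "brand": None, "cat": "phone", "ROWID": "R1"},
    {"id": "P125", "name": "Samsung Note2", "brand": "Samsung", "cat": "phone", "ROWID": "R3"},
    {"id": "P2345", "name": "Sony to inches", "brand": None, "cat": None, "ROWID": "R4"},
]


def domain_repair(row, not_null_columns):
    """Replace NULLs in the listed columns with a symbolic variable per row."""
    repaired = dict(row)
    for column in not_null_columns:
        if row[column] is None:
            # Hypothetical stand-in for a Var term keyed by column and ROWID.
            repaired[column] = ("Var", column, row["ROWID"])
    return repaired


sane_product = [domain_repair(row, ["cat", "brand"]) for row in PRODUCT]
```

    Rows that already satisfy the constraints pass through unchanged, so only the cells that were actually missing become non-deterministic.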

    Virtual C-Tables.  Consider a probabilistic database in which all non-determinism is derived from lenses. In this database, all C-Tables, including those resulting from deterministic queries over non-deterministic data, can be expressed as VG-RA queries over a deterministic database. Furthermore, VG-RA admits a normal form [?] in which queries are segmented into a purely deterministic component and a non-deterministic component. These normalization rules are shown in Figure 3.

    Figure 3: Reduction to VG-RA Normal Form.

    Normalization does not affect the linkage between the C-Table computed by a VG-RA query and its associated probability measure: the measure remains unchanged. Moreover, the non-deterministic component of the normal form is a simple composite projection and selection operation.

    The simplicity of the non-deterministic component carries two benefits. First, the deterministic component of the query can be evaluated natively in a database engine, while the non-deterministic component can be applied through a simple shim interface wrapping around the database. Second, the abstract syntax tree of the expression acts as a form of provenance [?, ?] that annotates uncertain query results with metadata describing the level and nature of their uncertainty, a key component of the system we now describe. For example, in the query result shown above it is evident that the tuple will be in the result as long as the Var term standing in for the brand of row R1 evaluates to ‘Apple’. Mimir provides an API for the user to retrieve this type of explanation for a query result and comes with a user interface that visualizes explanations.

    The Mimir system is a shim layer that wraps around an existing DBMS to provide support for lenses. Using Mimir, users define lenses that perform common data cleaning operations such as schema matching, missing value interpolation, or type inference with little or no configuration on the user’s part. Mimir exports a native SQL query interface that allows lenses to be queried as if they were ordinary relations in the backend database. A key design feature of Mimir is that it has minimal impact on its environment. Apart from using the backend database to persist metadata, Mimir does not modify the database or its data in any way. As a consequence, Mimir can be used alongside any existing database workflow with minimal effort and minimal risk.

    Users define lenses through a CREATE LENS statement that immediately instantiates a new lens.

    Example 3

    Recall the example data from Figure 1. To merge the two ratings relations, Alice needs to re-map the attributes of Ratings2. Rather than doing so manually, she defines a lens that re-maps the attributes of the Ratings2 relation to those of Ratings1 as follows.

    CREATE LENS MatchedRatings2 AS
    SELECT * FROM Ratings2
    USING SCHEMA_MATCHING(pid string, …,
          rating float, review_ct float, NO LIMIT);

    CREATE LENS statements behave like a view definition, but also apply a data curation step to the output; in this case schema matching. Mapping targets may be defined explicitly or by selecting an existing relation’s schema in the GUI.

    In addition to a command-line tool, Mimir provides a Graphical User Interface (GUI), illustrated in Figure 4. Users pose queries over lenses and deterministic relations using standard SQL SELECT statements (a). Mimir responds to queries over lenses with a best guess result, that is, the result of the query in the possible world with maximum likelihood. In contrast to the classical notion of “certain” answers, the best guess may contain inaccuracies. However, all uncertainty arises from Var terms introduced by lenses. Consequently, using the provenance of each row and cell, Mimir can identify potential sources of error: Attribute values that depend on a Var term may be incorrect, and filtering predicates that depend on a Var term may lead to rows incorrectly being included or excluded in the result. We refer to these two types of error as non-deterministic cells and non-deterministic rows, respectively.

    Example 4

    Recall the result of the example query over the example PC-Table, which shows a VC-Table before the best guess values are plugged in. The only row is non-deterministic, because its existence depends on the value of the Var term that denotes the unknown brand of this tuple. The brand attribute value of this tuple is a non-deterministic cell, because its value depends on the same expression.

    In Mimir, query results (b) visually convey potential sources of error through several simple cues. First, a small provenance graph (c) helps the user quickly identify the data’s origin, what cleaning heuristics have been applied, and where.

    Potentially erroneous results are clearly identified: Non-deterministic rows have a red marker on the right and a grey background, while non-deterministic cells are highlighted in red.

    Figure 4: The Graphical Mimir User Interface

    Clicking on a non-deterministic row or cell brings up an explanation window (d). Here, Mimir provides the user with statistical metrics summarizing the uncertainty of the result, as well as a list of human-readable reasons why the result might be incorrect. Each reason is linked to a specific lens; if the user believes a reason to be incorrect, she can click “Fix” to override the lens’ data cleaning decision. An “Approve” button allows a user to indicate that the lens heuristic’s choice is satisfactory. Once all reasons for a given row or value’s non-determinism have been either approved or fixed, the row or value becomes green to signify that it is now deterministic.

    Example 5

    Figure 4 shows the results of a query where one product (with id ‘P125’) has an unusually high rating of 121.0. By clicking on it, Alice finds that a schema matching lens has incorrectly mapped the NUM_RATINGS column of one input relation to the RATINGS column of the other input relation: 121 is the number of ratings for the product, not the actual rating itself. By clicking on the fix button, Alice can manually specify the correct match and Mimir re-runs the query with the correct mapping.

    Making sources of uncertainty easily accessible allows the user to quickly track down errors that arise during heuristic data cleaning, even while viewing the results of complex queries. Limiting Mimir to simple signifiers like highlighting and notifications prevents the user from being overwhelmed by details, while explanation windows still allow the user to explore uncertainty sources in more depth at their own pace.

    The Mimir system’s architecture is shown in Figure 5. Mimir acts as an intermediary between users and a backend database using JDBC. Mimir exposes the database’s native SQL interface, and extends it with support for lenses. The central feature of this support is five new functions in the JDBC result cursor class that permit client applications such as the Mimir GUI to evaluate result quality. The first three indicate the presence of specific classes of uncertainty: (1) isColumnDeterministic(int | String) returns a boolean that indicates whether the value of the indicated attribute was computed deterministically without having to “plug in” values for variables. In our graphical interface, cells for which this function returns false are highlighted in red. Note that the same column may contain both deterministic and non-deterministic values (e.g., for a lens that replaces missing values with interpolated estimates). (2) isRowDeterministic() returns a boolean that indicates whether the current row’s presence in the output can be determined without using the probabilistic model. In our graphical interface, rows for which this function returns false are also highlighted. (3) nonDeterministicRowsMissing() returns a count of the number of rows that have so far been omitted from the result, and were discarded based on the output of a lens. In our graphical interface, when this method returns a number greater than zero after the cursor is exhausted, a notification is shown on the screen.

    As we discuss below, limiting the response of these functions to a simple boolean makes it possible to evaluate them rapidly, in-line with the query itself. For additional feedback, Mimir provides two methods: explainColumn(int | String) and explainRow(). Both methods construct and return an explanation object as detailed below. In the graphical Mimir interface, these methods are invoked when a user clicks on a non-deterministic (i.e., highlighted) row or cell, and the resulting explanation object is used to construct the uncertainty summary shown in the explanation window. Explanations do not need to be computed in-line with the rest of the query, but to maintain user engagement, explanations for individual rows or cells still need to be computed quickly when requested.

    Figure 5: The Mimir System

    A lens consists of two components: (1) A VG-RA expression that computes the output of the lens, introducing new variables in the process using Var terms, and (2) A model object that defines a probability space for every introduced variable. Recall that Var terms act as Skolem functions, introducing new variable symbols based on their arguments. For example, the ROWID attribute can be used to create a distinct variable named “X” for every row using an expression such as Var(‘X’, ROWID). Correspondingly, we distinguish between Var terms and the variable instances they create. Note that the latter are uniquely identified by the name and arguments of the term. The model object has three mandatory methods: (1) getBestGuess(var) returns the value of the specified variable instance in the most likely possible world. (2) getSample(var, id) returns the value of the specified variable instance in a randomly selected possible world. The id argument acts as a seed value, ensuring that the same possible world is selected across multiple calls. (3) getReason(var) returns a human-readable explanation of the heuristic guess represented by the specified variable instance. The getBestGuess method is used to produce best-guess query results. The remaining two methods are used by explanation objects. As in PIP [?] and Orion 2.0 [?], optional metadata can sometimes permit the use of closed-form solutions when computing statistical metrics for result values.
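    The three mandatory methods can be illustrated with a toy model object. The Python sketch below is not Mimir’s model interface: the class name, the snake_case method names, and the frequency-based guessing strategy are assumptions standing in for the classifiers a real lens would train.

```python
import random
from collections import Counter


class MostFrequentValueModel:
    """Hypothetical model: guesses a missing value from observed frequencies."""

    def __init__(self, column, observed):
        self.column = column
        self.counts = Counter(v for v in observed if v is not None)

    def get_best_guess(self, var):
        # Value of the variable instance in the most likely possible world.
        return self.counts.most_common(1)[0][0]

    def get_sample(self, var, seed):
        # Seeding with (seed, var) makes repeated calls agree on one world.
        rng = random.Random(f"{seed}:{var}")
        values = sorted(self.counts)
        weights = [self.counts[v] for v in values]
        return rng.choices(values, weights=weights, k=1)[0]

    def get_reason(self, var):
        guess = self.get_best_guess(var)
        return (f"Guessed {self.column} = '{guess}' for {var}: it is the most "
                f"frequent of {sum(self.counts.values())} observed values")


# The cat column of the example Product relation; None marks the missing value.
model = MostFrequentValueModel("cat", ["phone", "phone", "phone", None, "laptop", "laptop"])
variable = ("cat", "R4")
best = model.get_best_guess(variable)
```

    Deriving the random generator from both the seed and the variable instance is one simple way to make every lookup within a sampled world repeatable without storing the world itself.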

    Explanation objects provide a means for client applications like the GUI to programmatically analyze the non-determinism of a specific row or cell. Concretely, an explanation object provides methods that compute: (1) Statistical metrics that quantitatively summarize the distribution of possible result outcomes, and (2) Qualitative summaries or depictions of non-determinism in the result.

    Available statistical metrics depend on whether the explanation object is constructed for a row or a cell, and in the latter case also on what type of value is contained in the cell. For rows, the explanation object has only one method that computes the row’s confidence, or the probability that the row is part of the result set.

    For numerical cells, the explanation object has methods for computing the value’s variance, confidence intervals, and may also be able to provide upper and lower bounds. Variance and confidence intervals are computed analytically if possible, or Monte Carlo style by generating samples of the result using the getSample method on all involved models. When computing these metrics using Monte Carlo, we discard samples that do not satisfy the local condition of the cell’s row, as the cell will not appear in the result at all if the local condition is false. Statistical metrics are not computed for non-numerical cells.
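    The Monte Carlo procedure, including the step that discards samples failing the local condition, can be sketched as follows. The function name and its callback parameters (cell, condition, sample_world) are assumptions for illustration; sample_world stands in for calling getSample on every involved model with a shared seed.

```python
import random
import statistics


def monte_carlo_variance(cell, condition, sample_world, n_samples=1000):
    """Estimate the variance of a cell over worlds where its row exists.

    cell and condition are functions of a sampled world (a dict mapping
    variable instances to values); sample_world(seed) draws one such world.
    """
    accepted = []
    for seed in range(n_samples):
        world = sample_world(seed)
        # Samples that violate the row's local condition are discarded,
        # because the cell does not appear in the result in those worlds.
        if condition(world):
            accepted.append(cell(world))
    if len(accepted) < 2:
        return None  # not enough surviving samples to estimate a variance
    return statistics.pvariance(accepted)


def gaussian_world(seed):
    # Hypothetical model: a single variable drawn from a normal distribution.
    return {"x": random.Random(seed).gauss(10.0, 2.0)}


estimate = monte_carlo_variance(
    cell=lambda world: 2.0 * world["x"],
    condition=lambda world: world["x"] > 9.0,
    sample_world=gaussian_world,
)
```

    Because rejected samples are wasted work, a highly selective local condition can make this estimate expensive, which is one reason an analytical solution is preferred when it is available.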

    Quantitative statistical metrics do not always provide the correct intuition about a result value’s quality. In addition to the above metrics, an explanation object can also use Monte Carlo sampling to construct histograms and example result values for non-deterministic cells.

    Furthermore, for both cells and rows, an explanation object can produce a list of reasons: the human-readable summaries obtained from each participating model’s getReason method. Reasons are ranked according to the relative contribution of each Var term to the uncertainty of the result using a heuristic called CPI [?].

    Queries issued to Mimir are parsed into an intermediate representation (IR) based on VG-RA. Mimir maintains a list of all active lenses as a relation in the backend database. References to lenses in the IR are replaced with the VG-RA expression that defines the lens’ contents. The resulting expression is a VG-RA expression over deterministic data residing in the backend database.

    Before queries over lenses are evaluated, the query’s VG-RA expression is first normalized into a deterministic query wrapped in a selection followed by a projection. The selection predicate is a non-deterministic boolean expression, and the subsequent projection assigns the result of non-deterministic expressions to the corresponding attributes. Mimir obtains a classical JDBC cursor for the deterministic query from the backend database and constructs an extended cursor using the non-deterministic component and the JDBC cursor.

    Recall that non-determinism in the selection predicate and the projection expressions arises from Var terms. When evaluating these expressions, Mimir obtains a specific value for each Var term using the getBestGuess method on the model object associated with the term. Apart from this, the predicate and the projection expressions are evaluated as normal.
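    The evaluation steps above can be sketched as a thin wrapper around the rows returned by the backend. The Python class below is an illustration only: AnnotatedCursor, its constructor parameters, and the dictionary of best guesses are hypothetical and do not reproduce Mimir’s JDBC extension.

```python
class AnnotatedCursor:
    """Hypothetical shim-side cursor over rows returned by the backend."""

    def __init__(self, det_rows, predicate, row_is_det, projection, best_guess):
        self.det_rows = det_rows      # rows of the deterministic query
        self.predicate = predicate    # non-deterministic selection predicate
        self.row_is_det = row_is_det  # True if the predicate needs no guess
        self.projection = projection  # attribute name -> expression function
        self.best_guess = best_guess  # variable instance -> guessed value
        self.missing = 0

    def __iter__(self):
        for row in self.det_rows:
            deterministic = self.row_is_det(row)
            if self.predicate(row, self.best_guess):
                values = {name: fn(row, self.best_guess)
                          for name, fn in self.projection.items()}
                yield values, deterministic
            elif not deterministic:
                # Discarded only because of a guess: count it as possibly missing.
                self.missing += 1

    def non_deterministic_rows_missing(self):
        return self.missing


def brand_of(row, best_guess):
    # Domain repair: fall back to the best guess only when brand is NULL.
    if row["brand"] is None:
        return best_guess[("brand", row["ROWID"])]
    return row["brand"]


rows = [
    {"brand": None, "cat": "phone", "ROWID": "R1"},
    {"brand": "Samsung", "cat": "phone", "ROWID": "R3"},
    {"brand": None, "cat": None, "ROWID": "R4"},
]
guesses = {("brand", "R1"): "Apple", ("brand", "R4"): "Sony"}  # assumed values
cursor = AnnotatedCursor(
    det_rows=rows,
    predicate=lambda row, g: brand_of(row, g) == "Apple",
    row_is_det=lambda row: row["brand"] is not None,
    projection={"brand": brand_of, "cat": lambda row, g: row["cat"]},
    best_guess=guesses,
)
results = list(cursor)
```

    In this sketch the Samsung row is excluded deterministically and is not counted, whereas row R4 is excluded only because of its guessed brand and therefore increments the missing-row counter.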

    Determining whether an expression is deterministic or not requires slightly more effort. In principle, we could say that any expression is non-deterministic if it contains a Var term. However, it is still possible for the expression’s result to be entirely agnostic to the value of the Var term.

    Example 6

    Consider the expression used in Mimir’s domain constraint repair lens, which replaces an attribute with a Var term only if the attribute is null.

    Here, the value of the expression is only non-deterministic for rows where the attribute is null.

    Concretely, there are three such cases: (1) conditional expressions where the condition is deterministic, (2) AND expressions where one clause is deterministically false, and (3) OR expressions where one clause is deterministically true. Observe that these cases mirror semantics for NULL and UNKNOWN values in deterministic SQL.

    For each projection expression and for the selection predicate, the Mimir compiler uses a recursive descent through the expression, illustrated in Algorithm 1, to obtain a boolean formula that determines whether the expression is deterministic for the current row. These formulas permit quick responses to the isColumnDeterministic and isRowDeterministic methods. The counter for the nonDeterministicRowsMissing method is computed by using isRowDeterministic on each discarded row.

    Algorithm 1 isDet()
    Input: An expression in either grammar from Figure 2.
    Output: An expression that is true when the input expression is deterministic.
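    The recursive descent can be illustrated with a Python sketch that covers the three special cases named above. It is not a transcription of Algorithm 1: it decides determinism directly for one row instead of building a formula, it ignores SQL’s three-valued logic, and the tuple encoding and the names analyze and is_det are assumptions.

```python
import operator

# Hypothetical encoding: ("const", v), ("col", name), ("var", name, *args),
# ("isnull", e), ("not", e), ("and", a, b), ("or", a, b),
# ("op", symbol, a, b), ("if", condition, then_expr, else_expr).
OPS = {"=": operator.eq, "+": operator.add, "-": operator.sub, "*": operator.mul}


def analyze(expr, row):
    """Return (deterministic, value); value is None when not deterministic."""
    kind = expr[0]
    if kind == "const":
        return True, expr[1]
    if kind == "col":
        return True, row[expr[1]]
    if kind == "var":
        return False, None
    if kind in ("isnull", "not"):
        det, value = analyze(expr[1], row)
        if not det:
            return False, None
        return True, (value is None) if kind == "isnull" else (not value)
    if kind in ("and", "or"):
        absorbing = kind == "or"  # True absorbs OR, False absorbs AND
        (d1, v1), (d2, v2) = analyze(expr[1], row), analyze(expr[2], row)
        if (d1 and bool(v1) == absorbing) or (d2 and bool(v2) == absorbing):
            return True, absorbing  # one clause alone decides the result
        if d1 and d2:
            return True, not absorbing
        return False, None
    if kind == "op":
        (d1, v1), (d2, v2) = analyze(expr[2], row), analyze(expr[3], row)
        return (True, OPS[expr[1]](v1, v2)) if d1 and d2 else (False, None)
    if kind == "if":
        det, condition = analyze(expr[1], row)
        if not det:
            return False, None
        # Only the branch that is actually taken matters for this row.
        return analyze(expr[2] if condition else expr[3], row)
    raise ValueError(f"unknown expression kind: {kind!r}")


def is_det(expr, row):
    return analyze(expr, row)[0]


# Domain repair: use a Var term only when brand is NULL.
REPAIR = ("if", ("isnull", ("col", "brand")),
          ("var", "brand", ("col", "ROWID")), ("col", "brand"))
GUARDED = ("and", ("op", "=", ("col", "cat"), ("const", "laptop")),
           ("op", "=", REPAIR, ("const", "Apple")))
```

    The conjunction GUARDED shows the second case: for a row whose cat is not ‘laptop’ the first clause is deterministically false, so the row is deterministically excluded even if its brand is unknown.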

    When one of the explain methods is called, Mimir extracts all of the Var terms from the corresponding expression, and uses the associated model object’s getReason method to obtain a list of reasons. Variance, confidence bounds, and row-level confidence are computed by sampling from the possible worlds of the model using getSample and evaluating the expression in each possible world. Upper and lower bounds are obtained if possible from an optional method on the model object, and propagated through expressions where possible.

    The primary scalability challenge that we address in this paper relates to how queries are normalized in Virtual C-Tables. Concretely, the problem arises in the rule for normalizing selection predicates:

    Non-deterministic predicates are always pushed into the non-deterministic component, including those that could otherwise be used as join predicates. When this happens, the backend database is given a cross-product query to evaluate, and the join is evaluated far less efficiently as a selection predicate in the Mimir shim layer.

    In this section, we explore variations on the theme of query normalization. These alternative evaluation strategies make it possible for a traditional database to scalably evaluate C-Table queries, while retaining the functionality of Mimir’s uncertainty-annotated cursors as described above. Supporting C-Tables and annotated cursors carries several challenges:

    Var Terms.  Classical databases are not capable of managing non-determinism, which effectively turns Var terms into black boxes. Although a single best-guess value does exist for each term, the models that compute this value normally reside outside of the database.

    isDeterministic methods.  Mimir’s annotated cursors must be able to determine whether a given row’s presence (resp., a cell’s value) depends on any Var terms. Using the normal form, this is trivial, as all Var terms are conveniently located in a single expression that is used to determine the row’s presence (resp., to compute a cell’s value). Because these methods are used to construct the initial response shown to the user (i.e., to determine highlighting), they must be fast.

    Potentially Missing Rows.  Annotated cursors must also be able to evaluate the number of rows that could potentially be missing, depending on how the non-determinism is resolved. Although the result of this method is presented to the user as part of the initial query, the value is shown in a notification box and is off of the critical path of displaying the best guess results themselves.

    Explanations.  The final feature that annotated cursors are expected to support is the creation of explanation objects. These do not need to be created until explicitly requested by the user; the initial database query does not need to be directly involved in their construction. However, it must still be possible to construct and return an explanation object quickly to maintain user engagement.

    We now discuss two complementary techniques for constructing annotated iterators over Virtual C-Tables. Our first approach partitions queries into deterministic and non-deterministic fragments to be evaluated separately. The second approach pre-materializes best-guess values into the backend database, allowing it to evaluate the non-deterministic query with Var terms inlined.

    We observe that uncertain data is frequently the minority of the raw data. Moreover, for some lenses, whether a row is deterministic or not is data-dependent. Our first approach makes better use of the backend database by partitioning the query into one or more deterministic and non-deterministic segments, computing each independently, and unioning the results together. When the row-determinism of a result depends on deterministic data, we can push more work into the backend database for those rows that we know to be deterministic. For this deterministic partition of the data, joins can be evaluated correctly and other selection predicates can be satisfied using indexes over the base data. As a further benefit, tuples in each partition share a common lineage, allowing substantial re-use of annotated cursor metadata for all tuples returned by the query on a single partition.

    To partition a normal-form query $\mathcal{F}(\langle a_j \leftarrow e_j \rangle, \phi)(Q')$, we begin with a set of partitions $\Psi = \{\psi_1, \ldots, \psi_N\}$, each defined by a boolean formula $\psi_i$ over attributes in the schema of $Q'$. The set of partitions must be complete ($\psi_1 \vee \cdots \vee \psi_N \equiv \top$) and disjoint ($\psi_i \wedge \psi_j \equiv \bot$ for all $i \neq j$). In general, partition formulas are selected such that no partition contains query results that can be deterministically excluded from the result of the query.

    Example 7

    Recall the SaneProduct lens from Examples 1 and 2. Alice the analyst now poses a query:

    SELECT name FROM SaneProduct
    WHERE brand = 'Apple' AND cat = 'phone'

    Some rows of the resulting relation are non-deterministic, but only when the brand or cat in the corresponding row of Product is NULL. Optimizing further, all products that are known to be either non-phones or non-Apple products are also deterministically not in the result.
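
    The split described in this example can be sketched directly against SQLite. The Product rows below are made up for illustration (the table contents from Examples 1 and 2 are not reproduced here), and the two queries are one possible way to express the deterministic and non-deterministic partitions, not Mimir's generated SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Product (id TEXT, name TEXT, brand TEXT, cat TEXT);
    -- Hypothetical rows; NULL marks a value the lens would have to repair.
    INSERT INTO Product VALUES
        ('P1', 'Phone A',  'Apple',   'phone'),
        ('P2', 'Phone B',  NULL,      'phone'),
        ('P3', 'Laptop C', 'Apple',   NULL),
        ('P4', 'Tablet D', 'Samsung', 'tablet'),
        ('P5', 'Gadget E', NULL,      'tablet');
""")

# Deterministic partition: both attributes are known, so the backend
# database can evaluate the whole predicate by itself.
deterministic = conn.execute("""
    SELECT name FROM Product
    WHERE brand IS NOT NULL AND cat IS NOT NULL
      AND brand = 'Apple' AND cat = 'phone'
""").fetchall()

# Non-deterministic partition: at least one attribute is NULL, and every
# attribute that IS known does not already rule the row out.
nondeterministic = conn.execute("""
    SELECT name FROM Product
    WHERE (brand IS NULL OR cat IS NULL)
      AND (brand IS NULL OR brand = 'Apple')
      AND (cat IS NULL OR cat = 'phone')
    ORDER BY name
""").fetchall()
```

    Only the rows in the second result need to be handed to the shim layer; rows such as the hypothetical 'Gadget E' are excluded deterministically because their known category already fails the predicate.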

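    Given a set of partitions $\Psi = \{\psi_1, \ldots, \psi_N\}$, the partition rewrite transforms the original query into an equivalent set of partitioned queries as follows:

    $$\mathcal{F}(\langle a_j \leftarrow e_j \rangle, \phi)(Q') \;\mapsto\; \bigcup_{i=1}^{N} \mathcal{F}(\langle a_j \leftarrow e_j \rangle, \phi_{i,var})\big(\sigma_{\psi_i \wedge \phi_{i,det}}(Q')\big)$$

    where $\phi_{i,var}$ and $\phi_{i,det}$ are respectively the non-deterministic and deterministic clauses of $\phi$ (i.e., $\phi \equiv \phi_{i,var} \wedge \phi_{i,det}$ whenever $\psi_i$ holds) for each partition. Partitioning, then, consists of two stages: (1) obtaining a set of potential partitions $\Psi$ from the original condition $\phi$, and (2) segmenting $\phi$ into a deterministic filtering predicate and a non-deterministic lineage component.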

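    Require: $\phi$: A non-deterministic boolean expression
    Ensure: $\Psi$: A set of partition conditions
      $clauses \leftarrow \emptyset$
      $\Psi \leftarrow \emptyset$
      for (if $c$ then $\alpha$ else $\beta$) $\in$ subexpressions of $\phi$ do
         /* Check ifs in $\phi$ for candidate partition clauses */
         if $c$ is deterministic and exactly one of $\alpha$, $\beta$ is deterministic then
            $clauses \leftarrow clauses \cup \{c\}$
      /* Loop over the power-set of clauses */
      for $partition \in 2^{clauses}$ do
         $\psi \leftarrow \top$
         /* Clauses in the partition are true, others are false */
         for $clause \in clauses$ do
            if $clause \in partition$ then $\psi \leftarrow \psi \wedge clause$
            else $\psi \leftarrow \psi \wedge \neg clause$
         $\Psi \leftarrow \Psi \cup \{\psi\}$
    Algorithm 2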

    Algorithm 2 takes the selection predicate $\phi$ of the shim query and outputs a set of partitions $\Psi$. Partitions are formed from the set of all possible truth assignments to a set of candidate clauses. Candidate clauses are obtained from if statements appearing in $\phi$ that have deterministic conditions and that branch between deterministic and non-deterministic cases. For example, the if statement in Example 2 branches between deterministic values for non-null attributes, and non-deterministic possible replacements.

    Example 8

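    The normal form of the query in the prior example has the non-deterministic condition ($\phi$):

    $$(\text{if brand is null then } Var(\ldots) \text{ else brand}) = \text{'Apple'} \;\wedge\; (\text{if cat is null then } Var(\ldots) \text{ else cat}) = \text{'phone'}$$

    There are two candidate clauses in $\phi$: (brand is null) and (cat is null). Thus, Algorithm 2 creates 4 partitions: $\neg(\text{brand is null}) \wedge \neg(\text{cat is null})$, $(\text{brand is null}) \wedge \neg(\text{cat is null})$, $\neg(\text{brand is null}) \wedge (\text{cat is null})$, and finally $(\text{brand is null}) \wedge (\text{cat is null})$.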
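
    The power-set step of Algorithm 2 can be sketched as follows. The sketch assumes the candidate clauses have already been extracted and are represented as SQL predicate strings; the function name and the string representation are illustrative choices, not part of Mimir.

```python
from itertools import product


def naive_partitions(candidate_clauses):
    """Build one partition condition per truth assignment of the clauses."""
    partitions = []
    for assignment in product([True, False], repeat=len(candidate_clauses)):
        # Clauses assigned True appear as-is, the others appear negated.
        literals = [
            clause if holds else f"NOT ({clause})"
            for clause, holds in zip(candidate_clauses, assignment)
        ]
        # With no candidate clauses there is a single, unconditional partition.
        partitions.append(" AND ".join(literals) or "TRUE")
    return partitions


parts = naive_partitions(["brand IS NULL", "cat IS NULL"])
```

    For the two candidate clauses of the running example this yields four partition conditions, and in general the number of partitions doubles with every additional candidate clause.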

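    For each partition $\psi_i$ we can simplify $\phi$ into a reduced form $\phi_i$. We use $\phi[\psi_i]$ to denote the result of propagating the implications of $\psi_i$ on $\phi$. For example, $(\text{if } c \text{ then } \alpha \text{ else } \beta)[c] \equiv \alpha$. Using isDet from Algorithm 1, we partition the conjunctive terms of $\phi_i$ into deterministic and non-deterministic components $\phi_{i,det}$ and $\phi_{i,var}$, respectively, so that $\phi_i \equiv \phi_{i,det} \wedge \phi_{i,var}$.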

    As discussed earlier, there are three cases where non-determinism can be data-dependent: conditional expressions, conjunctions, and disjunctions. Algorithm 2 naively targets only conditionals. Conjunctions come for free, because deterministic clauses can be freely migrated into the deterministic query already. However, queries including disjunctions can be further simplified.

    Example 9

    We return once again to our running example, but this time with a disjunction in the WHERE clause:

    SELECT name FROM SaneProduct
    WHERE brand = 'Apple' OR cat = 'phone'

    Propagating the partition condition $\neg(\text{brand is null}) \wedge (\text{cat is null})$ into the normalized condition gives:

    $$\text{brand} = \text{'Apple'} \;\vee\; Var(\ldots) = \text{'phone'}$$

    The output is always deterministic for rows where brand = 'Apple'. However, this formula cannot be subdivided into deterministic and non-deterministic components as above.

    We next describe a more aggressive partitioning strategy that uses the structure of $\phi$ to create partitions where each partition depends on exactly the same set of Var terms. To determine the set of partitions for each sub-query, we use a recursive traversal through the structure of $\phi$, as shown in Algorithm 3. In contrast to the naive partitioning scheme, this algorithm explicitly identifies two partitions where $\phi$ is deterministically true and deterministically false. This additional information helps to exclude cases where one clause of an OR (resp., AND) is deterministically true (resp., false) from the non-deterministic partitions. To illustrate, consider the disjunction case handled by Algorithm 3. In addition to the partition where both children are non-deterministic, the algorithm explicitly distinguishes two partitions where one child is non-deterministic and the other is deterministically false. When $\phi$ is segmented, the resulting non-deterministic condition for this partition will be simpler.

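    Require: $\phi$: A non-deterministic boolean expression.
    Ensure: $\psi_\top$: The partition where $\phi$ is deterministically true.
    Ensure: $\psi_\bot$: The partition where $\phi$ is deterministically false.
    Ensure: $\Psi_{var}$: The set of non-deterministic partitions.
      if $\phi$ is $\phi_1 \vee \phi_2$ then
         $(\psi_{\top,1}, \psi_{\bot,1}, \Psi_{var,1}) \leftarrow$ Algorithm 3 applied to $\phi_1$
         $(\psi_{\top,2}, \psi_{\bot,2}, \Psi_{var,2}) \leftarrow$ Algorithm 3 applied to $\phi_2$
         $\psi_\top \leftarrow \psi_{\top,1} \vee \psi_{\top,2}$
         $\psi_\bot \leftarrow \psi_{\bot,1} \wedge \psi_{\bot,2}$; $\Psi_{var} \leftarrow \emptyset$
         for all $(\psi_1, \psi_2) \in \Psi_{var,1} \times \Psi_{var,2}$ do
            $\Psi_{var} \leftarrow \Psi_{var} \cup \{\psi_1 \wedge \psi_2\}$
         for all $\psi_1 \in \Psi_{var,1}$ do $\Psi_{var} \leftarrow \Psi_{var} \cup \{\psi_1 \wedge \psi_{\bot,2}\}$
         for all $\psi_2 \in \Psi_{var,2}$ do $\Psi_{var} \leftarrow \Psi_{var} \cup \{\psi_2 \wedge \psi_{\bot,1}\}$
      else if $\phi$ is $\phi_1 \wedge \phi_2$ then
         /* Symmetric with disjunction */
      else if $\phi$ is $\neg \phi_1$ then
         $(\psi_{\top,1}, \psi_{\bot,1}, \Psi_{var,1}) \leftarrow$ Algorithm 3 applied to $\phi_1$
         return $(\psi_{\bot,1}, \psi_{\top,1}, \Psi_{var,1})$
      else
         $\Psi \leftarrow$ Algorithm 2 applied to $\phi$
         $\psi_{det} \leftarrow \bot$; $\Psi_{var} \leftarrow \emptyset$
         for all $\psi \in \Psi$ do
            if $\phi[\psi]$ is deterministic then $\psi_{det} \leftarrow \psi_{det} \vee \psi$
            else $\Psi_{var} \leftarrow \Psi_{var} \cup \{\psi\}$
         $\psi_\top \leftarrow \psi_{det} \wedge \phi$
         $\psi_\bot \leftarrow \psi_{det} \wedge \neg \phi$
    Algorithm 3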
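
    The recursive combination of partitions for disjunction, conjunction, and negation can be sketched as below. This is a simplified reading of the prose description rather than a transcription of Algorithm 3: expressions are nested tuples, predicates are SQL strings, and each atomic comparison is assumed to arrive with its deterministically-true, deterministically-false, and non-deterministic conditions already worked out. All names are hypothetical.

```python
def general_partition(expr):
    """Return (true_when, false_when, nondeterministic_partitions) for expr."""
    kind = expr[0]
    if kind == "atom":
        # Base case (assumed precomputed): ("atom", true, false, [var, ...]).
        _, true_when, false_when, var_when = expr
        return true_when, false_when, list(var_when)
    if kind == "not":
        # Negation swaps the deterministically true and false partitions.
        t, f, var = general_partition(expr[1])
        return f, t, var
    t1, f1, var1 = general_partition(expr[1])
    t2, f2, var2 = general_partition(expr[2])
    if kind == "or":
        true_when = f"({t1}) OR ({t2})"
        false_when = f"({f1}) AND ({f2})"
        neutral = (f1, f2)  # a deterministically false child drops out of an OR
    elif kind == "and":
        true_when = f"({t1}) AND ({t2})"
        false_when = f"({f1}) OR ({f2})"
        neutral = (t1, t2)  # a deterministically true child drops out of an AND
    else:
        raise ValueError(f"unknown expression kind: {kind}")
    # Both children non-deterministic.
    var = [f"({a}) AND ({b})" for a in var1 for b in var2]
    # Exactly one child non-deterministic, the other neutral.
    var += [f"({a}) AND ({neutral[1]})" for a in var1]
    var += [f"({b}) AND ({neutral[0]})" for b in var2]
    return true_when, false_when, var


brand = ("atom", "brand = 'Apple'", "brand != 'Apple'", ["brand IS NULL"])
cat = ("atom", "cat = 'phone'", "cat != 'phone'", ["cat IS NULL"])
true_when, false_when, var_parts = general_partition(("or", brand, cat))
```

    For the disjunctive query of Example 9 this produces three non-deterministic partitions: one where both attributes are missing, and two where one attribute is missing and the other is known not to satisfy its comparison.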

    The partition approach makes full use of the backend database engine by splitting the query into deterministic and non-deterministic fragments. The lineage of the condition for each sub-query is simpler, and generally not data-dependent for all rows in a partition. As a consequence, explanation objects can be shared across all rows in the partition. The number of partitions obtained with both partitioning schemes is exponential in the number of candidate clauses. Partitions could conceivably be combined, increasing the number of redundant tuples processed by Mimir to create a lower-complexity query. In the extreme, we might have only two partitions: one deterministic and one non-deterministic. We leave the design of such a partition optimizer to future work.

    During best-guess query evaluation, each variable instance is replaced by a single, deterministic best-guess. Simply put, best-guess queries are themselves deterministic. The second approach exploits this observation to directly offload virtually all computation into the database. Best-guess values for all variable instances are pre-materialized into the database, and the Var terms themselves are replaced by nested lookup queries that can be evaluated directly.

    As part of lens creation, best-guess estimates must now be materialized. Recall from the VG-RA grammar that Var terms are defined by a unique identifier id and zero or more parameters. For each unique variable identifier allocated by the lens, Mimir creates a new table in the database. The schema of the best-guess table consists of the variable’s parameters, a best-guess value for the variable, and other metadata for the variable, including whether the user “Accept”ed it. The variable parameters form a key for the best-guess table.

    Example 10

    Recall the domain repair lens from Example 2. To materialize the best-guess relation, Mimir runs the lens query to determine all variable instances that are used in the current database instance. In the example, there are 4 such variables, one for each null value. For instance, each product with a missing brand instantiates its own “brand” variable. For all “brand” variables, Mimir creates a best-guess table with primary key param1, a best-guess value value, and attributes storing the additional metadata mentioned above. Mentions of variables in queries over the lens are replaced with a subquery that returns the best-guess value. For example, the conditional expression for brand in the VG-RA query, which substitutes a Var term whenever brand is NULL, is translated into the SQL expression

    CASE WHEN brand IS NULL
         THEN (SELECT value FROM best_guess_brand b
               WHERE b.param1 = Product.ROWID)
         ELSE brand END

    To populate the best-guess tables, Mimir simulates execution of the lens query and identifies every variable instance that is used when constructing the lens’ output. The result of calling getBestGuess on the corresponding model is inserted into the best-guess table. When a non-deterministic query is run, all references to Var terms are replaced by nested lookup queries that read the values for those terms from the corresponding best-guess tables. As a further optimization, the in-lined lens query can also be pre-computed as a materialized view.
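
    The materialization and lookup steps can be tried end to end in SQLite. In the sketch below the table name best_guess_brand and the columns param1 and value follow Example 10; the accepted column, the Product rows, and the get_best_guess function are hypothetical stand-ins for the metadata, data, and model of a real lens.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Product (name TEXT, brand TEXT);
    INSERT INTO Product (name, brand) VALUES
        ('Phone A', 'Apple'), ('Phone B', NULL), ('Phone C', NULL);
    -- One best-guess table per variable identifier, keyed by its parameter.
    CREATE TABLE best_guess_brand (
        param1   INTEGER PRIMARY KEY,
        value    TEXT,
        accepted INTEGER DEFAULT 0
    );
""")


def get_best_guess(rowid):
    # Hypothetical model: a real lens would consult a trained estimator.
    return "Samsung" if rowid % 2 == 0 else "Apple"


# Find every variable instance the lens query uses (one per NULL brand)
# and materialize its best guess.
null_rows = conn.execute(
    "SELECT ROWID FROM Product WHERE brand IS NULL"
).fetchall()
conn.executemany(
    "INSERT INTO best_guess_brand (param1, value) VALUES (?, ?)",
    [(rowid, get_best_guess(rowid)) for (rowid,) in null_rows],
)

# The in-lined lens query: Var terms become nested lookups.
rows = conn.execute("""
    SELECT name,
           CASE WHEN brand IS NULL
                THEN (SELECT value FROM best_guess_brand b
                      WHERE b.param1 = Product.ROWID)
                ELSE brand END AS brand
    FROM Product
    ORDER BY Product.ROWID
""").fetchall()
```

    Because the lookup is an ordinary correlated subquery, the backend database can plan and optimize the in-lined query like any other deterministic query.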

    This approach allows deterministic relational databases to directly evaluate best-guess queries over C-Tables, eliminating the need for a shim query to produce results. However, the shim query also provides a form of provenance, linking individual results to the terms that might affect them. Mimir’s annotated cursors rely on this link to efficiently determine whether a result row or cell is uncertain and also when constructing explanation objects.

    For inlining to be compatible with annotated cursors, three further changes are required: (1) To retain the ability to quickly determine whether a given result row or column is deterministic, result relations are extended with a ‘determinism’ attribute for the row and for each column. (2) To quickly construct explanation objects, we inject a provenance marker into each result relation that can be used with the shim query to quickly reconstruct any row or cell’s full provenance. (3) To count the number of potentially missing rows, we initiate a secondary arity-estimation query that is evaluated off of the critical path.

    Recall from Example 6 that expressions involving conditionals, conjunctions, and disjunctions can create situations where the determinism of a row or column is data dependent. In the naive execution strategy, these situations arise exclusively in the shim query and can be easily detected. As the first step towards recovering annotated cursors, we push this computation down into the query itself.

    Concretely, we rewrite a query $Q$ with schema $(a_1, \ldots, a_n)$ into a new query $\hat{Q}$ with schema $(a_1, \ldots, a_n, D_1, \ldots, D_n, D_{row})$. Each $D_i$ is a boolean-valued attribute that is true for rows where the corresponding $a_i$ is deterministic. $D_{row}$ is a boolean-valued attribute that is true for rows deterministically in the result set. We refer to these two added sets of columns as attribute- and row-determinism metadata, respectively. Query $\hat{Q}$ is derived from the input query $Q$ by applying the operator-specific rewrite rules described below, in a top-down fashion starting from the root operator of query $Q$.

    Projection.  The projection rewrite relies on a variant of Algorithm 1, which rewrites columns according to the determinism of the input. Consequently, the only change is that the column rewrite on line 5 replaces columns with a reference to the column’s attribute determinism metadata:
         4:  else if E is a column reference then
         5:       return the attribute determinism metadata of that column
    The rewritten projection is computed by extending the projection’s output with determinism metadata. Attribute determinism metadata is computed using the expression returned by isDet, and row determinism metadata is passed through unchanged from the input.

    Selection.  Like projection, the selection rewrite makes use of isDet. The selection is extended with a projection operator that updates the row determinism metadata if necessary.

    Cross Product.  Result rows in a cross product are deterministic if and only if both of their input rows are deterministic. Cross products are wrapped in a projection operator that combines the row determinism metadata of both inputs, while leaving the remaining attributes and attribute determinism metadata intact.

    Union.  Bag union already preserves the determinism metadata correctly and does not need to be rewritten.

    Relations.  In the base case of the rewrite, once we arrive at a deterministic relation, we annotate each attribute and row as being deterministic.

    Optimizations.  These rewrites are quite conservative in materializing the full set of determinism metadata attributes at every stage of the query. It is not necessary to materialize every attribute- and row-determinism attribute if it can be computed statically based solely on each operator’s output. For example, consider an attribute-determinism attribute that is data-independent, as in a deterministic relation or an attribute defined by a Var term. Such an attribute has the same value for every row and can be factored out of the query. A similar property holds for joins and selections, allowing the projection enclosing the rewritten operator to be avoided.
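
    The effect of the determinism rewrites can be illustrated on a single projection and selection. The sketch below is hand-written SQL under assumed column names (brand_det for attribute determinism, row_det for row determinism) and made-up data; it is meant to show the shape of the rewritten query, not the output of Mimir's rewriter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Product (name TEXT, brand TEXT);
    INSERT INTO Product (name, brand) VALUES
        ('Phone A', 'Apple'), ('Phone B', NULL),
        ('Phone C', NULL), ('Tablet D', 'Samsung');
    CREATE TABLE best_guess_brand (param1 INTEGER PRIMARY KEY, value TEXT);
    -- Hypothetical best guesses for the two NULL brands (ROWIDs 2 and 3).
    INSERT INTO best_guess_brand VALUES (2, 'Apple'), (3, 'Samsung');
""")

rows = conn.execute("""
    SELECT p.name, p.brand, p.brand_det,
           -- Selection: the row is only deterministically present if the
           -- attribute used by the predicate is itself deterministic.
           (p.row_det AND p.brand_det) AS row_det
    FROM (
        -- Projection: emit the best-guess value plus determinism metadata.
        SELECT Product.name AS name,
               COALESCE(Product.brand,
                        (SELECT value FROM best_guess_brand b
                         WHERE b.param1 = Product.ROWID)) AS brand,
               Product.brand IS NOT NULL AS brand_det,
               1 AS row_det  -- base relations are deterministic
        FROM Product
    ) AS p
    WHERE p.brand = 'Apple'
    ORDER BY p.name
""").fetchall()
```

    A client can read the two metadata columns directly from each result row to decide which rows and cells to highlight, without consulting the shim query.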

    Recall that explanation objects provide a way to analyze the non-determinism in a given result row or cell. Given a query $Q$ and its normalized form $\mathcal{F}(\langle a_j \leftarrow e_j \rangle, \phi)(Q')$, this analysis requires only the expressions $e_j$ and $\phi$, and the individual row in the output of $Q'$ used to compute the row or cell being explained.

    We now show how to construct a provenance marker during evaluation of $Q$ and how to use this provenance marker to reconstruct the single corresponding row of $Q'$. The key insight driving this process is that the normalization rewrites for cross product and union (Rewrites 3 and 4 of the normalization rules) are isomorphic with respect to the data dependency structure of the query; $Q$ and $Q'$ both have unions and cross products in the same places.

    As the basis for provenance markers, we use an implicit, unique per-row identifier attribute called ROWID that is supported by many popular database engines. When joining two relations in the in-lined query, their ROWIDs are concatenated, so the ROWID of each joined row is the ROWID of its left input row followed by the ROWID of its right input row.

    When computing a bag union, each source relation’s ROWID is tagged with a marker (+1 or +2) that indicates which side of the union it came from.

    Selections are left unchanged, and projections are rewritten to pass the ROWID attribute through.

    The method unwrap, summarized in Algorithm 4, illustrates how a symmetric descent through the deterministic component $Q'$ of a normal form query and a provenance marker can be used to produce a single row of $Q'$. The descent unwraps the provenance marker, recovering the single row from each join leaf used to compute the corresponding row of $Q'$.

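    Require: $Q'$: The deterministic component of a VG-RA normal form query.
    Require: $id$: A ROWID from the inlined query that was normalized into $Q'$.
    Ensure: A query to compute row $id$ of $Q'$
      if $Q'$ is $\pi_A(Q_1)$ then
         return $\pi_A(\text{unwrap}(Q_1, id))$
      else if $Q'$ is $\sigma_\psi(Q_1)$ then
         return $\sigma_\psi(\text{unwrap}(Q_1, id))$
      else if $Q'$ is $Q_1 \times Q_2$ and $id$ is $id_1$ followed by $id_2$ then
         return $\text{unwrap}(Q_1, id_1) \times \text{unwrap}(Q_2, id_2)$
      else if $Q'$ is $Q_1 \cup Q_2$ and $id$ is $id_1$ tagged +1 then
         return $\text{unwrap}(Q_1, id_1)$
      else if $Q'$ is $Q_1 \cup Q_2$ and $id$ is $id_2$ tagged +2 then
         return $\text{unwrap}(Q_2, id_2)$
      else if $Q'$ is a base relation $R$ then
         return $\sigma_{\text{ROWID} = id}(R)$
    Algorithm 4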
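
    The descent performed by unwrap can be sketched in a few lines. The sketch makes two simplifying assumptions that differ from the text: provenance markers are nested Python tuples rather than concatenated strings, and the function returns the list of (relation, ROWID) pairs reached at the leaves instead of a query. The relation names are hypothetical.

```python
def unwrap(query, marker):
    """Return the (relation, rowid) pairs that produced the marked row."""
    op = query[0]
    if op == "table":
        # Leaf: the marker is the ROWID within this base relation.
        return [(query[1], marker)]
    if op in ("select", "project"):
        # Selections and projections pass the marker through unchanged.
        return unwrap(query[1], marker)
    if op == "join":
        # A join marker pairs the markers of the left and right inputs.
        left_marker, right_marker = marker
        return unwrap(query[1], left_marker) + unwrap(query[2], right_marker)
    if op == "union":
        # A union marker records which side (1 or 2) the row came from.
        side, inner_marker = marker
        return unwrap(query[side], inner_marker)
    raise ValueError(f"unknown operator: {op}")


plan = ("select",
        ("join",
         ("table", "Product"),
         ("union", ("table", "RatingsA"), ("table", "RatingsB"))))
leaves = unwrap(plan, (7, (2, 13)))
```

    Each returned pair identifies one base row, so the full provenance of a result row can be fetched with one keyed lookup per join leaf.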

    The first two approaches provide orthogonal benefits. The partitioning approach results in faster execution of queries over deterministic fragments of the data, as it is easier for the backend database query optimizer to take advantage of indexes already built over the raw data. The inlining approach results in faster execution of queries over non-deterministic fragments of the data, as joins over non-deterministic values do not create a polynomial explosion of possible results. Our third and final approach is a simple combination of the two: Queries are first partitioned as in Approach 1, and then non-deterministic partitions are in-lined as in Approach 2.

    We now summarize the results of our experimental analysis of the optimizations presented in this paper. We evaluate Virtual C-Tables under the classical normalization-based execution model, and partition-, inline-, and hybrid-optimized execution models. All experiments are conducted using both SQLite as a backend, and a major commercial database termed DBX due to its licensing agreement. Mimir is implemented in Scala and Java. Measurements presented are for Mimir’s textual front-end. All experiments were run under RedHat Enterprise Linux 6.5 on a 16 core 2.6 GHz Intel Xeon server with 32 GB of RAM and a 4-disk 900 GB RAID5 array. Mimir and all database backends were hosted on the same machine to avoid including network latencies in measurements. Our experiments demonstrate that: (1) Virtual C-Tables scale well, (2) Virtual C-Tables impose minimal overhead compared to deterministic evaluation, and (3) Hybrid evaluation is typically optimal.

    Datasets were constructed using TPC-H [9]’s dbgen with scaling factors 1 (1 GB) and 0.1 (100 MB). To simulate incomplete data that could affect join predicates, we randomly replaced a percentage of foreign key references in the dataset with NULL values. We created domain constraint repair lenses over the damaged relations to “repair” these NULL values as non-materialized views. As a query workload, we used TPC-H Queries 1, 3, 5, and 9 modified in two ways. First, all relations used by the query were replaced by references to the corresponding domain constraint repair lens. Second, because Mimir does not yet include support for aggregation, we instead measured the cost of enumerating the set of results to be aggregated by stripping out all aggregate functions and computing their parameters instead (the altered queries can be found at https://github.com/UBOdin/mimir/tree/master/test/tpch_queries/noagg). Execution times were capped at 30 minutes.

    We experimented with two different backend databases: SQLite and a major commercial database DBX. We tried four different evaluation strategies: Classic is the naive, normalization-based evaluation strategy, while Partition, Inline, and Hybrid denote the optimized approaches presented in the preceding sections. Deterministic denotes the four test queries run directly on the backend databases with un-damaged data, and serves as a lower bound for how fast each query can be run.


    Figure: Performance of Mimir running over SQLite as a percent of deterministic query execution time. Panels: (a) TPC-H SF 0.1 (100 MB); (b) TPC-H SF 1 (1 GB).

    Figure: Performance of Mimir running over DBX as a percent of deterministic query execution time. Panels: (a) TPC-H SF 0.1 (100 MB); (b) TPC-H SF 1 (1 GB).

    The two figures above show the performance of Mimir running over SQLite and DBX, respectively. The graphs show Mimir’s overhead relative to the equivalent deterministic query.

    Table scans are unaffected by Mimir.  Query 1 is a single-table scan. In all configurations, Mimir’s overhead is virtually nonexistent.

    Partitioning accelerates deterministic results.  Query 3 is a 3-way foreign-key lookup join. Under naive partitioning, completely deterministic partitions are evaluated almost immediately. Even with partitioning, non-deterministic subqueries still need to be partly evaluated as cross products, and partitioning times out on all remaining queries.

    Partitioning can be harmful.  Query 5 is a 6-way foreign-key lookup join where Inline performs better than Hybrid. Each foreign-key is dereferenced in exactly one condition in the query, allowing Inline to create a query with a plan that can be efficiently evaluated using Hash-joins. The additional partitions created by Hybrid create a more complex query that is more expensive to evaluate.

    Partitioning can be helpful.  Query 9 is a 6-way join with a cycle in its join graph. Both PARTSUPP and LINEITEM have foreign key references that must be joined together. Consequently, Inlining creates messy join conditions that neither backend database evaluates efficiently. Partitioning results in substantially simpler nested queries that both databases accept far more gracefully.

    The design of Mimir draws on a rich body of literature, spanning the areas of probabilistic databases, model databases, provenance, and data cleaning. We now relate key contributions in these areas on which we have based our efforts.

    Incomplete Data.  Enabling queries over incomplete and probabilistic data has been an area of active research for quite some time. Early research in the area includes NULL values [8], the C-Tables data model [17] for representing incomplete information, and research on fuzzy databases [?]. The C-Tables representation used by Mimir has been linked to both probability distributions and provenance through so-called PC-Tables [15] and Provenance Semirings [21], respectively. These early concepts were implemented through a plethora of probabilistic database systems. Most notably, MayBMS [16] employs a simplification of C-Tables called U-Relations that does not rely on labeled nulls, and can be directly mapped to a deterministic relational database. However, U-Relations can only encode uncertainty described by finite discrete distributions (e.g., Bernoulli), while VC-Tables can support continuous, infinite distributions (e.g., Gaussian). Other probabilistic database systems include MCDB [18], Sprout [13], Trio [1], Orion 2.0 [31], Mystiq [6], Pip [22], Jigsaw [23], and numerous others. These systems all require heavyweight changes to the underlying database engine, or an entirely new database kernel. By contrast, Mimir is an external component that attaches to an existing deployed database, and can be trivially integrated into existing deterministic queries and workflows.

    Model Databases.  A specialized form of probabilistic database focuses on representing structured models such as graphical models or Markov processes as relational data. These types of databases exploit the structure of their models to accelerate query evaluation. Systems in this space include BayesStore [33], MauveDB [10], Lahar [25], and SimSQL [7]. In addition to defining semantics for querying models, work in this space typically explores techniques for training models on deterministic data already in the database. The Mimir system treats lens models as black boxes, ignoring model structure. It is likely possible to incorporate model database techniques into Mimir. We leave such considerations as future work.

    Provenance.  Provenance (sometimes referred to as lineage) describes how the outputs of a computation are derived from relevant inputs. Provenance tools provide users with a way of quickly visualizing and understanding how a result was obtained, most commonly as a way to validate outliers, better understand the results, or diagnose errors. Examples of provenance systems include a general provenance-aware database called Trio [1], a collaborative data exchange system called Orchestra [14], and a generic database provenance middleware called GProM, which also supports updates and transactions [2, 3]. It has been shown that certain types of provenance can encode C-Tables [?]. It is this connection that allows Mimir to provide reliable feedback about sources of uncertainty in the results. The VG-Relational algebra used in Mimir creates symbolic expressions that act as a form of provenance similar to semiring provenance [21] and its extensions to value expressions (aggregation [?]) and updates [?].

    Data Curation.  In principle, it is useful to query uncertain or incomplete data directly. However, due to the relative complexity of declaring uncertainty upfront, it is still typical for analysts to validate, standardize, and merge data before importing it into a data management system for analysis. Common problems in the space of data curation include entity de-duplication [11, 28, 30], interpolation [?, ?], schema matching [4, 24, 27, 29], and data fusion. Mimir’s lenses each implement a standard off-the-shelf curation heuristic. These heuristics usually require manual tuning, validation, and refinement. By contrast, in Mimir these difficult, error-prone steps can be deferred until query-time, allowing analysts to focus on the specific cleaning tasks that are directly relevant to the query results at hand.

    Other systems for simplifying or deferring curation exist. For example, DataWrangler [20] creates a data cleaning environment that uses visualization and predictive inference to streamline the data curation process. Similar techniques could be used in Mimir for lens creation, streamlining data curation even further. Mimir can also trace its roots to on-demand cleaning systems like Paygo [19], CrowdER [34], and GDR [35]. In contrast to Mimir, these systems each focus on a specific type of data cleaning: Duplication and Schema Matching, Deduplication, and Conditional Functional Dependency Repair, respectively. Mimir provides a general curation framework that can incorporate the specialized techniques used in each of these systems.

    Uncertainty Visualization.  Visualization of uncertain or incomplete data arises in several domains. As already noted, DataWrangler [20] uses visualization to help guide users to data errors. MCDB [18] uses histograms as a summary of uncertainty in query results. Uncertainty visualization has also been studied in the context of Information Fusion [5, 32]. Mimir’s explanation objects are primitive by comparison, and could be extended with any of these techniques.

    We presented Mimir, an on-demand data curation system based on a novel type of probabilistic data curation operators called Lenses. The system sits as a shim layer on top of a relational DBMS backend; currently we support SQLite and a commercial system. Lenses encode the result of a curation operation such as domain repair or schema matching as a probabilistic relation. The driving force behind Mimir’s implementation of Lenses is VC-Tables, a representation of uncertain data that cleanly separates the existence of uncertainty from a probabilistic model for the uncertainty. This enables efficient implementation of queries over lenses by outsourcing the deterministic component of a query to the DBMS. Furthermore, the symbolic expressions used by VC-Tables to represent uncertain values and conditions act as a type of provenance that can be used to explain how uncertainty affects a query result. In this paper we have introduced several optimizations of this approach that 1) push part of the probabilistic computation into the database (we call this inlining) without losing the ability to generate explanations and 2) partition a query based on splitting selection conditions such that some fragments can be evaluated deterministically or can benefit from available indexes. In future work we will investigate cost-based optimization techniques for lens queries and using the probabilistic model for uncertainty in the database to exclude rows that are deterministically not in the result (e.g., if a missing brand value is guaranteed to be either Apple or Samsung, then this value cannot fulfill a condition that requires any other brand).

    • [1] Parag Agrawal, Omar Benjelloun, Anish Das Sarma, Chris Hayworth, Shubha U. Nabar, Tomoe Sugihara, and Jennifer Widom. Trio: A system for data, uncertainty, and lineage. In VLDB, 2006.
    • [2] B Arab, Dieter Gawlick, Venkatesh Radhakrishnan, Hao Guo, and Boris Glavic. A generic provenance middleware for database queries, updates, and transactions. TaPP, 2014.
    • [3] Bahareh Arab, Dieter Gawlick, Vasudha Krishnaswamy, Venkatesh Radhakrishnan, and Boris Glavic. Reenacting transactions to compute their provenance. Technical report, Illinois Institute of Technology, 2014.
    • [4] Philip A Bernstein, Jayant Madhavan, and Erhard Rahm. Generic schema matching, ten years later. PVLDB, 2011.
    • [5] Ann M. Bisantz, Richard Finger, Younho Seong, and James Llinas. Human performance and data fusion based decision aids. In FUSION, 1999.
    • [6] Jihad Boulos, Nilesh N. Dalvi, Bhushan Mandhani, Shobhit Mathur, Christopher Ré, and Dan Suciu. MYSTIQ: a system for finding more answers by using probabilities. In SIGMOD, 2005.
    • [7] Zhuhua Cai, Zografoula Vagena, Luis Perez, Subramanian Arumugam, Peter J. Haas, and Christopher Jermaine. Simulation of database-valued markov chains using simsql. In SIGMOD, 2013.
    • [8] E. F. Codd. Extending the database relational model to capture more meaning. ACM TODS, 4(4):397–434, 1979.
    • [9] Transaction Processing Performance Council. TPC-H specification. http://www.tpc.org/tpch/.
    • [10] Amol Deshpande and Samuel Madden. MauveDB: supporting model-based user views in database systems. In SIGMOD, 2006.
    • [11] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE TKDE, 19(1):1–16, January 2007.
    • [12] Ronald Fagin, Phokion G. Kolaitis, and Lucian Popa. Data exchange: Getting to the core. In PODS, pages 90–101. ACM, 2003.
    • [13] Robert Fink, Andrew Hogue, Dan Olteanu, and Swaroop Rath. Sprout: a squared query engine for uncertain web data. In SIGMOD, 2011.
    • [14] Todd J. Green, Grigoris Karvounarakis, Zachary G. Ives, and Val Tannen. Provenance in ORCHESTRA. DEBU, 33(3):9–16, 2010.
    • [15] Todd J. Green and Val Tannen. Models for incomplete and probabilistic information. IEEE Data Eng. Bull., 29(1):17–24, 2006.
    • [16] Jiewen Huang, Lyublena Antova, Christoph Koch, and Dan Olteanu. MayBMS: a probabilistic database management system. In SIGMOD, pages 1071–1074. ACM, 2009.
    • [17] Tomasz Imielinski and Witold Lipski Jr. Incomplete information in relational databases. J. ACM, 31(4):761–791, 1984.
    • [18] Ravi Jampani, Fei Xu, Mingxi Wu, Luis Leopoldo Perez, Christopher Jermaine, and Peter J Haas. MCDB: a monte carlo approach to managing uncertain data. In SIGMOD, 2008.
    • [19] Shawn R. Jeffery, Michael J. Franklin, and Alon Y. Halevy. Pay-as-you-go user feedback for dataspace systems. In SIGMOD, pages 847–860. ACM, 2008.
    • [20] Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. Wrangler: Interactive visual specification of data transformation scripts. In SIGCHI, 2011.
    • [21] G. Karvounarakis and T.J. Green. Semiring-annotated data: Queries and provenance. SIGMOD Record, 41(3):5–14, 2012.
    • [22] Oliver Kennedy and Christoph Koch. PIP: A database system for great and small expectations. In ICDE, 2010.
    • [23] Oliver Kennedy and Suman Nath. Jigsaw: efficient optimization over uncertain enterprise data. In SIGMOD, 2011.
    • [24] Yoonkyong Lee, Mayssam Sayyadian, AnHai Doan, and Arnon S. Rosenthal. eTuner: tuning schema matching software using synthetic scenarios. VLDB J., 16(1):97–122, 2007.
    • [25] J. Letchner, C. Re, M. Balazinska, and M. Philipose. Access methods for Markovian streams. In ICDE, March 2009.
    • [26] Chris Mayfield, Jennifer Neville, and Sunil Prabhakar. ERACER: A database approach for statistical inference and data cleaning. In SIGMOD, 2010.
    • [27] Robert McCann, Warren Shen, and AnHai Doan. Matching schemas in online communities: A Web 2.0 approach. In ICDE, 2008.
    • [28] Fabian Panse, Maurice van Keulen, and Norbert Ritter. Indeterministic handling of uncertain decisions in deduplication. JDIQ, 4(2):9:1–9:25, March 2013.
    • [29] Erhard Rahm and Philip A. Bernstein. A survey of approaches to automatic schema matching. VLDB J., 10(4):334–350, 2001.
    • [30] Sunita Sarawagi and Anuradha Bhamidipaty. Interactive deduplication using active learning. In KDD, 2002.
    • [31] Sarvjeet Singh, Chris Mayfield, Sagar Mittal, Sunil Prabhakar, Susanne Hambrusch, and Rahul Shah. Orion 2.0: Native support for uncertain data. In SIGMOD, pages 1239–1242. ACM, 2008.
    • [32] Ion-George Todoran, Laurent Lecornu, Ali Khenchaf, and Jean-Marc Le Caillec. A methodology to evaluate important dimensions of information quality in systems. JDIQ, 6(2-3):11:1–11:23, June 2015.
    • [33] Daisy Zhe Wang, Eirinaios Michelakis, Minos Garofalakis, and Joseph M. Hellerstein. BayesStore: Managing large, uncertain data repositories with probabilistic graphical models. PVLDB, 1(1):340–351, 2008.
    • [34] Jiannan Wang, Tim Kraska, Michael J. Franklin, and Jianhua Feng. CrowdER: Crowdsourcing entity resolution. PVLDB, 5(11):1483–1494, 2012.
    • [35] Mohamed Yakout, Ahmed K. Elmagarmid, Jennifer Neville, Mourad Ouzzani, and Ihab F. Ilyas. Guided data repair. PVLDB, 4(5):279–289, 2011.
    • [36] Ying Yang, Niccolò Meneghetti, Ronny Fehling, Zhen Hua Liu, and Oliver Kennedy. Lenses: An on-demand approach to ETL. PVLDB, 8(12):1578–1589, 2015.
    • [37] Maria Zemankova and Abraham Kandel. Implementing imprecision in information systems. Information Sciences, 37(1):107–141, 1985.