AlphaClean: Automatic Generation of Data Cleaning Pipelines


Sanjay Krishnan (University of Chicago) skr@cs.uchicago.edu and Eugene Wu (Columbia University) ewu@cs.columbia.edu
Abstract.

The analyst effort in data cleaning is gradually shifting away from the design of hand-written scripts to building and tuning complex pipelines of automated data cleaning libraries. Hyperparameter tuning for data cleaning is very different from hyperparameter tuning for machine learning, since the pipeline components and objective functions have structure that tuning algorithms can exploit. This paper proposes a framework, called AlphaClean, that rethinks parameter tuning for data cleaning pipelines. AlphaClean provides users with a rich library to define data quality measures with weighted sums of SQL aggregate queries. AlphaClean applies a generate-then-search framework where each pipelined cleaning operator contributes candidate transformations to a shared pool. Asynchronously, in separate threads, a search algorithm sequences them into cleaning pipelines that maximize the user-defined quality measures. This architecture allows AlphaClean to apply a number of optimizations, including incremental evaluation of the quality measures and learning dynamic pruning rules to reduce the search space. Our experiments on real and synthetic benchmarks suggest that AlphaClean finds solutions of up to 9x higher quality than naively applying state-of-the-art parameter tuning methods, is significantly more robust to straggling data cleaning methods and redundancy in the data cleaning library, and can incorporate state-of-the-art cleaning systems such as HoloClean as cleaning operators.

1. Introduction

Data cleaning is widely recognized as a major challenge in almost all forms of data analytics (nytimes, ). Analysts report spending upwards of 80% of analysis time on data cleaning and preparation. Improperly handled errors can affect the performance and accuracy of downstream applications such as reports, visualizations, and machine learning models. In response, the research community has developed a number of sophisticated data cleaning libraries for detecting and repairing errors in large datasets (dc, ; rekatsinas2017holoclean, ; DBLP:journals/pvldb/KrishnanWWFG16, ; DBLP:conf/sigmod/ChuIKW16, ; mudgal2018deep, ; doan2018toward, ). As a consequence, the burden on the analyst is gradually shifting away from the design of hand-written data cleaning scripts to building and tuning pipelines of automated data cleaning libraries (krishnan2016hilda, ).

Systems to automatically optimize these pipelines and their parameters are desirable. An initial architecture is to directly apply recent hyperparameter tuning approaches for machine learning pipelines and neural network model search (li2017hyperband, ; sparks2017keystoneml, ; baylor2017tfx, ; golovin2017google, ; liaw2018tune, ). We can treat an entire data cleaning pipeline as a parametrized black box exposing tunable parameters such as confidence thresholds and editing penalties. We can quantify the success or failure of a parameter setting with a final data quality objective function (e.g., number of tuples violating integrity constraints or cross-referencing with master data). The tuning system will then search over possible parameter settings to optimize the objective function.

Hyperparameter tuning systems are fundamentally ill-suited for data cleaning applications. They only assume query access to the final objective value and neglect any structure and opportunities for shared computation in the pipeline. For example, even if the objective function was based solely on integrity constraint violations, a black-box tuning system would not recognize that integrity constraints can be incrementally checked without full re-computation (fan2014incremental, ). Similarly, these systems would not recognize opportunities for re-ordering the application of libraries and excluding irrelevant libraries.

We present a new framework called AlphaClean whose main insight is that a common intermediate representation for repairs can facilitate more efficient data cleaning pipeline optimization. Many popular data cleaning libraries actually “speak the same language”, where all of their repairs can be cast as cell-replacement operations (rekatsinas2017holoclean, ; DBLP:conf/sigmod/ChuIKW16, ; DBLP:journals/pvldb/KrishnanWWFG16, ). In AlphaClean, rather than treating the entire pipeline as a single parameterized black box, the system assesses the fine-grained repairs from each parameter setting and re-orders, excludes, and merges them accordingly.

Users interface their existing data cleaning libraries to AlphaClean with minimal code that exposes an input interface to set parameters and an output interface to collect proposed edits to a table. Libraries can be as narrow or as general as the user desires. For example, they can be domain-specific string matching functions or entire data cleaning systems such as HoloClean (rekatsinas2017holoclean, ). Each of these libraries suggests candidate repairs to a central pool. Users define a data quality function (the objective) with SQL aggregation queries (allowing for UDAFs) over the input table. This subsumes popular quality measures such as integrity constraint violations (ilyas2015trends, ) and numerical outlier detection (bailis2017macrobase, ), and can readily express application-specific quality measures such as machine learning training accuracy or goodness-of-fit to a model. Separate threads search through the pool of candidates to decide on a sequence of repairs (a cleaning pipeline) that optimizes this quality function. AlphaClean works in an “anytime” fashion where results are progressively returned to the user.

The search algorithm is implemented as a greedy tree-search that sequences the repairs (russell2016artificial, ). The space of possible repair sequences is enormous (our experiments encounter branching factors in the millions). Thus, it is important to avoid fully evaluating a path’s quality and expanding unpromising paths. AlphaClean dynamically learns a model to avoid executing the pipeline and quality function in order to evaluate a given path, and can be tuned to have a low false positive rate when pruning candidate paths. Furthermore, the tree search can be easily parallelized across both candidate paths, as well as across partitions of the dataset based on properties of the quality function. We use periodic synchronization to update the prediction model across parallel searches and merge transformations that repair disjoint sets of tuples.

AlphaClean contributes a new architecture to data cleaning optimization. This flexibility in composing different quality functions can help users across different domains evaluate different notions of quality within a single system. The intermediate representation and generate-then-search paradigm allow for intelligent composition of multiple systems. Even in cases where an existing cleaning system is specifically designed for the errors in the dataset (e.g., integrity constraints), AlphaClean can combine other cleaning operators to further improve the repairs. Our experiments show that one of the most powerful benefits of AlphaClean comes from its ensembling effects and its natural robustness to redundant or distracting pipeline components.

2. Background

We study parameter tuning for systems that address cell inconsistencies, where record values are missing, incorrect, contain inconsistent references to the same entities, or contain artifacts from the extraction process.

2.1. Motivation

Our goal is to develop techniques to automatically generate and tune data cleaning pipelines based on user-specified quality characteristics. Thus, the user can primarily focus on composing and expressing data quality issues, and allow the system to explore the space of physical cleaning plans. We would like the search procedure to be progressive, in the sense that it quickly generates acceptable cleaning plans, and refines those plans over time. Thus, the user can immediately assess her hypothesis, or test multiple hypotheses in parallel.

Figure 1. Typical data cleaning pipeline. The user finds that analysis results (of SQL query, ML model, web application, etc) are suspicious and iteratively (1) composes a quality function to characterize the suspicious quality issues, and (2) modifies the data cleaning pipeline to address the errors. AlphaClean improves this human-in-the-loop process by providing an expressive, composable quality function, and automatically searching for cleaning pipelines.

This iterative pattern makes data cleaning a human-in-the-loop problem, where the developer explores a large space of data quality issues and data cleaning programs (Figure 1). However, the data cleaning systems ecosystem is diffuse, with separate systems for constraint resolution  (rekatsinas2017holoclean, ), cleaning in machine learning pipelines (DBLP:journals/pvldb/KrishnanWWFG16, ), entity resolution (mudgal2018deep, ; doan2018toward, ), and crowdsourcing (DBLP:journals/pvldb/HaasKWF015, ). Each of these systems has its own idiosyncrasies and parameters, and tuning even one of these systems can be a daunting challenge. Real-world datasets have mixes of errors (krishnan2016hilda, ) and often require multiple systems to clean (DBLP:conf/sigmod/ChuIKW16, ). Although these systems make it easier to construct and execute a pipeline, the space of possible operator pipelines and parameterizations of each operator is exponential in the number of operators, parameters, and pipeline depth, and is infeasible for developers to manually search.

2.2. Challenges

We could start by considering the recent work in hyperparameter tuning for machine learning, which identifies the optimal assignment of hyperparameters to maximize an objective function (e.g., training accuracy for ML models). Several systems have been built to run hyperparameter and neural network model search at scale (li2017hyperband, ; sparks2017keystoneml, ; baylor2017tfx, ; golovin2017google, ; liaw2018tune, ). For single-threaded search, the state of the art remains Bayesian optimization, e.g., Python Hyperopt (bergstra2013hyperopt, ). Since Bayesian optimization is inherently sequential, for parallel and distributed settings the community is increasingly studying randomized and grid search schemes (li2017hyperband, ; liaw2018tune, ; golovin2017google, ). For a pipeline of up to k cleaning components, we can create a parameter that represents the operator type in each of the k pipeline slots, along with additional parameters to tune each operator in each slot. A hyperparameter tuning algorithm will then select and assign parameter values to a sequence of operators. Although this approach is possible, it ignores important aspects of data cleaning problems that can enable more efficient and flexible approaches.

Quality Function Structure: Hyperparameter tuning algorithms are also called “black-box” optimization algorithms because they only assume oracular access to the optimization objective (i.e., they can evaluate the quality of a given plan). In contrast, the objectives in data cleaning have far more structure. If the objective is to minimize functional dependency violations, it would be wasteful to recompute all violations after every repair. One could incrementally update the objective from the set of modified keys. This is also true in time-series data cleaning problems where quality measures are tied to certain windows of data: there is no point re-evaluating the whole objective if only a small window is affected. In other words, data quality measures commonly support efficient incremental evaluation, and satisfy properties that enable data partitioning. Neglecting this structure leads to a large amount of duplicated effort for every parameter setting evaluated.

Figure 2. 10% of a dataset of dictionary words are duplicated with randomly generated spelling errors. The dataset is to be cleaned with a similarity matcher and a spell checker. Holistically, tuning the parameters of both with python hyperopt (BB-Full) is inefficient due to interactions between the two data cleaning options. It takes over 3x the amount of search time for the joint optimization to exceed the best tuned single cleaning method (BB-Edit and BB-SpellCheck)

Data Cleaning Method Structure: Similarly, black-box search algorithms would treat the data cleaning pipeline as a monolithic parametrized unit. This leads to an attribution problem, namely, which parameter change is responsible for an increase (or decrease) in objective value. Figure 2 illustrates this concern on a toy data cleaning problem, with a hyperparameter search based on the Tree-structured Parzen Estimator (TPE) (shahriari2016taking, ), implemented using Python Hyperopt. We corrupted 1000 dictionary words so that 10% are duplicated with randomly generated spelling errors affecting 1-3 characters. The quality function is the F1 score of the cleaned dataset as compared to the ground truth. We consider two parameterized operators: edit_dist_match(thresh) is a string edit distance similarity matcher with a tunable threshold, and ispell(rec) is a spell checker with a tunable recommendation parameter based on the distance between the dictionary word and the misspelled word. The two operators partially overlap in their cleaning behavior, and we will see how this affects the search problem below.

We compare hyperparameter search for three fixed pipelines: single-operator pipelines (edit_dist_match) and (ispell), and a joint pipeline (edit_dist_match, ispell). By fixing the operator pipeline, the search algorithm only needs to learn parameterizations of the operators. Although we expect the joint pipeline to perform the best, Figure 2 shows that there is a trade-off between runtime and data quality (measured as F1 score). It takes over 3x the search time for the joint pipeline to exceed the best single-operator pipeline. In contrast, the single-operator pipelines converge quickly. The reason is that the two operators overlap in functionality (some duplicates can be fixed by either ispell or edit_dist_match), which forces the joint optimization to explore redundant parameter settings that have the same cleaning results. In practice, pipelines and the set of operators can be much larger, so the likelihood of redundant operators, or even operators that reverse changes made by previous operators, is high.

However, data cleaning problems have structure that can sidestep this issue. If we consider data cleaning operators that preserve schema (same input and output types), they can be reordered, queried/optimized independently, and ensembled in ways that general machine learning pipelines cannot. For example, what if we independently optimized both single-operator pipelines (edit_dist_match) and (ispell), and then took the consensus between their repairs? Such operations are disallowed in current hyperparameter tuning approaches.

This is the main intuition behind AlphaClean: rather than treating a pipeline as a monolithic parametrized unit, we decompose it into its constituent repairs. The system then interleaves those repairs that improve the objective function. These repairs can be generated asynchronously in a pool of worker threads that query each cleaning operator with different parameters, making the search robust to operators that are slow or straggle on difficult datasets. We also include the curve for AlphaClean on this problem in Figure 2; the next sections describe the design that accelerates such cleaning problems.

Figure 3. AlphaClean decouples sampling from the parameter space from search. This allows the user to iterate quickly by observing early best-effort results.

3. Architecture and Overview

As stated above, our goal is a system that automatically generates and tunes data cleaning pipelines based on user-specified quality characteristics, so that the user can primarily focus on composing and expressing data quality issues. The search procedure should be progressive: it quickly generates acceptable cleaning plans and refines those plans over time, letting the user immediately assess her hypothesis, or test multiple hypotheses in parallel.

3.1. Interfacing Data Cleaning Frameworks

The first component of the system is the API between existing data cleaning libraries and AlphaClean. We encapsulate the logic of such libraries into a unit we call a data cleaning framework: a parametrized function that transforms a dataset. We assume these transformations preserve schema and do not delete or add records. Two classes are important to note: Parameter, which represents the input parameters to a particular framework, and Repair, which represents the transformations that the framework makes to a given dataset for a particular parameter setting. Section 4 describes how repairs are represented and composed in more detail.

Accordingly, each framework is then interfaced to AlphaClean with the following API calls:

  • getParameterSpace(): Iterator<Parameter>. Iterates through all possible parameter settings.

  • setParameter(Parameter val). Chooses a particular parameter setting for the framework.

  • collectRepairs(): Iterator<Repair>. Iterates through all repairs that the framework would like to apply to the dataset.
Example 3.1.

The spell checker ispell(d, attr) from the running example can be tuned by setting a maximum edit distance d between the dictionary word and the attribute value attr. The parameter space is thus the set of allowable edit distances for d, and all attributes in the relation for attr. Similarly, edit_match(d, attr) is an edit distance matcher that searches for other attr values in the relation within an edit distance of d, and sets the value to the most frequent one. In this case, the value of the assignment is computed dynamically. Finally, chase(fd, R) is parameterized by a functional dependency fd from a user-provided set of FDs, and will run the chase algorithm (Deutsch2008TheCR, ) for fd over R.
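To make the interface concrete, the following is a minimal sketch of how a spell checker could be exposed through these three calls. It is hypothetical Python: IspellFramework, the spell_check helper, and the id key are our own stand-ins, not AlphaClean's actual API bindings.

    from dataclasses import dataclass
    from itertools import product
    from typing import Iterator, List, Optional

    @dataclass(frozen=True)
    class Parameter:
        d: int      # maximum edit distance to a dictionary word
        attr: str   # attribute to spell-check

    @dataclass(frozen=True)
    class Repair:
        key: object   # primary key of the record to modify
        attr: str     # attribute to overwrite
        value: str    # replacement value

    def spell_check(word: str, max_dist: int) -> Optional[str]:
        # Stand-in for an ispell dictionary lookup (assumed helper); a real
        # wrapper would return a dictionary word within max_dist edits.
        return None

    class IspellFramework:
        def __init__(self, records: List[dict], attrs: List[str], max_d: int = 3):
            self.records, self.attrs, self.max_d = records, attrs, max_d
            self.param: Optional[Parameter] = None

        def getParameterSpace(self) -> Iterator[Parameter]:
            # Cross product of edit-distance thresholds and attribute names.
            for d, attr in product(range(1, self.max_d + 1), self.attrs):
                yield Parameter(d, attr)

        def setParameter(self, val: Parameter) -> None:
            self.param = val

        def collectRepairs(self) -> Iterator[Repair]:
            d, attr = self.param.d, self.param.attr
            for r in self.records:
                suggestion = spell_check(r[attr], max_dist=d)
                if suggestion is not None and suggestion != r[attr]:
                    yield Repair(r["id"], attr, suggestion)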

3.2. Quantifying Data Quality

In machine learning, the objective function for hyper-parameter tuning is often given as the cross-validation error of a model. In data cleaning applications, we may not always have objective ground truth. A quality function measures a specific notion of cleanliness for a relation, and is used as the objective function for the tuning; it is a proxy for accuracy defined by the user. These quality functions are represented in terms of SQL aggregation queries: the user provides a list of SQL aggregates (including UDAFs) and a set of weights to combine these aggregates. Section 5 describes examples of quality functions and optimizations that we can apply if we have SQL descriptions.

Example 3.2.

In the running example, Lisa writes a functional dependency check (city_name → city_code) as one quality function. She writes another query that counts the number of singleton city names. The final quality function is a weighted sum of the two functions.

3.3. Asynchronous Architecture

We propose a generate-then-search framework that decouples the execution of the frameworks from pipeline quality evaluation (Figure 3). Each framework runs in a separate thread (or process) and continuously reruns with new parameters provided by the Parameter Sampler. Its outputs are added to the Repair Pool. The Searcher removes repairs from this pool to expand the set of candidate cleaning pipelines, and periodically sends the best pipelines so far to the user. If the pool exceeds a maximum size, it applies back pressure to pause the cleaning operators until the Searcher has removed a sufficient number of conditional assignments from the pool. In practice, the cost to generate candidate assignments is far higher than the search procedure, and back pressure was not needed in our experiments. The Quality Evaluator computes the quality of a candidate pipeline. To make this framework practical, we apply several search optimizations and heuristics. Section 6 describes the search algorithm in detail.

3.4. Discussion

The key benefit of this asynchronous approach is that the search process does not block on a straggler cleaning framework. It is common for parameter settings to affect operator runtimes. For example, inference thresholds and partitioning parameters can have “cliffs”, where a small change in parameters drastically slows down the method. Naively including such parameter settings in the search process would block the entire system. In contrast, AlphaClean simply samples from faster operators until the slow inference task completes. In fact, this design explicitly highlights the connection between the explored search space and resource scheduling. For instance, allocating more CPU resources to more promising operators can affect how the search space is explored.

One drawback of the asynchronous approach is that the Parameter Sampler is oblivious of the search process, so the cleaning operators may generate repairs that are not useful. The Parameter Sampler does not attempt to preferentially sample from “more promising” parameter spaces, and simply uses uniform sampling. Similarly, the Library does not perform resource scheduling, and simply allocates one thread per cleaning operator, and each process executes parameter assignments serially. We will show that using machine learning to identify promising search paths can alleviate this concern.

4. Repairs

A main insight of AlphaClean is that a common intermediate representation for repairs can facilitate more efficient data cleaning pipeline optimization. Data cleaning frameworks that are interfaced to AlphaClean asynchronously pool together suggested data repairs.

4.1. Repair Format

Repairs are specified as “conditional assignments”, which are sentences of the form “if a tuple satisfies a condition, then set an attribute to a specified value”:

    ca(r):
      if pred(r): r[attr] = v
      return r

The predicates are in a restricted language consisting of equality clauses and single attribute inequalities, e.g.,

r[city_code] == ’NY’
r[id] > 3
r[name, code] == (’New York’, ’NY’)

This restriction on predicates allows for efficient conflict testing: determining whether two repairs are independent of each other. Despite the restriction, it is still expressive enough to capture many important types of data cleaning, because, in the degenerate case, we can simply use the tuple’s primary key as the predicate attribute. In that case, each tuple has its own conditional assignment.

Example 4.1.

ca(code.prefix("NY"), code, "NYC") sets the code to "NYC" for all records where the city code starts with "NY". This single condition could be replaced with three operations with predicates id=2, id=3, id=4 for the example table, where each operation can be executed and added to a cleaning pipeline independently.

The interesting cases are when we can aggregate repairs together under a single, more-informative predicate. For example, we often find this with numerical outlier detection libraries that identify a threshold on attribute values after which they are deemed outliers.

Example 4.2.

ca(Population < 0, Population, NULL) sets the population attribute to NULL for all records where it is less than 0.
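As a concrete sketch, a conditional assignment and its application to a relation might be represented as follows (hypothetical Python; the class and helper names are our own, and the restricted predicate language is modeled with a small Pred class):

    from dataclasses import dataclass
    from typing import Dict, List

    Row = Dict[str, object]

    @dataclass(frozen=True)
    class Pred:
        attr: str
        op: str        # one of '==', '<', '>'
        val: object

        def matches(self, r: Row) -> bool:
            x = r[self.attr]
            if self.op == '==':
                return x == self.val
            return x < self.val if self.op == '<' else x > self.val

    @dataclass(frozen=True)
    class CondAssign:
        pred: Pred     # if a tuple satisfies this condition...
        attr: str      # ...set this attribute...
        value: object  # ...to this value.

        def __call__(self, r: Row) -> Row:
            return {**r, self.attr: self.value} if self.pred.matches(r) else r

    def apply_pipeline(pipeline: List[CondAssign], table: List[Row]) -> List[Row]:
        # Compose conditional assignments left to right over every record.
        for ca in pipeline:
            table = [ca(r) for r in table]
        return table

    # Example 4.2: set negative populations to NULL (None)
    fix_pop = CondAssign(Pred("Population", "<", 0), "Population", None)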

4.2. Operations Over Repairs

A cleaning pipeline is defined as a composition of conditional assignments p = c_n ∘ ... ∘ c_2 ∘ c_1, where c_1 is applied first. Note that c_1’s changes may be overwritten by c_2. A composition can similarly be evaluated over a relation R: p(R) = c_n(...c_2(c_1(R))...). The next section will describe the interface to evaluate the quality of a cleaning pipeline in more detail. The basic search problem that underpins AlphaClean is a search over possible compositions of conditional assignments to optimize a quality function Q:

Problem 1 (Search Problem).

Given a quality function Q, a set of frameworks F, and a relation R, find a valid plan p* (a composition of conditional assignments drawn from the frameworks’ repairs) that optimizes Q:

p* = argmax_p Q(p(R))

p*(R) returns the cleaned table, and p* can potentially be applied to any table that is union compatible with R.

In general, conditional assignments are not commutative, meaning that c_i ∘ c_j ≠ c_j ∘ c_i. However, the intermediate representation allows us to efficiently test whether conditional assignments are commutative and can be run in parallel. If c_i’s and c_j’s predicates are non-overlapping, and their assignment values do not affect this independence, then their operations are commutative (c_i ∘ c_j = c_j ∘ c_i). Based on this observation, we use a heuristic to opportunistically merge candidate pipelines if they are disjoint in this way and both pipelines independently increase the quality function.
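Reusing the Pred and CondAssign classes from the earlier sketch, a conservative commutativity test might look like the following. This is our own construction under the stated predicate restrictions, not AlphaClean's exact heuristic:

    def predicates_disjoint(p: Pred, q: Pred) -> bool:
        # Conservative: returns True only if no tuple can satisfy both.
        if p.attr != q.attr:
            return False                  # cannot prove disjointness; assume overlap
        if p.op == '==':                  # p matches exactly one value
            return not q.matches({q.attr: p.val})
        if q.op == '==':
            return not p.matches({p.attr: q.val})
        if (p.op, q.op) == ('<', '>'):
            return p.val <= q.val         # {x < a} and {x > b} disjoint iff a <= b
        if (p.op, q.op) == ('>', '<'):
            return q.val <= p.val
        return False                      # ('<','<') or ('>','>') always overlap

    def commutes(a: CondAssign, b: CondAssign) -> bool:
        # Safe to reorder when the predicates touch disjoint tuples and neither
        # write can change whether a tuple matches the other's predicate.
        writes_independent = (a.attr != b.pred.attr) and (b.attr != a.pred.attr)
        return predicates_disjoint(a.pred, b.pred) and writes_independent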

5. Quality Functions

Now we have to define the API for assessing the quality of a pipeline. A quality function measures a specific notion of cleanliness for a relation, and is used as the cost model for the pipeline search. Our goal is to define these quality functions in a sufficiently “white box” way to be able to share computation across different search expansions when possible.

5.1. SQL Aggregate Queries

Quality functions are defined in terms of SQL aggregation queries. For example, the number of violations of the functional dependency city_name → city_code is expressible as:

  q1(T): SELECT count(1)
         FROM T as c1, T as c2
         WHERE (c1.city_name == c2.city_name) AND
               (c1.city_code <> c2.city_code)

The number of (conditional) functional dependency violations is a well-studied quality function, and many systems optimize for this class of objectives (rekatsinas2017holoclean, ; DBLP:conf/sigmod/ChuIKW16, ).

However, this example highlights that even seemingly simple data cleaning problems can require the flexibility to express multiple quality functions. For example, record 1 does not violate the above functional dependency, and will be missed by most functional dependency solvers. Suppose the analyst observed a histogram of city names and noted that there were a large number of singleton entries. Thus, she could write a second quality function that counts the number of singleton entries. This is an example of a quality measure that other systems such as Holoclean and Holistic Data Cleaning do not support as input (rekatsinas2017holoclean, ; DBLP:conf/sigmod/ChuIKW16, ):

  q2(T): SELECT count(1)
         FROM ( SELECT count(1) as cnt FROM T
                GROUP BY city_name HAVING cnt = 1)

Finally, the user can embed the downstream application as a user defined function (UDF). For instance, the machine learning model accuracy can be added as a quality function that calls a UDF model.eval(). In our experiments using the London Air Quality benchmark, we show how a parametric auto-regressive model that measures curve smoothness can be expressed as a quality function:

  q3(T): SELECT avg(err) AS acc
         FROM ( SELECT (model.eval(X) = Y) AS err FROM T )

AlphaClean lets the user compose linear combinations of quality functions. We model the composition of individual quality functions q_1, ..., q_k as Q(T) = w_1·q_1(T) + ... + w_k·q_k(T). For example, Q = q1 + q2 captures the semantic functional dependency issues as well as the syntactic string splitting errors in a single cost model. Our experiments simply set all weights w_i = 1.
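In code, the composition might look like the following sketch (hypothetical Python; q1 and q2 stand in for implementations of the SQL aggregates above):

    def q1(T): ...   # FD-violation count (stub for the SQL above)
    def q2(T): ...   # singleton-city count (stub for the SQL above)

    def make_quality(terms, weights=None):
        # Compose quality terms q_i(T) into Q(T) = sum_i w_i * q_i(T).
        weights = weights or [1.0] * len(terms)
        return lambda T: sum(w * q(T) for w, q in zip(weights, terms))

    Q = make_quality([q1, q2])   # our experiments set all weights to 1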

We designed the quality function in this way for several reasons. SQL aggregations can be incrementally computed and maintained, and can be efficiently approximated. This is important because each conditional assignment typically modifies a small set of records, and thus allows efficient re-computation that scales to the number of cleaned records rather than the size of the dataset. The linear compositions enables parallelization across each term, and the aggregation functions are typically algebraic functions that can be parallelized across data partitions. The combination of incremental maintenance, and data and quality function parallelization speeds up evaluation by up to 20x in our experiments.

5.2. Incremental Maintenance

Most cleaning operators modify significantly fewer records than the entire dataset. Since quality functions are simply aggregation queries, AlphaClean can incrementally evaluate the quality function over the repaired records rather than the full dataset. This is exactly the process of incremental view maintenance, and we use standard techniques to incrementally compute quality functions.

Suppose we have a relation R, a quality function q, and a set of conditional assignment expressions {c_1, ..., c_k}. When possible, AlphaClean computes q(R) once and then, for each expression c_i, computes a delta such that q(c_i(R)) can be obtained from q(R) without re-reading all of R. For many types of quality functions such incremental computation can be automatically synthesized and can greatly save on computation time. Currently, this process is not automatic and AlphaClean relies on programmer annotations for incremental updates. It is not hard to automate this process when possible, but it is orthogonal to the topic studied in this paper: the property we would have to test is self-maintainability (gupta1996data, ), and we would have to implement delta computation in relational algebra.

Let us consider a concrete example with the quality function q1, the functional dependency checker from the previous section. Let T' = c(T) be the resulting relation after applying the conditional assignment c to all of the records, let P be the set of records of T that satisfy the predicate of c, and let P' be the corresponding transformed records. T' can be expressed in relational algebra as:

T' = (T − P) ∪ P'

Since the records in T − P are unchanged, q1(T') can be described in terms of the violations among T − P plus correction terms that involve only the modified records, leading to the following expression:

q1(T') = q1(T − P) + viol(T − P, P') + viol(P', T − P) + viol(P', P')

where viol(R, S) counts pairs with the first record drawn from R and the second from S that agree on city_name but disagree on city_code.

Evaluating these correction terms using a hash join reduces the incremental evaluation cost to roughly linear in the number of records changed, rather than the size of the relation.
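For intuition, the same delta can also be maintained with per-group counters instead of a join. The following is a minimal sketch (hypothetical Python, our own construction) that keeps q1 up to date in O(1) per repaired cell:

    from collections import Counter

    class FDViolationCounter:
        # Maintains q1: ordered pairs of records that agree on city_name
        # but disagree on city_code.
        def __init__(self, rows):
            self.by_name = Counter()   # city_name -> record count
            self.by_pair = Counter()   # (city_name, city_code) -> record count
            self.q = 0                 # current violation count
            for r in rows:
                self.insert(r["city_name"], r["city_code"])

        def insert(self, name, code):
            # A new record forms 2*n new same-name ordered pairs, of which
            # 2*m share its code and are therefore not violations.
            n, m = self.by_name[name], self.by_pair[(name, code)]
            self.q += 2 * (n - m)
            self.by_name[name] += 1
            self.by_pair[(name, code)] += 1

        def delete(self, name, code):
            self.by_name[name] -= 1
            self.by_pair[(name, code)] -= 1
            n, m = self.by_name[name], self.by_pair[(name, code)]
            self.q -= 2 * (n - m)

        def repair(self, name, old_code, new_code):
            # Delta for one conditional-assignment cell edit.
            self.delete(name, old_code)
            self.insert(name, new_code)

A conditional assignment that modifies k cells then costs O(k) counter updates rather than a self-join over the full relation.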

6. Search Algorithm

This section describes our system optimizations.

Figure 4. In each iteration, each worker starts with a subset of the priority queue (boxes). The driver distributes conditional assignments (circles) to generate candidate pipelines (box-circles). A series of synchronization points identify the globally top candidates and redistributes them across the workers.
Algorithm 1: Pruning Disjoint Paths (inputs Q and S; builds and returns a priority queue Pruned of the surviving candidates).
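To convey the flavor of the search, the following is a heavily simplified single-threaded sketch of the greedy expansion loop (hypothetical Python; quality, pool, and apply_pipeline refer to the earlier sketches, and the real system additionally evaluates qualities incrementally and merges disjoint pipelines):

    import heapq

    def greedy_search(base_table, pool, quality, beam=8, max_iters=50):
        frontier = [([], quality(base_table))]        # (pipeline, score) pairs
        for _ in range(max_iters):
            expansions = []
            for pipeline, score in frontier:
                table = apply_pipeline(pipeline, base_table)
                for ca in pool:
                    # The real system computes this incrementally (Section 5.2).
                    new_score = quality(apply_pipeline([ca], table))
                    if new_score > score:             # drop non-improving paths
                        expansions.append((pipeline + [ca], new_score))
            if not expansions:
                break
            frontier = heapq.nlargest(beam, expansions, key=lambda e: e[1])
        return max(frontier, key=lambda e: e[1])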

6.1. Parameter Sampler

By default, users simply specify an operator’s parameter domain as a list of values, and the Parameter Sampler uniformly samples from the domain. Non-uniform sampling is possible, and worth exploring in future work. In addition, users can specify two types of parameter properties, for which AlphaClean can apply search optimizations:

  • Attribute Name Parameters: If the parameter represents an attribute in the database, then AlphaClean can infer the domain of allowable values. For example, a numerical outlier detection algorithm might apply to a single attribute or a subset of attributes. AlphaClean can also prune the parameter space by excluding attribute names that are irrelevant to the quality function.

  • Threshold Parameters: Numeric parameters are often used as thresholds, inference parameters, or confidence bounds. For these, users specify the most and least restrictive ends of the value domain, and AlphaClean will sweep the space from most to least restrictive, as sketched below. For instance, ispell only uses the dictionary if the attribute value is within rec characters of the dictionary word, so AlphaClean will initially sample small values of rec and gradually relax the threshold.
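A minimal sketch of these two sampling modes (hypothetical Python; the real Parameter Sampler draws the next value whenever an operator thread becomes free):

    import random

    def uniform_sampler(domain):
        # Default: sample parameter values uniformly from a finite domain.
        while True:
            yield random.choice(domain)

    def threshold_sweep(most_restrictive, least_restrictive, steps):
        # Threshold parameters: sweep from most to least restrictive,
        # e.g. ispell's rec from 1 (strict) up to 5 (permissive).
        lo, hi = most_restrictive, least_restrictive
        for i in range(steps):
            yield lo + (hi - lo) * i / (steps - 1)

    # e.g. pair a swept threshold with uniformly sampled attribute names
    params = zip(threshold_sweep(1, 5, 5),
                 uniform_sampler(["city_name", "city_code"]))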

6.2. Parallelization

Even with incremental evaluation, composing candidate pipelines and evaluating their quality is the single most expensive search operation. Thus, we parallelize across candidate pipelines and data partitions. The prototype uses Ray (ray, ) to schedule and parallelize over multiple CPUs and machines.

Search Parallelism: Conceptually, we execute all expansions for a given plan in parallel. We materialize the incremental deltas in memory, and evaluate the quality of each in parallel using a thread pool. Each thread drops a candidate if its quality is lower than the maximum quality from the previous WHILE iteration or from the local thread. At the end of the WHILE iteration, the threads synchronize to compute the highest quality, and flush the remaining candidates using the up-to-date quality value.

The implementation of this conceptual parallelization is a little more complex. Each worker is given a subset of candidate pipelines to locally evaluate and prune, and the main challenge is to reduce task skew through periodic rebalancing. We use a worker-driver model (Figure 4).

Each worker materializes the results of its candidate pipelines in order to incrementally compute the quality function.

Note that the union of the worker-local top candidates is a superset of the global top candidates, because the globally best candidate is some worker’s local best. Thus the workers synchronize with the driver to identify the global best candidate and further prune each worker’s top candidates. At this point, all surviving candidate pipelines are within a tolerance of the globally best candidate, but their distribution across the workers can be highly skewed. AlphaClean performs a final rebalancing step, where each worker sends the number of un-pruned candidates to the driver. Workers with more than their equal share of the total redistribute the extras to workers with too few candidates. When redistributing, workers communicate directly and do not involve the driver (e.g., Worker 2 sends to Worker 1). If the total number of candidates is too small, candidates are randomly chosen to be replicated. Only the pipelines and their qualities are sent; the pipeline results are re-computed by the receiving worker. This ensures that the priority queue in the next iteration is evenly distributed across all workers.
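The following sketch shows one synchronization round (hypothetical Python with a thread pool standing in for Ray workers; the function name, evaluate callback, and fixed top-k cutoff are our own simplifications):

    from concurrent.futures import ThreadPoolExecutor

    def prune_and_rebalance(partitions, evaluate, k=8):
        # partitions: one candidate-pipeline list per worker.
        def local_topk(candidates):
            scored = sorted(((evaluate(p), p) for p in candidates),
                            key=lambda s: s[0], reverse=True)
            return scored[:k]   # union of local top-k covers the global top-k

        with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
            local = list(pool.map(local_topk, partitions))

        # Driver: keep the global top-k across all workers.
        survivors = sorted((s for worker in local for s in worker),
                           key=lambda s: s[0], reverse=True)[:k]

        # Rebalance survivors round-robin so the next iteration is even.
        n = len(partitions)
        return [[p for i, (_, p) in enumerate(survivors) if i % n == w]
                for w in range(n)]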

Data Parallelism: Many large datasets are naturally partitioned, e.g., by timestamp or region. The idea is to partition the dataset in such a way that errors are local to a small number of records, so that a fix for a given record does not affect records outside of its partition. There is a relationship between the partitioning and the quality functions defined: for example, quality functions derived from functional dependencies can define blocks by examining the violating tuples linked through the dependency. Users can also define custom partitioning functions. In our current implementation, we partition the input relation row-wise using user-specified blocking rules, as sketched below.
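A minimal sketch of row partitioning by a user-specified blocking rule (hypothetical Python):

    from collections import defaultdict

    def partition_rows(table, blocking_key):
        # Group rows so repairs in one block cannot affect another block;
        # blocking_key maps a row to a hashable key (a user-specified rule).
        blocks = defaultdict(list)
        for r in table:
            blocks[blocking_key(r)].append(r)
        return list(blocks.values())

    # e.g. block air-quality readings into 5-hour windows by timestamp:
    # blocks = partition_rows(table, lambda r: r["timestamp"] // (5 * 3600))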

6.3. Learning Pruning Rules

Traditionally, data cleaning systems manually implement pruning heuristics for a fixed quality function that can work for any dataset. For example, the Chase algorithm used in functional dependency resolution does not make an edit to the table unless it enforces at least one tuple’s FD relationship. Similarly, in entity matching problems, one restricts the search to only tuples that satisfy a blocking (clustering) criterion. These can be viewed as pre-conditions over the search space.

Our idea is to learn pruning rules that are data and quality function dependent. Our hypothesis is that data errors are often systematic in nature, and correlated with specific attributes and their values. Our pruning optimization seeks to distinguish conditional assignments that are likely to contribute to the final cleaning pipeline, and those that will not. To do so, the basic strategy is to independently execute the Search algorithm on partitions of the dataset, and learn a prediction model.

Approach: As described above, AlphaClean uses data parallelism to execute the search for each block of the dataset in parallel, so each block results in an optimal cleaning pipeline. AlphaClean models the optimal cleaning pipeline for each block as a set of training examples: each conditional assignment in a block’s optimal cleaning plan is labeled as a positive training example, while all other conditional assignments that were not used are negative examples.

As AlphaClean processes more blocks, it trains a classifier to predict whether a given transformation will be included in the optimal program, based on the training examples across the blocks. The prediction model is over the data transformations and not the data; in this sense, AlphaClean learns pruning rules in a dynamic fashion. New expansions are tested against the classifier before the algorithm proceeds. Internally, AlphaClean uses a logistic regression classifier that is biased towards false positives (i.e., keeping a bad search branch) over false negatives (i.e., pruning a good branch). This is done by training the model and shifting the prediction threshold until there are no false negatives.

Featurization: To use this approach, we must featurize each conditional assignment into a feature vector; we do not featurize the data as in other learning-based data cleaning systems. Let A be the list of all of the attributes in the table in some ordering. Every conditional assignment statement is described by a predicate pred, a target attribute targ, and a target value. The subset of attributes referenced by the predicate and the singleton set containing the target attribute can each be encoded as an |A|-dimensional binary vector, where 1 represents the presence of an attribute; call these vectors f_pred and f_targ respectively. We also include information about the provenance of the conditional assignment: a one-hot vector describing which data cleaning method generated it, along with any numerical parameters from that method; call this vector f_prov. The final feature vector is the concatenation of f_pred, f_targ, and f_prov.
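A sketch of the featurization and the biased classifier (hypothetical Python using scikit-learn; the featurization follows the description above, but the exact threshold-shifting construction is our own):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def featurize(ca, attrs, methods, source_method, params):
        # f_pred / f_targ: binary attribute-presence vectors;
        # f_prov: one-hot source method plus its numeric parameters.
        f_pred = [1.0 if a == ca.pred.attr else 0.0 for a in attrs]
        f_targ = [1.0 if a == ca.attr else 0.0 for a in attrs]
        f_prov = [1.0 if m == source_method else 0.0 for m in methods]
        return np.array(f_pred + f_targ + f_prov + list(params))

    class PruningModel:
        def __init__(self):
            self.clf = LogisticRegression()
            self.threshold = 0.0          # permissive until trained

        def fit(self, X, y):              # X: feature matrix, y: 0/1 labels
            self.clf.fit(X, y)
            probs = self.clf.predict_proba(X)[:, 1]
            # Lower the decision threshold until every positive training
            # example (a repair used in some block's optimal plan) passes.
            pos = probs[np.asarray(y) == 1]
            self.threshold = float(pos.min()) if len(pos) else 0.0

        def keep(self, x) -> bool:        # False means prune this expansion
            return self.clf.predict_proba(x.reshape(1, -1))[0, 1] >= self.threshold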

We believe this is one of the reasons why a simple best-first search strategy can be effective. For the initial blocks, AlphaClean searches without a learned pruning rule in order to gather evidence. Over time, the classifier can identify systematic patterns that are unlikely to lead to the final cleaning program, and explore the rest of the space. The features guide AlphaClean towards those data cleaning methods and parameter settings that were most promising on previous blocks. AlphaClean uses a linear classifier because it can be trained quickly with few examples. However, we speculate that across a sufficient number of cleaning problems that share a common set of data transformations (say, within the same department), we may adapt a deep learning approach to automatically learn the features themselves.

Figure 5. Tuning Against Quality Functions. On the x-axis is the search time in seconds, and on the y-axis is the suboptimality w.r.t the quality of the ground truth data. (A) Hospital dataset, (B) London Air Quality Dataset, (C) Physician Dataset. In all three datasets, AlphaClean converges to a more accurate solution faster than the alternatives.
Figure 6. Tuning Against Gold-Standard Data. On the x-axis is the search time in seconds, and on the y-axis is the inaccuracy w.r.t ground truth. (A) Hospital dataset, (B) London Air Quality Dataset, (C) Physician Dataset. In all three datasets, AlphaClean converges to a more accurate solution faster than the alternatives.

7. Experiments

Our goal is to 1) compare AlphaClean with modern blackbox hyper-parameter tuning algorithms, 2) understand its strengths and failure cases, and 3) highlight the promise of a general search-based method through comparisons with data cleaning systems (e.g., HoloClean (rekatsinas2017holoclean, )) that are specialized to specific classes of data errors.

7.1. Datasets and Baselines

We focus on three datasets used in prior data cleaning benchmarks. Each dataset exhibits different sizes and data cleaning needs, and each provides a ground truth cleaned version. We also describe the default cleaning operator libraries for each dataset, informed by prior benchmarks, as well as the baseline hyper-parameter tuning methods.

7.1.1. Datasets and Cleaning Benchmarks

Hospital: This dataset contains UK hospital information, and was used in (he2016interactive, ; rekatsinas2017holoclean, ). Roughly 5% of the cells are corrupted with misspellings, missing values, or other inconsistencies. The default quality function tries to minimize the number of singleton cities with only one hospital, because such singletons may be due to data errors (example in Section 4). The default library contains: ispell.replace(thresh, attr), which as described in Section 4 replaces the attribute value if it is within a threshold of a dictionary value; minhash.replace(thresh, attr), which runs the minhash de-duplication algorithm (broder2000min, ) to find similar values and sets them to be equal; and fd.replace(fd), which enforces a functional dependency with the chase algorithm (aho1979theory, ).

London Air Quality (LAQ): The dataset contains measurements of air-pollution particulate matter from London boroughs (londonair, ). Around 2% of the measurements (cells) are corrupted by a variety of outliers, including very large values as well as clipped very small values. As the errors are mostly numerical in this dataset, the default quality function fits an autoregressive model to sliding windows and computes the average error with respect to the fitted model:

    SELECT AVG(autoregression.error(window))
    FROM data [Range 5 hours];

The default library consists of parametrized outlier detector methods from dBoost (mariet2016outlier, ) and pyod (pyod, ). Both detect outliers and set them to the last known non-outlier value. dBoost.histogram(peak_threshold, outlier_threshold, window_size) detects peaks in histograms of sliding windows of the data; dBoost.gaussian(K, window_size) thresholds values outside K standard deviations from the mean of a sliding window; and pyod.pca(outlier_threshold, window_size) applies PCA to sliding windows and thresholds them by the sum of weighted projected distances to the eigenvector hyperplane.
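For intuition, a toy version of the sliding-window Gaussian detector might look like this (hypothetical Python in the spirit of dBoost.gaussian, not its actual implementation):

    import numpy as np

    def gaussian_outliers(x, K=3.0, window_size=50):
        # Flag x[i] when it lies more than K standard deviations from the
        # mean of its trailing window.
        x = np.asarray(x, dtype=float)
        flags = np.zeros(len(x), dtype=bool)
        for i in range(len(x)):
            w = x[max(0, i - window_size):i + 1]
            mu, sigma = w.mean(), w.std()
            flags[i] = bool(sigma > 0 and abs(x[i] - mu) > K * sigma)
        return flags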

Physician: The Physician Compare dataset was used in HoloClean (rekatsinas2017holoclean, ), and contains information on medical professionals and the primary care practice they are associated with. It contains misspellings, inconsistencies, and missing data. The default quality function is the set of 8 functional dependencies defined in (rekatsinas2017holoclean, ). The default library contains the operators for the Hospital dataset as well as HoloClean, which is wrapped as the operator (holoclean.replace(fd, threshold)) that enforces a functional dependency using HoloClean’s suggested cell value fix if its confidence exceeds a threshold.

7.1.2. Baselines

We consider the following baseline techniques by encoding the data cleaning problem into a large set of parameters as described in Section 2. To speed up the methods, we use the incremental computation optimization for quality function evaluation. Every component is given the same time limit and has to return its best cleaning strategy by that time.

Grid Search: We cascade all of the operators into a fixed order and treat the cascade as a monolithic parametrized unit. We search over all possible values of the discrete parameters and a grid of values over the continuous parameters, and evaluate the quality at the end. We use grid search as a baseline because it is easy to parallelize and compare at scale.

Hyperopt: We use exactly the same setup as Grid Search but instead of searching over a grid, we use python hyperopt to perform a Bayesian optimization and intelligently select parameters. We use hyperopt as a baseline for an optimized single-threaded search through the parameter space.

Greedy: We tune each data cleaning algorithm independently with respect to the original data. We use a grid search scheme on each component independently. We use greedy as a baseline to illustrate the benefits and drawbacks of individual optimization of each data cleaning system.

7.2. End-to-End Experiments

We first compare AlphaClean with the baseline parameter tuning methods on the three benchmark datasets. For all three benchmarks, we add a component to the quality function that penalizes the size of the changes to the cleaned dataset, based on cell-wise edit distance for string values or absolute difference for numeric values. To understand the relative convergence of the methods, we report suboptimality, defined as the ratio of the quality score evaluated on the ground truth to the current best quality score. To understand the absolute cleaning improvements, we report error, defined in terms of the F1 score of the current cleaned cells with respect to the ground truth cleaned cells.

Figure 5 plots suboptimality convergence over search time in seconds. All searches are run with one search thread (AlphaClean uses one extra thread to generate conditional assignments across all cleaning operators in a loop). We find that AlphaClean quickly finds strong solutions, because the asynchronous design allows for partial data cleaning even if early parameter choices are suboptimal. Throughout the search process, AlphaClean is up to 9x higher quality than the next best baseline, and ultimately converges to higher quality solutions.

Figure 6 plots the error rate over search time, but the quality function computes the number of cells that differ from the ground truth dataset. We consider this the best-case gold standard quality function. We see that in this case, AlphaClean converges to the ground truth more quickly than the next best baseline.

7.3. Optimization Contributions

Figure 7. Contribution of the different optimizations. Incremental quality evaluation (Inc), asynchronous search (+Async), and learned pruning models (+Learning) all contribute to improved convergence above Naive. The hospital dataset is too small for learning, and the physician dataset is too large to finish without incremental evaluation.

Figure 7 incrementally removes components of AlphaClean to understand where the benefits come from: incremental quality evaluation (Inc), asynchronous conditional assignment generation (Async), and learning a pruning model (Learn). We find that AlphaClean without any optimizations (Naive) does not finish on the Physician dataset within an hour due to its large size, and that pruning is ineffective for Hospital due to its small size, so we do not include those curves in the plots.

We find that all techniques are crucial. AlphaClean is designed to quickly evaluate a large number of quality functions, thus Inc is a primary performance optimization. Async allows search to quickly explore more pipelines without being blocked by conditional assignment generation, while Learn is able to effectively prune large subsets of the search space when the dataset is large; if the dataset is small there can be too few partitions from which to collect training samples. These optimizations can improve convergence by more than 20x.

Figure 8. Scaling performance on the hospital dataset. AlphaClean can benefit from parallelism.

7.4. AlphaClean Performance Sensitivity

We now study settings that affect AlphaClean convergence rates.

Figure 9. Both experiments are on the hospital dataset. (A) Convergence with redundant cleaning operators. (B) Convergence for short (AC-/HO-) and long (AC+/HO+) operator delays.

Scaling to Cores: The asynchronous search architecture has desirable scaling properties. We compare to Grid search and vary the number of threads given to both frameworks. In AlphaClean, we allocate one thread to each data cleaning method to generate candidate conditional assignments and the remainder to the search algorithm. Figure 8 illustrates the scaling on the hospital dataset. Results suggest that AlphaClean can benefit from parallelism.

Note that most blackbox approaches such as grid search can run cleaning operators in parallel; however, they block until the operators finish before performing a search step (picking and trying a candidate pipeline) and choosing the next parameters to try. Thus, they can be blocked by straggler operators. More sophisticated hyper-parameter tuning algorithms, such as hyperopt, are inherently sequential and do not run cleaning operators in parallel.

Library Size: Figure 9a uses the Hospital benchmark and varies the number of redundant cleaning operators by duplicating the library multiple times. Each duplicate runs in a separate thread. To exploit parallelism, we compare with grid search (Grid) using the same number of threads. AlphaClean performs nearly identically irrespective of the redundancy, while grid search degrades considerably due to the reasons described in the above scaling experiment.

Slow Cleaning Operators: We use the Hospital benchmark to study robustness against slow cleaning operators. Figure 9b compares AlphaClean (AC) to hyperopt (HO) when adding random delays to the cleaning operators. Each operator in AlphaClean runs in a separate thread, whereas hyperopt is a sequential algorithm. AlphaClean is significantly more robust to these delays than hyperopt.

Figure 10. Convergence for short (AC-/HO-) and long (AC+/HO+) quality function evaluation delays.

Slow Quality Evaluation: AlphaClean makes a design assumption that cleaning is the bottleneck, not quality evaluation. Figure 10 runs the Hospital benchmark with varying delays in quality evaluation: AC-/HO- for short random delays, and AC+/HO+ for long random delays. While both AlphaClean (AC) and hyperopt (HO) are affected, AlphaClean is much more sensitive because it evaluates quality functions at a far higher rate than hyperopt.

Figure 11. We apply AlphaClean to the hospital dataset with a coarsened candidate generation scheme. Each data cleaning method produces one full-table transformation per parameter setting (AC-Coarse). While it does not converge to the global solution that the original method (AC-Fine) does, it still provides a benefit due to operator exclusion and re-ordering.

Coarse vs. Fine Predicates: Cleaning operators set the predicate granularity of the conditional assignments that they output. Figure 11 evaluates the trade-off between coarse (AC-Coarse) and fine-grained (AC-Fine) conditional assignment predicates in AlphaClean. We generate coarse predicates by merging all conditional assignments generated by an operator into a single “meta assignment” that applies the set internally. The main difference is that AlphaClean cannot pick and choose from within the set. We see that coarse predicates are initially better because AlphaClean searches through a smaller conditional assignment pool for acceptable pipelines. However, AC-Fine converges to a better plan because it can make finer-grained decisions later on. This suggests a potential coarse-then-fine hybrid approach for future work. We include hyperopt and grid as reference.

Figure 12. On a synthetic dataset with extraction and spelling errors, AlphaClean is able to combine two types of cleaning operators (Split, String) in the appropriate sequence to clean the dataset.

Sequential Data Cleaning: It is possible that a best-first search through an asynchronously generated candidate pool may affect problems where the precise sequence of data cleaning operations matters. In the last experiment, we consider a synthetic dataset similar to the City table in Section 4. We construct a dataset of 10000 tuples with string attributes str1, str2, and a functional dependency str1 → str2. We pick 5% of the tuples and add random spelling errors or randomly swapped values. For 50% of that subset of tuples, we concatenate str1:str2 together with a separator drawn from the three characters ':', ',', '-'. We then set str1 to the resulting string, and str2 to the empty string. Thus, some tuples need to be correctly split before they can be cleaned to resolve functional dependency violations.

We consider two baseline libraries that each solve one type of error: Split only contains the string split operator, String only contains the string edit operators ispell and edit_dist_match. Comb combines both libraries. The quality function is the sum of the number of functional dependency violations, spelling errors, and empty strings. We run AlphaClean with 16 threads.

Figure 12 shows that Comb takes longer than the baselines, but is capable of converging to a higher-quality solution overall. We find that the asynchrony does not affect the sequential dependency and ordering of operations. Because of the tree search, operations that improve the quality score are applied first and those that do not are ignored; these ignored operations may become relevant again in later rounds of the algorithm. It is possible to construct degenerate cases that mislead the pruning model, such as if every tuple must first be split before string edit fixes have any effect, but this is unlikely otherwise.

Takeaways: AlphaClean is designed to explore the plan space by leveraging the structure of data cleaning problems and outperforms generic blackbox parameter tuners. Evidence suggests that AlphaClean scales across cores and is robust to many forms of delays or redundancies, but is highly sensitive to slow quality evaluation. Designing a system that adjusts to slow operators or quality evaluations is a promising direction for future work.

Figure 13. We compare AlphaClean on the physician dataset and the air quality dataset against single standalone systems that address functional dependencies (Holoclean HC) and numerical errors (DBoost) respectively. AlphaClean can support both types of errors and wrap around a variety of frameworks, and tune these frameworks. Standalone system performance on 5 random parameters is shown as dashed lines.

7.5. Comparison w/ Standalone Systems

We now compare AlphaClean with 2 standalone cleaning systems optimized for specific classes of errors: HoloClean (rekatsinas2017holoclean, ) cleans functional dependency violations in the Physician data and dBoost (mariet2016outlier, ) detects numerical errors (we use last known good value as the replacement) in the LAQ data. We compare AlphaClean with the default library, the standalone system, and AlphaClean with the standalone system wrapped as a cleaning operator. Note that AlphaClean’s quality function expresses both benchmarks, whereas each standalone system only expresses one of the two.

Figure 13a-b illustrates the results. Even when a single data cleaning method can directly optimize the quality specification (i.e., integrity constraints), it can be beneficial to apply AlphaClean to address the weak spots of the method. On the physician dataset, Holoclean (HC) achieves an accuracy of 86% on its own, AlphaClean without Holoclean (AC-HC) achieves 73%, and AlphaClean with Holoclean (AC+HC) achieves 91%. Similarly, on the air quality dataset, AC+DBoost achieves the best results, with an even higher accuracy than DBoost on its own. Furthermore, the standalone systems themselves are difficult to tune: Figure 13a-b plots the best version we found through manual parameter tuning (solid lines), as well as 5 runs with randomly sampled parameter values (dashed lines). The random parameters are highly unpredictable and often generate far worse results than either AlphaClean variant.

Takeaways: AlphaClean can model standalone systems as cleaning operators and improve the quality more than AlphaClean or the standalone system on their own.

8. Related Work

Data cleaning is nearly as old as the relational model (codd1970relational, ), and numerous research and commercial systems have been proposed to improve data cleaning efficiency and accuracy (see (rahm2000data, ) for a survey). The recent advances in scalable data cleaning (wang1999sample, ; DBLP:journals/debu/KrishnanWFGKM015, ; khayyat2015bigdansing, ; altowim2014progressive, ; he2016interactive, ; rekatsinas2017holoclean, ) has revealed human-time—finding and understanding errors, formulating desired characteristics of the data, writing and debugging the cleaning pipeline, and basic software engineering—as a dominant bottleneck in the entire data cleaning process (krishnan2016hilda, ). AlphaClean aims to address this bottleneck by using the quality function and conditional assignment API as a flexible and expressive declarative interface to separate high level cleaning goals from how the goals are achieved.

Machine Learning in Data Cleaning: Machine learning has been widely used to improve the efficiency and reliability of data cleaning (DBLP:journals/pvldb/YakoutENOI11, ; yakout2013don, ; gokhale2014corleone, ). It is commonly used to predict an appropriate replacement value for dirty records; for example, Yakout et al. train a model that evaluates the likelihood of a proposed replacement value (yakout2013don, ), and value imputation predicts a missing value from the records that are complete. Because human input is expensive and impractical to apply to entire large datasets, machine learning is also used in combination with crowd-sourcing to extrapolate cleaning rules from small manually-cleaned samples to the rest of the data (gokhale2014corleone, ; DBLP:journals/pvldb/YakoutENOI11, ), and this approach can be coupled with active learning to learn an accurate model with the fewest possible examples (DBLP:journals/pvldb/MozafariSFJM14, ). HoloClean (rekatsinas2017holoclean, ) leverages machine learning to validate repairs with a probabilistic graphical model. AlphaClean, in contrast, uses machine learning within the synthesis process to prune search branches. We see AlphaClean as complementary to these techniques: as increasingly sophisticated cleaners expose more opaque parameters, meta-algorithms such as AlphaClean can help tune and compose them.
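To illustrate the idea of learned pruning, the sketch below trains a simple classifier on features of already-evaluated search branches and prunes new branches the classifier deems unlikely to improve quality; the featurization, threshold, and class names are illustrative assumptions, not AlphaClean’s exact mechanism.

```python
# A minimal sketch of a dynamic pruning rule learned during search.
from sklearn.linear_model import LogisticRegression
import numpy as np

class LearnedPruner:
    def __init__(self, threshold=0.2, min_samples=50):
        self.model = LogisticRegression()
        self.X, self.y = [], []          # observed (features, improved?) pairs
        self.threshold = threshold       # prune below this improvement probability
        self.min_samples = min_samples   # wait for enough evidence before pruning
        self.fitted = False

    def observe(self, features, improved_quality):
        """Record whether an evaluated branch improved the quality function."""
        self.X.append(features)
        self.y.append(int(improved_quality))
        # Refit once we have enough samples and both outcomes are represented.
        if len(self.y) >= self.min_samples and len(set(self.y)) > 1:
            self.model.fit(np.array(self.X), np.array(self.y))
            self.fitted = True

    def should_prune(self, features):
        if not self.fitted:
            return False  # never prune before the model is trained
        p_improve = self.model.predict_proba([features])[0][1]
        return p_improve < self.threshold
```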

Application-Aware Cleaning: Semantics of the downstream application can inform ways to clean the dataset “just enough” for that application. A large body of literature addresses relational queries over databases with errors by focusing on specific classes of queries (altwaijry2015query, ), leveraging constraints over the input relation (2011Bertossi, ), and integrating with crowd-sourcing (DBLP:conf/sigmod/BergmanMNT15, ). Recent work such as ActiveClean (DBLP:journals/pvldb/KrishnanWWFG16, ) extends this line of work to downstream machine learning applications, while Scorpion (DBLP:journals/pvldb/0002M13, ) uses visualization-specified errors to search for approximate deletion transformations. In this context, AlphaClean can embed application-specific cleaning objectives within the quality function; for instance, our London air quality benchmark simply embeds an autoregression model into the quality function (see the sketch below). Recent work on quantifying incompleteness in data quality metrics (chung2016data, ) suggests that the flexibility to embed new quality measures is of practical value.
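The sketch below illustrates one way such an objective could look: a quality function that scores a candidate cleaned series by how well a simple AR(1) model explains it. The model order and scoring here are assumptions for illustration; the benchmark’s actual objective may differ.

```python
# A minimal sketch of an application-specific quality function in the
# spirit of the air quality benchmark: higher scores mean a candidate
# cleaned series is better explained by a simple autoregressive model.
import numpy as np

def ar1_quality(series):
    """Return the negative mean squared residual of an AR(1) fit."""
    x = np.asarray(series, dtype=float)
    x_prev, x_next = x[:-1], x[1:]
    # Least-squares fit of x[t] = a * x[t-1] + b.
    A = np.vstack([x_prev, np.ones_like(x_prev)]).T
    (a, b), *_ = np.linalg.lstsq(A, x_next, rcond=None)
    residuals = x_next - (a * x_prev + b)
    return -np.mean(residuals ** 2)
```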

Generating Cleaning Programs: A composable data cleaning language is the building block for systems like AlphaClean that generate cleaning pipelines. Languages for data transformations have been well studied, and include seminal works by Raman and Hellerstein (raman2001potter, ) on schema transformations and Galhardas et al. (DBLP:conf/vldb/GalhardasFSSS01, ) on declarative data cleaning. These ideas were later extended in the Wisteria project (DBLP:journals/pvldb/HaasKWF015, ) to parameterize the transformations to allow for learning and crowdsourcing. Wrangler (wrangler, ) and Foofah (jin2017foofah, ) are text extraction and transformation systems that similarly formulate their problems as search over a language of text transformations, and develop manual pruning heuristics to reduce the search space. We do not intend for AlphaClean to be applied to schema transformation problems; instead, we design AlphaClean around existing patterns observed in data cleaning, and defer the study of a broader programming-by-example data cleaning suite to future work.

9. Conclusion and Future Work

The research community has developed increasingly sophisticated data cleaning methods (dc, ; rekatsinas2017holoclean, ; DBLP:journals/pvldb/KrishnanWWFG16, ; DBLP:conf/sigmod/ChuIKW16, ; mudgal2018deep, ; doan2018toward, ). The burden on the analyst is gradually shifting away from the design of hand-written data cleaning scripts to building and tuning complex pipelines of automated data cleaning libraries. The main insight of this paper is that tuning pipelines of data cleaning operations is very different from tuning pipelines for machine learning.

Rather than treat each pipeline component as a black-box transformation of the relation, AlphaClean canonicalizes each component’s repairs as conditional assignment operations. Given a library of cleaning operators, their outputs contribute to a pool of conditional assignments. This defines a well-posed search space, namely, the set of all pipelines composed of conditional assignments (sketched below).
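For concreteness, the following is a minimal sketch of the conditional assignment abstraction, with each repair expressed as a (predicate, attribute, value) triple and a pipeline as a sequence of such triples applied in order; the function and variable names are illustrative.

```python
# A minimal sketch of conditional assignment over a relation represented
# as a list of row dictionaries.
def apply_conditional_assignment(rows, predicate, attr, value):
    """Set rows[attr] = value wherever predicate holds; leave others unchanged."""
    return [
        {**row, attr: value} if predicate(row) else row
        for row in rows
    ]

# Example: a repair for a functional-dependency violation (city -> state)
# suggested by some cleaning operator.
rows = [{"city": "Chicago", "state": "IL"},
        {"city": "Chicago", "state": "XX"}]   # dirty tuple
fixed = apply_conditional_assignment(
    rows,
    lambda r: r["city"] == "Chicago" and r["state"] != "IL",
    "state", "IL")
```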

Although our results suggest that leveraging advances in planning and optimization can solve a range of data cleaning benchmarks, the results are counter-intuitive given the greedy nature of the system and its enormous search space. This raises a number of questions about future opportunities in data cleaning. Why does a greedy search achieve strong results on widely-used cleaning benchmarks? Are the benchmarks too simple, or are cleaning problems simply highly structured? We hope to understand the fundamental reasons for when and why search-based approaches perform well.

In addition, we are excited to extend AlphaClean towards a more flexible, visual, and interactive cleaning process. We plan to integrate AlphaClean with a data visualization system (Wu2017CombiningDA, ) so that users can manipulate visualizations directly and have those manipulations translated into quality functions. This will also require work to characterize failure modes and provide high-level tools to debug such cases.

References

  • [1] For big-data scientists, ’janitor work’ is key hurdle to insights. http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html.
  • [2] A. V. Aho, C. Beeri, and J. D. Ullman. The theory of joins in relational databases. ACM Transactions on Database Systems (TODS), 1979.
  • [3] Y. Altowim, D. V. Kalashnikov, and S. Mehrotra. Progressive approach to relational entity resolution. In VLDB, 2014.
  • [4] H. Altwaijry, S. Mehrotra, and D. V. Kalashnikov. Query: a framework for integrating entity resolution with query processing. In VLDB. VLDB Endowment, 2015.
  • [5] P. Bailis, E. Gan, S. Madden, D. Narayanan, K. Rong, and S. Suri. Macrobase: Prioritizing attention in fast data. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 541–556. ACM, 2017.
  • [6] D. Baylor, E. Breck, H.-T. Cheng, N. Fiedel, C. Y. Foo, Z. Haque, S. Haykal, M. Ispir, V. Jain, L. Koc, et al. Tfx: A tensorflow-based production-scale machine learning platform. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1387–1395. ACM, 2017.
  • [7] M. Bergman, T. Milo, S. Novgorodov, and W. C. Tan. Query-oriented data cleaning with oracles. In SIGMOD, 2015.
  • [8] J. Bergstra, D. Yamins, and D. D. Cox. Hyperopt: A python library for optimizing the hyperparameters of machine learning algorithms. Citeseer, 2013.
  • [9] L. E. Bertossi. Database Repairing and Consistent Query Answering. Morgan & Claypool Publishers, 2011.
  • [10] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 60(3):630–659, 2000.
  • [11] X. Chu, I. F. Ilyas, S. Krishnan, and J. Wang. Data cleaning: Overview and emerging challenges. In SIGMOD, 2016.
  • [12] Y. Chung, S. Krishnan, and T. Kraska. A data quality metric (dqm): How to estimate the number of undetected errors in data sets. 2014.
  • [13] E. F. Codd. A relational model of data for large shared data banks. In Communications of the ACM. ACM, 1970.
  • [14] A. Deutsch, A. Nash, and J. B. Remmel. The chase revisited. In PODS, 2008.
  • [15] A. Doan, P. Konda, A. Ardalan, J. R. Ballard, S. Das, Y. Govind, H. Li, P. Martinkus, S. Mudgal, E. Paulson, et al. Toward a system building agenda for data integration (and data science). IEEE Data Eng. Bull., 41(2):35–46, 2018.
  • [16] W. Fan, J. Li, N. Tang, et al. Incremental detection of inconsistencies in distributed data. IEEE Transactions on Knowledge and Data Engineering, 26(6):1367–1383, 2014.
  • [17] H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. Saita. Declarative data cleaning: Language, model, and algorithms. In PVLDB, 2001.
  • [18] C. Gokhale, S. Das, A. Doan, J. F. Naughton, N. Rampalli, J. Shavlik, and X. Zhu. Corleone: Hands-off crowdsourcing for entity matching. In SIGMOD, 2014.
  • [19] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1487–1495. ACM, 2017.
  • [20] A. Gupta, H. V. Jagadish, and I. S. Mumick. Data integration using self-maintainable views. In International Conference on Extending Database Technology, pages 140–144. Springer, 1996.
  • [21] D. Haas, S. Krishnan, J. Wang, M. J. Franklin, and E. Wu. Wisteria: Nurturing scalable data cleaning infrastructure. In VLDB, 2015.
  • [22] J. He, E. Veltri, D. Santoro, G. Li, G. Mecca, P. Papotti, and N. Tang. Interactive and deterministic data cleaning. In Proceedings of the 2016 International Conference on Management of Data, pages 893–907. ACM, 2016.
  • [23] I. Ilyas. Data cleaning is a machine learning problem. http://wp.sigmod.org/?p=2288, 2018.
  • [24] I. F. Ilyas, X. Chu, et al. Trends in cleaning relational data: Consistency and deduplication. Foundations and Trends® in Databases, 5(4):281–393, 2015.
  • [25] Z. Jin, M. R. Anderson, M. Cafarella, and H. Jagadish. Foofah: Transforming data by example. In SIGMOD, 2017.
  • [26] S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Wrangler: interactive visual specification of data transformation scripts. In CHI, 2011.
  • [27] Z. Khayyat, I. F. Ilyas, A. Jindal, S. Madden, M. Ouzzani, P. Papotti, J.-A. Quiané-Ruiz, N. Tang, and S. Yin. Bigdansing: A system for big data cleansing. In SIGMOD, 2015.
  • [28] S. Krishnan, D. Haas, M. J. Franklin, and E. Wu. Towards reliable interactive data cleaning: A user survey and recommendations. In HILDA, 2016.
  • [29] S. Krishnan, J. Wang, M. J. Franklin, K. Goldberg, T. Kraska, T. Milo, and E. Wu. Sampleclean: Fast and reliable analytics on dirty data. In IEEE Data Eng. Bull., 2015.
  • [30] S. Krishnan, J. Wang, E. Wu, M. J. Franklin, and K. Goldberg. Activeclean: Interactive data cleaning for statistical modeling. In PVLDB, 2016.
  • [31] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, 18(1):6765–6816, 2017.
  • [32] R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, and I. Stoica. Tune: A research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118, 2018.
  • [33] London air quality. https://www.londonair.org.uk/london/asp/datadownload.asp.
  • [34] Z. Mariet, R. Harding, S. Madden, et al. Outlier detection in heterogeneous datasets using automatic tuple expansion. 2016.
  • [35] B. Mozafari, P. Sarkar, M. J. Franklin, M. I. Jordan, and S. Madden. Scaling up crowd-sourcing to very large datasets: A case for active learning. In VLDB, 2014.
  • [36] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, and V. Raghavendra. Deep learning for entity matching: A design space exploration. In Proceedings of the 2018 International Conference on Management of Data, pages 19–34. ACM, 2018.
  • [37] E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. In IEEE Data Eng. Bull., 2000.
  • [38] V. Raman and J. M. Hellerstein. Potter’s wheel: An interactive data cleaning system. In VLDB, 2001.
  • [39] Ray: A high-performance distributed execution engine. https://github.com/ray-project/ray.
  • [40] T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Ré. Holoclean: Holistic data repairs with probabilistic inference. arXiv preprint, 2017.
  • [41] S. J. Russell and P. Norvig. Artificial intelligence: a modern approach. Pearson Education Limited, 2016.
  • [42] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
  • [43] E. R. Sparks, S. Venkataraman, T. Kaftan, M. J. Franklin, and B. Recht. Keystoneml: Optimizing pipelines for large-scale advanced analytics. In Data Engineering (ICDE), 2017 IEEE 33rd International Conference on, pages 535–546. IEEE, 2017.
  • [44] J. Wang, S. Krishnan, M. J. Franklin, K. Goldberg, T. Kraska, and T. Milo. A sample-and-clean framework for fast and accurate query processing on dirty data. In SIGMOD, 2014.
  • [45] E. Wu and S. Madden. Scorpion: Explaining away outliers in aggregate queries. In VLDB, 2013.
  • [46] E. Wu, F. Psallidas, Z. Miao, H. Zhang, and L. Rettig. Combining design and performance in a data visualization management system. In CIDR, 2017.
  • [47] M. Yakout, L. Berti-Equille, and A. K. Elmagarmid. Don’t be scared: use scalable automatic repairing with maximal likelihood and bounded changes. In SIGMOD, 2013.
  • [48] M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. In VLDB, 2011.
  • [49] Y. Zhao, Z. Nasrullah, and Z. Li. Pyod: A python toolbox for scalable outlier detection. arXiv preprint arXiv:1901.01588, 2019.