Data cleaning can be naturally framed as probabilistic inference in a generative model, combining a prior distribution over ground-truth databases with a likelihood that models the noisy channel by which the data are filtered and corrupted to yield incomplete, dirty, and denormalized datasets. Based on this view, we present PClean, a probabilistic programming language for leveraging dataset-specific knowledge to clean and normalize dirty data. PClean is powered by three modeling and inference contributions: (1) a non-parametric model of relational database instances, customizable via probabilistic programs, (2) a sequential Monte Carlo inference algorithm that exploits the model’s structure, and (3) near-optimal SMC proposals and blocked Gibbs rejuvenation moves constructed on a per-dataset basis. We show empirically that short (<50-line) PClean programs can be faster and more accurate than generic PPL inference on multiple data-cleaning benchmarks; perform comparably in terms of accuracy and runtime to state-of-the-art data-cleaning systems (unlike generic PPL inference given the same runtime); and scale to real-world datasets with millions of records.
Real-world data is often noisy and incomplete, littered with NULL values, typos, duplicates, and inconsistencies. Cleaning dirty data is important for many workflows, but can be difficult to automate, as it often requires judgment calls about objects in the world (e.g., to decide whether two records refer to the same hospital, or which of several cities called “Jefferson” someone lives in).
This paper presents PClean, a domain-specific generative probabilistic programming language (PPL) for Bayesian data cleaning. Although generative models provide a conceptually appealing approach to data cleaning, they have proved difficult to apply, due to the heterogeneity of real-world error patterns [Abedjan2016] and the difficulty of inference. Like some PPLs (e.g. BLOG [Milch2006]), PClean programs encode generative models of relational domains, with uncertainty about a latent database of objects and relationships underlying a dataset. However, PClean’s approach is inspired by domain-specific PPLs, such as Stan [Carpenter2017] and Picture [Kulkarni2015]: it aims not to serve all conceivable relational modeling needs, but rather to enable fast inference, concise model specification, and accurate cleaning on large-scale problems. It does this via three modeling and inference contributions:
PClean introduces a domain-general non-parametric prior on the number of latent objects and their link structure. PClean programs customize the prior via a relational schema and via generative models for objects’ attributes.
PClean inference is based on a novel sequential Monte Carlo (SMC) algorithm, to initialize the latent database with plausible guesses, and novel rejuvenation updates to fix mistakes.
PClean provides a proposal compiler that generates near-optimal SMC proposals and Metropolis-Hastings rejuvenation proposals given the user’s dataset, PClean program, and inference hints. These proposals improve over generic top-down PPL inference by incorporating local Bayesian reasoning within user-specified subproblems and heuristics from traditional cleaning systems.
Together, this paper’s innovations improve over generic PPL inference techniques, and enable fast and accurate cleaning of challenging real-world datasets with millions of rows.
1.1 Related work
Many researchers have proposed generative models for data cleaning in specific datasets [Pasula2003, Kubica2003, MayfieldJenniferNeville2009, Matsakis2010, Xiong2011, Hu2012, Zhao2012, Abedjan2016, De2016, Steorts2016, Winn2017, DeSa2019]. Generative formulations specify a prior over latent ground truth data, and a likelihood that models how the ground truth is noisily reflected in dirty datasets. In contrast, PClean’s PPL makes it easy to write short (<50-line) programs that specify custom priors for new datasets, and yields inference algorithms that deliver fast, accurate cleaning results.
There is a rich literature on Bayesian approaches to modeling relational data [Friedman1999], including ‘open-universe’ models with identity and existence uncertainty [Milch2006]. Several PPLs could express data cleaning models [milch2005, goodman2012church, dippl, Tolpin2016, Mansinghka2014, bingham2019, Cusumano-Towner2019], but in practice, generic PPL inference is often too slow. This paper introduces new algorithms that enable PClean to scale better, and demonstrates external validity of the results by calibrating PClean’s runtime and accuracy against SOTA data-cleaning baselines [Dallachiesat2013, Rekatsinas2017] that use machine learning and weighted logic (typical of discriminative approaches [Mccallum2003, Wellner2004, Wick2013]). Some of PClean’s inference innovations have close analogues in traditional cleaning systems; for example, PClean’s preferred values from Section 3.3 are related to HoloClean’s notion of domain restriction. In fact, PClean can be viewed as a scalable, Bayesian, domain-specific PPL implementation of the PUD framework from [DeSa2019] (which abstractly characterizes the HoloClean implementation from [Rekatsinas2017], but does not itself include PClean’s modeling or inference innovations).
In this section, we present the PClean modeling language, which is designed for encoding domain-specific knowledge about data and likely errors into concise generative models. PClean programs specify (i) a prior distribution over a latent ground-truth relational database of entities underlying the user’s dataset, and (ii) an observation model describing how the attributes of entities from that latent database are reflected in the observed flat data table. Unlike general-purpose probabilistic programming languages, PClean does not afford the user complete freedom in specifying the prior over databases. Instead, we impose a novel domain-general structure prior on the skeleton of the relational database: the skeleton determines how many entities are in each latent database table, and which entities are related. The user’s program specifies a probabilistic relational model over the attributes of the objects whose existence and relationships are given by the skeleton. This decomposition limits the PClean model class, but enables the development of an efficient sequential Monte Carlo inference algorithm, presented in Section 3.
2.1 PClean Modeling Language
A PClean program (Figure 2) defines a set of classes representing the types of object that underlie the user’s data (e.g. Physician, City), as well as a query that describes how a latent object database informs the observed flat dataset.
Class declarations. The declaration of a PClean class may include three kinds of statement: reference statements, which define a foreign key or reference slot that connects objects of the class to objects of a target class; attribute statements, which define a new field or attribute that objects of the class possess, and declare an assumption about the probability distribution that the attribute typically follows; and parameter statements, which introduce mutually independent hyperparameters shared among all objects of the class, to be learned from the noisy dataset. The distribution of an attribute may depend on the values of a parent set of attributes, potentially accessed via reference slots. For example, in Figure 2, the Physician class has a school reference slot with target class School, and a degree attribute whose value depends on school.degree_dist. Together, the attribute statements specify a probabilistic relational model for the user’s schema (possibly parameterized by the hyperparameters) [Friedman1999].
Query. After its class declarations, a PClean program ends with a query, connecting the schema of the latent relational database to the fields of the observed dataset. The query is an observe statement that names a class modeling the records of the observed dataset (Record, in Figure 2), the names of the columns in the observed dataset, and, for each column, a dot-expression (e.g., physician.school.name) picking out an attribute accessible via zero or more reference slots from that class. We assume that each observed data record represents an observation of selected attributes of a distinct object in the observation class (or objects related to it), and that these attributes are observed directly in the dataset. This means that errors are modeled as part of the latent relational database, rather than as a separate stage of the generative process. For example, Figure 2 models systematic typos in the City field, by associating each Practice with a possibly misspelled version bad_city of the name of the city in which it is located.
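To make the role of dot-expressions concrete, the following minimal Python sketch shows how a query could turn one latent observation object into one flat row. It is an illustration only, not PClean syntax or PClean's implementation: the `Obj` type, the `resolve` and `observe` helpers, the column names, and the placement of the `specialty` attribute are all hypothetical; only the dot-expression `physician.school.name` is taken from the text.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Obj:
    """A latent object (hypothetical): plain attributes plus reference slots."""
    attrs: Dict[str, Any] = field(default_factory=dict)
    refs: Dict[str, "Obj"] = field(default_factory=dict)


def resolve(obj: Obj, dot_expr: str) -> Any:
    """Follow zero or more reference slots, then read the final attribute."""
    *slots, attr = dot_expr.split(".")
    for slot in slots:
        obj = obj.refs[slot]
    return obj.attrs[attr]


def observe(record: Obj, query: Dict[str, str]) -> Dict[str, Any]:
    """Build one flat row: column name -> value picked out by its dot-expression."""
    return {column: resolve(record, expr) for column, expr in query.items()}


# Toy latent database: Record -> Physician -> School (values are illustrative).
school = Obj(attrs={"name": "PCOM"})
physician = Obj(attrs={"degree": "DO"}, refs={"school": school})
record = Obj(attrs={"specialty": "Family Medicine"}, refs={"physician": physician})

# Hypothetical query: observed column name -> dot-expression from the record.
query = {
    "School": "physician.school.name",
    "Degree": "physician.degree",
    "Specialty": "specialty",
}
row = observe(record, query)
```

Because the row is read directly off the latent objects, any noise (such as a misspelled city) has to live in a latent attribute, which matches the modeling choice described above.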
2.2 Non-parametric Structure Prior
A PClean program’s class declarations specify a probabilistic relational model that can be used to generate the attributes of objects in the latent database, but does not encode a prior over how many objects exist in each class or over their relationships. (The one exception is the designated observation class, whose objects are assumed to be in one-to-one correspondence with the rows of the observed dataset.) In this section, we introduce a domain-general structure prior that encodes a non-parametric generative process over the object sets associated with each class, and over the values of each object’s reference slots. The prior is parameterized by the number of observed data records; it places mass only on relational skeletons in which there are exactly that many objects in the observation class and every object in another class is connected via some chain of reference slots to one of them.
PClean’s generative process for relational skeletons is shown in Figure 3. First, with probability 1, we fix the object set of the observation class to contain exactly one object per observed record. (The objects here are natural numbers, but any choice will do; all that matters is the cardinality of the set.) PClean requires that the directed graph with an edge for each reference slot is acyclic, which allows us to generate the remaining object sets class-by-class, processing a class only after processing any classes with reference slots targeting it. In order to generate an object set for a class, we first consider its reference set: the set of all objects with reference slots that point to it, each paired with the slot in question.
The elements of the reference set are pairs of an object and a reference slot; if a single object has two reference slots targeting the class, then the object will appear twice in the reference set. The point is to capture all of the places in the database that will refer to objects of the class.
Now, instead of first generating an object set and then assigning the reference slots that target it, we directly model the co-reference partition of the reference set, i.e., we partition the references to objects of the class into disjoint subsets, within each of which we take all references to point to the same target object. To do this, we use the two-parameter Chinese restaurant process (CRP), which defines a non-parametric distribution over partitions of its set-valued parameter. The strength and discount parameters control the sizes of the clusters. We use the CRP to generate a partition of all references to the class, and treat the resulting partition as the class’s object set, i.e., each component defines one object of the class.
To set a reference slot with this target class, we simply look up which partition component the corresponding (object, slot) pair was assigned to. Since we have equated these partition components with objects of the class, we can directly set the slot to point to the component (object) that contains the pair as an element.
This procedure can be applied iteratively to generate object sets for every relevant class, and simultaneously to fill all these objects’ reference slots.
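The class-by-class procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not PClean's implementation: the function names, the toy Record, Physician, and School schema, and the parameter values are hypothetical, and the seating rule is the standard two-parameter CRP (an existing component of size n_k is chosen with weight n_k minus the discount, a new component with weight strength plus discount times the current number of components).

```python
import random
from typing import Dict, Hashable, List, Sequence, Tuple

Ref = Tuple[Hashable, str]  # (referring object, name of its reference slot)


def crp_partition(refs: Sequence[Ref], strength: float, discount: float,
                  rng: random.Random) -> List[List[Ref]]:
    """Partition `refs` with a two-parameter CRP (this sketch assumes strength > 0)."""
    tables: List[List[Ref]] = []
    for ref in refs:
        # Existing component of size n_k: weight n_k - discount.
        # New component: weight strength + discount * (number of components).
        weights = [len(t) - discount for t in tables]
        weights.append(strength + discount * len(tables))
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append([ref])
        else:
            tables[k].append(ref)
    return tables


def slot_targets(tables: List[List[Ref]]) -> Dict[Ref, int]:
    """Point each (object, slot) pair at the component (object) that contains it."""
    return {ref: k for k, table in enumerate(tables) for ref in table}


rng = random.Random(0)

# Observation class: one object per observed record (here, 5 records).
physician_refs = [(i, "physician") for i in range(5)]
physicians = crp_partition(physician_refs, strength=1.0, discount=0.2, rng=rng)
physician_of = slot_targets(physicians)

# Next class in topological order: every Physician object has a `school` slot.
school_refs = [(k, "school") for k in range(len(physicians))]
schools = crp_partition(school_refs, strength=1.0, discount=0.2, rng=rng)
school_of = slot_targets(schools)
```

Each component of the partition doubles as an object identifier, so no separate step is needed to decide how many objects a class contains.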
PClean’s non-parametric structure prior ensures that PClean models admit a sequential representation, which can be used as the basis of a resample-move sequential Monte Carlo inference scheme (Section 3.1). However, if the SMC and rejuvenation proposals are made from the model prior, as is typical in PPLs, inference will still require prohibitively many particles to deliver accurate results. To address this issue, PClean uses a proposal compiler that exploits conditional independence in the model to generate fast enumeration-based proposal kernels for both SMC and MCMC rejuvenation (Section 3.2). Finally, to help users scale these proposals to large data, we introduce inference hints, lightweight annotations in the PClean program that can divide variables into subproblems to be separately handled by the proposal, or direct the enumerator to focus its efforts on a dynamically computed subset of a large discrete domain (Section 3.3).
3.1 Per-observation sequential Monte Carlo with per-object rejuvenation
One representation of the PClean model’s generative process was given in Section 2: a skeleton can be generated from the structure prior, then attributes can be filled in using the user-specified probabilistic relational model. Finally, an observed dataset can be generated from the resulting database according to the query. But a key feature of our model is that it also admits a sequential representation, in which the latent relational database is built in stages: at each stage, a single record is added to the observation class, along with any new objects in other classes that it refers to. Using this representation, we can run sequential Monte Carlo on the model, building a particle approximation to the posterior that incorporates one observation at a time.
Database increments. Consider a database with a designated observation class, and assume that the object set for the observation class consists of the natural numbers from 1 up to the number of observed records. Then the database’s increment for a given observation object is the set of objects referenced by that observation object, but not by any earlier observation object, along with their attribute values and the targets of their reference slots. Objects in an increment may refer to other objects within the increment, or in earlier increments.
Sequential generative process. Figure 4 shows a generative process equivalent to the one in Section 2, but which generates the attributes and reference slots of each increment sequentially. Intuitively, the database is generated via a Chinese-restaurant ‘social network’: Consider a collection of restaurants, one for each class, where each table serves a dish representing an object of that class. Upon entering a restaurant, customers either sit at an existing table or start a new one, as in the usual generalized CRP construction. But these restaurants require that to start a new table, customers must first send friends to other restaurants (one to the target of each reference slot). Once the friends are seated at these parent restaurants, they phone the original customer to help decide what to order, i.e., how to sample the attributes of the new table’s object, informed by the friends’ dishes (the objects of the target classes). The process starts with customers at the observation class’s restaurant, who sit at separate tables; each customer who sits down triggers the sampling of one increment.
SMC inference with object-wise rejuvenation. The sequential representation yields a sequence of intermediate unnormalized target densities for SMC, each incorporating one more observed record than the last.
Particles are initialized to hold an empty database, to which proposed increments are added each iteration. As is typical in SMC, at each step, the particles are reweighted according to how well they explain the new observed data, and resampled to cull low-weight particles while cloning and propagating promising ones. This process allows the algorithm to hypothesize new latent objects as needed to explain each new observation, but not to revise earlier inferences about latent objects (or delete previously hypothesized objects) in light of new observations; we address this problem with MCMC rejuvenation moves. These moves select an object and update all of its attributes and reference slots in light of all relevant data incorporated so far. In doing so, these moves may also lead to the “garbage collection” of objects that are no longer connected to the observed dataset, or to the insertion of new objects as targets of the selected object’s reference slots.
3.2 Compiling data-driven SMC proposals
Proposal quality is the determining factor for the quality of SMC inference: at each step of the algorithm, a proposal generates proposed additions to the existing latent database to explain the newly observed data point. A key limitation of the sequential Monte Carlo implementations in most general-purpose PPLs today is that the proposals are not data-driven, but rather based only on the prior: they make blind guesses as to the latent variable values and thus tend to make proposals that explain the data poorly. By contrast, PClean compiles proposals that use exact enumerative inference to propose discrete variables in a data-driven way. This approach extends ideas from [arora2012gibbs] to the block Gibbs rejuvenation and block SMC setting, with user-specified blocking hints. These proposals are locally optimal for models that contain only discrete finite-domain variables, meaning that of all possible proposals they minimize the divergence between two distributions over the extended latent database.
The first distribution represents a perfect sample from the target given the earlier observations, extended with the proposal. The second distribution is the target given all data points up to and including the new one. In our setting, the locally optimal proposal is the model’s exact conditional distribution over the new increment, given the existing latent database and the new observation.
Algorithm 1 shows how to compile this distribution to a Bayesian network; when the latent attributes have finite domains, the normalizing constant can be computed and the locally optimal proposal can be simulated (and evaluated) exactly. This is possible because there are only a finite number of instantiations of the random increment to consider. The compiler generates efficient enumeration code separately for each pattern of missing values it encounters in the dataset, exploiting conditional independence relationships in each Bayes net to yield potentially exponential savings over naive enumeration. A similar strategy can be used to compile data-driven object-wise rejuvenation proposals, and to handle some continuous variables with conjugate priors; see supplement for details.
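For a single discrete latent variable, the enumerative strategy reduces to a few lines. The sketch below is a naive illustration with hypothetical names and toy numbers; it does not exploit conditional independence the way the compiled Bayesian networks of Algorithm 1 do. It scores every value in a finite domain by prior times likelihood, normalizes, and samples. With this choice of proposal, the incremental SMC weight is the normalizing constant itself, whichever value happens to be drawn.

```python
import random
from typing import Callable, List, Sequence, Tuple


def enumerate_proposal(domain: Sequence[str],
                       prior: Callable[[str], float],
                       likelihood: Callable[[str, str], float],
                       observation: str) -> Tuple[List[float], float]:
    """Return (proposal probabilities over `domain`, normalizing constant)."""
    joint = [prior(v) * likelihood(observation, v) for v in domain]
    z = sum(joint)
    if z == 0.0:
        raise ValueError("no value in the domain can explain the observation")
    return [j / z for j in joint], z


def propose(domain, prior, likelihood, observation, rng: random.Random):
    """Sample from the enumerated proposal; the incremental weight is z."""
    probs, z = enumerate_proposal(domain, prior, likelihood, observation)
    value = rng.choices(list(domain), weights=probs)[0]
    return value, z


# Toy model (all numbers illustrative): a latent city name observed through a
# channel that reports the true name with probability 0.9.
domain = ["Abingdon", "Arlington"]
prior_table = {"Abingdon": 0.7, "Arlington": 0.3}


def noisy_match(observed: str, latent: str) -> float:
    return 0.9 if observed == latent else 0.1


probs, z = enumerate_proposal(domain, prior_table.get, noisy_match, "Arlington")
value, weight = propose(domain, prior_table.get, noisy_match, "Arlington",
                        random.Random(0))
```

The cost of this naive version grows with the product of the domain sizes of all variables in the increment, which is the motivation for exploiting conditional independence and for the inference hints of Section 3.3.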
3.3 Scaling to large data with inference hints
Scaling to models with large-domain variables and to datasets with many rows is a key challenge. In PClean, users can specify lightweight inference hints to the proposal compiler, shown in gray in Figure 2, to speed up inference without changing the model’s meaning.
Programmable subproblems. First, users may group attribute and reference statements into blocks by wrapping them in a subproblem block. This partitions the attributes and reference slots of a class into an ordered list of subproblems, which SMC uses as intermediate target distributions. This makes enumerative proposals faster to compute, at the cost of considering less information at each step; rejuvenation moves can often compensate for short-sighted proposals.
Adaptive mixture proposals with dynamic preferred values. A random variable within a model may be intractable to enumerate. For example, string_prior(1, 100) is a distribution over all strings between 1 and 100 letters long. To handle these, PClean programs may declare preferred values hints. Instead of declaring only the attribute’s distribution, the user can additionally supply a final expression that gives a list of values on which the posterior mass is expected to concentrate. When enumerating, PClean replaces the attribute’s CPD with a surrogate, which is equal to the original CPD on the preferred values, but 0 for all other values. The mass not captured by the preferred values is assigned to a special other token. Enumeration yields a partial proposal over a modified domain; the full proposal first draws from the partial proposal, then replaces other tokens with samples from the appropriate CPDs. This yields a mixture proposal between the enumerative posterior on preferred values and the prior: when none of the preferred values explain the data well, other will dominate, causing the attribute to be sampled from its prior. But if any of the preferred values are promising, they will almost certainly be proposed.
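The preferred-values construction can be illustrated with the following Python sketch. All names here are hypothetical, and one detail is an assumption of the sketch rather than something stated in the text: the caller supplies a `score` function giving the likelihood assigned to each preferred value and to the other token during enumeration.

```python
import random
from typing import Callable, Dict, Hashable, Sequence

OTHER = object()  # stand-in for the special "other" token


def surrogate_table(cpd_prob: Callable[[str], float],
                    preferred: Sequence[str]) -> Dict[Hashable, float]:
    """Surrogate CPD: original mass on preferred values, leftover mass on OTHER."""
    table: Dict[Hashable, float] = {v: cpd_prob(v) for v in preferred}
    table[OTHER] = 1.0 - sum(table.values())
    return table


def mixture_propose(cpd_prob: Callable[[str], float],
                    cpd_sample: Callable[[random.Random], str],
                    preferred: Sequence[str],
                    score: Callable[[Hashable], float],
                    rng: random.Random) -> str:
    """Enumerate over preferred values plus OTHER, then expand OTHER via the CPD."""
    table = surrogate_table(cpd_prob, preferred)
    keys = list(table)
    weights = [table[k] * score(k) for k in keys]
    choice = rng.choices(keys, weights=weights)[0]
    # An OTHER draw falls back to sampling the attribute from its original CPD.
    return cpd_sample(rng) if choice is OTHER else choice


# Toy usage (numbers illustrative): a string-valued attribute whose CPD puts
# small mass on two preferred spellings and the rest elsewhere.
cpd = {"Abingdon": 0.01, "Arlington": 0.02}
table = surrogate_table(lambda v: cpd[v], list(cpd))
rng = random.Random(0)
promising = mixture_propose(lambda v: cpd[v], lambda r: "Xyzzy", list(cpd),
                            lambda k: 1.0 if k == "Abingdon" else 0.0, rng)
fallback = mixture_propose(lambda v: cpd[v], lambda r: "Xyzzy", list(cpd),
                           lambda k: 1.0 if k is OTHER else 0.0, rng)
```

In the first call only a preferred value explains the data, so it is always proposed; in the second, no preferred value does, so the draw falls through to the CPD sampler, mirroring the mixture behavior described above.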
In this section, we demonstrate empirically that (1) PClean’s inference works when standard PPL inference strategies fail, (2) short PClean programs suffice to compete with existing data cleaning systems in both runtime and accuracy, and (3) PClean can scale to large real-world datasets. Experiments were run on a laptop with a 2.6 GHz CPU and 32 GB of RAM.
(1) Comparison to Generic PPL Inference. We evaluate PClean’s inference against standard PPL inference algorithms reimplemented to work on PClean models, on a popular benchmark from the data cleaning literature (Figure 5). We do not compare directly to other PPLs’ implementations, because many (e.g. BLOG) cannot represent PClean’s non-parametric prior. Some languages (e.g. Turing) have explicit support for non-parametric distributions, but could not express PClean’s recursive use of CRPs. Others could in principle express PClean’s model, but would complicate an algorithm comparison in other ways: Venture’s dynamic dependency tracking is thousands of times slower than SOTA; Pyro’s focus is on variational inference, hard to apply in PClean models; and Gen supports non-parametrics only via the use of mutation in its slower dynamic modeling language (making SMC ) or via low-level extensions that would amount to reimplementing PClean using Gen’s abstractions. Nonetheless, the algorithms in Figure 5 are inspired by the generic automated inference provided in many PPLs, which use top-down proposals from the prior for SMC, MH [dippl, ritchie2016c3], and PGibbs [wood2014new, Murray2015, Mansinghka2014]. Our results show that PClean suffices for fast, accurate inference where generic techniques fail, and also demonstrate why inference hints are necessary for scalability: without subproblem hints, PClean takes much longer to converge, even though it eventually arrives at a similar value.
|Time||4.5s||1m 10s||1m 32s||27.6s||22.8s|
|Time||1m 20s||20m 16s||13m 43s||13s||7.2s|
(2) Applicability to Data Cleaning. To check that PClean’s modeling and inference capabilities are good for data cleaning in absolute terms (rather than relative to generic PPL inference), we contextualize PClean’s accuracy and runtime against two SOTA data-cleaning systems on three benchmarks with known ground truth (Table 1), described in detail in the supplement. Briefly, the datasets are Hospital, a standard benchmark with artificial typos in 5% of cells; Flights, a standard benchmark resolving flight details from conflicting real-world data sources; and Rent, a synthetic dataset based on census data, with continuous and discrete values. The systems are HoloClean [Rekatsinas2017], based on probabilistic machine learning, and NADEEF, which uses MAX-SAT solvers to adjudicate between user-defined cleaning rules [Dallachiesat2013]. For HoloClean, we consider both the original code and the authors’ latest (unpublished) version on GitHub; for NADEEF, we include results both with NADEEF’s built-in rules interface alone and with custom, handwritten Java rules.
Table 1 reports F1 scores and cleaning speed (see supplement for precision/recall). We do not aim to anoint a single “best cleaning system,” since optimality depends on the available domain knowledge and the user’s desired level of customization. Further, while we followed system authors’ per-dataset recommendations where possible, a pure system comparison is difficult, since each system relies on its own rule configuration. Rather, we note that short (<50-line) PClean programs can encode knowledge useful in practice for cleaning diverse data, and inference is good enough to achieve F1 scores as good as or better than SOTA data-cleaning systems on all three datasets, often in less wall-clock time. Additionally, PClean programs are concise, and e.g. could encode in a single line what required 50 lines of Java for NADEEF (see supplement).
(3) Scalability to large, real-world data. We ran PClean on the Medicare Physician Compare National dataset, shown earlier in Figure 1. It contains 2.2 million records, each listing a clinician and a practice location; the same clinician may work at multiple practices, and many clinicians may work at the same practice. NULL values and systematic errors are common (e.g. consistently misspelled city names for a practice).
Running PClean took 7h36m, changing 8,245 values and imputing 1,535,415 missing cells. In a random sample of 100 imputed cells, 90% agreed with manually obtained ground truth. We also manually checked PClean’s changes, and 7,954 (96.5%) were correct. Of these, some were correct normalization (e.g. choosing a single spelling for cities whose names could be spelled multiple ways). To calibrate, NADEEF changed only 88 cells across the whole dataset, and HoloClean did not initialize in 24 hours, using the configuration provided by HoloClean’s authors.
Figure 1 shows PClean’s real behavior on four rows. Consider the misspelling Abington, MD, which appears in 152 entries. The correct spelling Abingdon, MD occurs in only 42. However, PClean recognizes Abington, MD as an error because all 152 instances share a single practice address, and errors are modeled as happening systematically at the practice level. Next, consider PClean’s correct inference that K. Ryan’s degree is DO. PClean leverages the fact that her school PCOM awards more DOs than MDs, even though more Family Medicine doctors are MDs than DOs. All parameters enabling this reasoning are learned from the dirty data.
PClean, like other domain-specific PPLs, aims to be more automated and scalable than general purpose PPLs, by leveraging structure in its restricted model class to deliver fast inference. At the same time, it aims to be expressive enough to concisely solve a broad class of real-world data cleaning problems.
One direction for future research is to quantify the ease-of-implementation, runtime, accuracy, and program length tradeoffs that PClean users can achieve, given varying levels of expertise. Rigorous user studies could calibrate these results against other data cleaning, de-duplication, and record linkage systems. One challenge is to account for the subtle differences in the knowledge representation approach between PClean (causal and generative) and most other data cleaning systems (based on learning and/or weighted logic).
It may be possible to relax PClean’s modeling restrictions without sacrificing inference performance and accuracy. One approach could be to integrate custom open-universe priors with explicit number statements and recursive object-level generative processes.
The authors are grateful to Zia Abedjan, Marco Cusumano-Towner, Raul Castro Fernandez, Cameron Freer, Divya Gopinath, Christina Ji, Tim Kraska, George Matheos, Feras Saad, Michael Stonebraker, Josh Tenenbaum, and Veronica Weiner for useful conversations and feedback, as well as to anonymous referees on earlier versions of this work. This work is supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. 1745302; DARPA, under the Machine Common Sense (MCS) and Synergistic Discovery and Design (SD2) programs; gifts from the Aphorism Foundation and the Siegel Family Foundation; a research contract with Takeda Pharmaceuticals; and financial support from Facebook, Google, and the Intel Probabilistic Computing Center.
Appendix A Baseline Inference Algorithms
The paper’s Figure 5 shows median accuracy vs. time for five independent runs of nine inference algorithms. These results were computed using the PClean program shown in Appendix B.4.1, on a version of the Hospital dataset (Appendix B.1) with 20% of its cells deleted at random, to test both repair and imputation (the original Hospital dataset has many errors, but very few missing cells). Below, we give descriptions of each inference algorithm we test:
(1) PClean SMC (2 particles) followed by PClean rejuvenation is the inference algorithm described in Section 3. First, a complete run of 2-particle sequential Monte Carlo, using PClean’s enumeration-based compiled proposals, is completed, incorporating all 1000 rows of the dataset. Then, one of the two particles is selected, and for each object in its latent database, a block rejuvenation MCMC kernel is run, also using PClean’s enumeration-based compiled proposal. (The number of MCMC moves completed during this sweep will depend on the number of objects inferred for the latent database, a quantity that varies from run to run. See note below this list for an explanation of how median accuracies were computed across runs with different numbers of iterations.)
(2) PClean SMC (2 particles) is the same as (1) except that no rejuvenation sweep is performed.
(3) PClean SMC (2 particles) followed by PClean rejuvenation, no subproblem hints is the same as (1), except we disregard subproblem hints in the PClean program. (The program in question, shown in Appendix B.4.1, has two subproblem hints.) As a result, SMC takes bigger steps, and enumerative proposals take longer to execute (but are higher quality).
(4) PClean SMC (20 particles) followed by PClean rejuvenation is the same as (1) except with 20 particles, instead of 2.
(5) PClean MCMC initializes the latent database using ancestral sampling, i.e., from the prior, but modified to use observed values when they are available. It then performs two complete MCMC sweeps, using PClean’s block rejuvenation proposals; each sweep performs an MCMC move for each object in the current latent database.
(6) Generic MCMC initializes the latent database as in (5), and performs ten complete sweeps using single-site Metropolis-Hastings (tens of thousands of accept/reject steps). That is, each individual attribute or reference slot is separately updated, using the prior as proposal. When a reference slot is proposed, there is a chance that a new object is also proposed as its target. We note that our implementation is much faster than most PPLs’ single-site Metropolis-Hastings implementations, as it re-evaluates only those likelihood terms affected by the proposed single-variable change.
(7) Generic SMC (100 particles) followed by generic PGibbs rejuvenation (100 particles) initializes the latent database using 100-particle sequential Monte Carlo, using the same sequence of target distributions as in PClean SMC, but with the prior as a proposal. This is followed by three sweeps of Particle Gibbs rejuvenation moves: as in PClean rejuvenation from (1), we perform per-object updates, but the proposal is generated not via PClean’s enumerative proposal compiler, but rather by using 100-particle conditional sequential Monte Carlo (CSMC) [like a Gibbs move, this proposal is always accepted]. We note that this baseline improves over existing PPLs’ support for Particle Gibbs in several ways. First, Particle Gibbs updates only those variables connected to a particular latent object, rather than trying to update the entire model state at once. Second, incremental SMC weights are computed incrementally, evaluating only those likelihood terms that are necessary. Third, a reweighting (and, based on ESS, possibly resampling) step is triggered whenever a new likelihood term could possibly be evaluated, regardless of how the PClean program is written. However, unlike PClean’s rejuvenation moves (but like many other PPL implementations), our “generic PGibbs rejuvenation” uses proposals from the prior for its CSMC sweeps, greatly limiting its effectiveness. (We note that delayed sampling [murray2018delayed, wigren2019parameter] is a sophisticated PPL technique that could provide benefits similar to those provided by PClean’s proposal; however, to our knowledge delayed sampling is not implemented in any PPL capable of performing SMC in PClean’s model.)
(8) Generic SMC (100 particles) followed by generic rejuvenation initializes the latent database using 100-particle sequential Monte Carlo, as in (7). It then performs five single-site Metropolis-Hastings rejuvenation sweeps (tens of thousands of accept/reject steps), as described in (6).
(9) Generic SMC (100 particles) initializes the latent database as in (7) and performs no additional rejuvenation.
For each run of each algorithm, time and accuracy were measured after each SMC step or MCMC transition. Since steps/transitions finished at different timestamps across runs, and because each run of an algorithm lasted a different number of steps (due to the stochastic number of objects in the latent database), we used linear interpolation to approximate a continuous time/accuracy curve for each run. Then, to plot median performance across the five runs, we took the median value across the interpolated curves at a fixed set of times. In all nine algorithms, all five runs ended at roughly the same time; the plotted endpoint for each algorithm was chosen as the time when the last run was complete. For any run that finished slightly earlier, the accuracy value was extrapolated as the accuracy at its last timestamp.
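The curve construction described above can be sketched as follows. This is a minimal illustration, not the plotting code used for the paper: the function name `median_curve`, the three made-up runs, and the five-point time grid are all assumptions chosen for the example.

```python
import numpy as np

def median_curve(runs, grid):
    """runs: list of (times, accuracies) pairs, one per run, times increasing.
    Returns the pointwise median accuracy on the shared time grid."""
    curves = []
    for times, accs in runs:
        # np.interp interpolates linearly between measurements and, past the
        # last timestamp of a run, holds that run's final accuracy.
        curves.append(np.interp(grid, times, accs))
    return np.median(np.vstack(curves), axis=0)

# Three hypothetical runs whose steps finished at different timestamps.
runs = [
    ([0.0, 1.0, 2.0], [0.0, 0.5, 0.9]),
    ([0.0, 2.0, 4.0], [0.0, 0.6, 0.8]),
    ([0.0, 1.5, 3.0], [0.0, 0.3, 0.7]),
]
end_time = max(times[-1] for times, _ in runs)  # time when the last run completes
grid = np.linspace(0.0, end_time, 5)            # fixed set of evaluation times
median_acc = median_curve(runs, grid)
```

Because `np.interp` clamps to the last sample outside the measured range, runs that finish slightly early contribute their final accuracy to the median at later grid points, matching the extrapolation rule stated above.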
Appendix B Evaluation on Data Cleaning Benchmarks: Datasets, Systems, and System Configurations
Table 1 of our paper provides evidence of PClean’s applicability to data-cleaning problems, by comparing accuracy and runtime for three PClean programs against state-of-the-art data cleaning systems applied to the same benchmark datasets. The table reports F1 scores, but omits the breakdown in terms of recall and precision, the metrics from which F1 is derived. The table below presents a fuller picture:
|Time||4.5s||1m 10s||1m 32s||27.6s||22.8s|
|Time||1m 20s||20m 16s||13m 43s||13s||7.2s|
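For readers who want to check reported scores against their components, the relationship between F1, precision, and recall can be sketched as below. The helper name `f1_score` is an assumption for illustration; the two example inputs are precision/recall pairs taken from the uncertainty table in Appendix D.5.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (taken to be 0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision/recall pairs from the Appendix D.5 table (iterations 1 and 2).
f1_iter1 = f1_score(precision=0.844, recall=0.835)
f1_iter2 = f1_score(precision=0.914, recall=0.888)
```

Rounded to three decimals, these reproduce the F1 values listed alongside those pairs in that table (0.839 and 0.901).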
The remainder of this appendix describes in detail: each benchmark dataset (Appendix B.1), each baseline system (Appendix B.2), the HoloClean and NADEEF configurations used for each baseline (emphasizing the ways in which we attempted to encode dataset-specific domain knowledge) (Appendix B.3), and the PClean programs we used for each dataset (Appendix B.4).
b.1 Description of Benchmarks
The three smaller benchmarks are included in the supplementary code zip; Physicians is excluded for size, but is available online.
Hospital is a real-world Medicare dataset, but with artificially introduced typos in approximately 5% of its 19,000 cells (1000 rows, 19 columns). Each row reports the performance of a particular hospital on a particular metric, and it includes metadata such as hospital address and phone number. This leads to a lot of duplicated information, as the same hospital appears multiple times (with different metrics), and the same metrics also appear multiple times (with different hospitals). All this duplication facilitates accurate cleaning even in the presence of typos.
Flights consists of 2,377 rows describing real-world flights, their scheduled departure/arrival times, and their true departure/arrival times, as scraped from the web. These times often conflict between the sources, so the task is to integrate them to form a consistent dataset. We use the version from [Mahdavi2019].
Rents is a new synthetic dataset of apartment listings that we derived from census and housing statistics [USCensusBureau2019]. It contains bedroom size, rent, county, and state. We first generated a clean dataset with 50,000 rows in the following manner:
The county-state combination is chosen proportionally to its population in the United States.
The size of the apartment is chosen uniformly from studio, 1 bedroom, 2 bedroom, 3 bedroom, 4 bedroom.
The rent is chosen according to a normal distribution in which the mean is the median rent for an apartment of the chosen size in the chosen county and the standard deviation is chosen to be 10% of the mean.
The dataset was then dirtied in the following ways:
10% of state names are deleted (many counties exist across multiple states, e.g. 30 states have a Washington County).
Approximately 1-2% of county names are misspelled
10% of apartment sizes are deleted
1% of apartment prices are listed in the incorrect units (thousands of dollars, instead of dollars)
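The generation and corruption steps above can be sketched as follows. This is an illustrative reconstruction, not the script used to build Rents: the county list, populations, and `median_rent` placeholder are made-up stand-ins for the census and housing statistics, the misspelling model (one substituted letter) is an assumption, and 1.5% is used as a representative value for the "approximately 1-2%" misspelling rate.

```python
import random

# Hypothetical stand-ins for the census and housing statistics.
COUNTIES = [  # (county, state, population)
    ("Washington County", "OR", 600_000),
    ("Washington County", "PA", 200_000),
    ("Travis County", "TX", 1_300_000),
]
BASE_RENT = {("Washington County", "OR"): 1400.0,
             ("Washington County", "PA"): 800.0,
             ("Travis County", "TX"): 1200.0}
SIZES = ["studio", "1 bedroom", "2 bedroom", "3 bedroom", "4 bedroom"]
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def median_rent(county, state, size):
    # Placeholder: a made-up median that grows with apartment size.
    return BASE_RENT[(county, state)] * (1 + 0.25 * SIZES.index(size))

def clean_row(rng):
    weights = [pop for _, _, pop in COUNTIES]          # proportional to population
    county, state, _ = rng.choices(COUNTIES, weights=weights)[0]
    size = rng.choice(SIZES)                           # uniform over sizes
    mean = median_rent(county, state, size)
    return {"county": county, "state": state, "size": size,
            "rent": rng.gauss(mean, 0.10 * mean)}      # std is 10% of the mean

def misspell(name, rng):
    # Assumed typo model: substitute one character with a different letter.
    i = rng.randrange(len(name))
    replacement = rng.choice([c for c in LETTERS if c != name[i]])
    return name[:i] + replacement + name[i + 1:]

def dirty_row(row, rng):
    row = dict(row)
    if rng.random() < 0.10:
        row["state"] = None                            # delete state name
    if rng.random() < 0.015:
        row["county"] = misspell(row["county"], rng)   # misspell county
    if rng.random() < 0.10:
        row["size"] = None                             # delete apartment size
    if rng.random() < 0.01:
        row["rent"] = row["rent"] / 1000               # thousands of dollars
    return row

rng = random.Random(0)
clean = [clean_row(rng) for _ in range(5000)]
dirty = [dirty_row(r, rng) for r in clean]
```

Note that the repeated "Washington County" entries make deleted states genuinely ambiguous, which is the property the benchmark is designed to exercise.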
b.2 Description of State-of-the-Art Data-Cleaning Systems
HoloClean is a data-cleaning system that compiles user-provided integrity constraints and, when available, external ground truth into a factor graph with learned weights [Rekatsinas2017]. These integrity constraints describe cells that should match, conditional on the agreement of other fields, e.g. if the zip codes of two rows match, the states in those two rows should match. These constraints can also be made with respect to external data (e.g. if a row’s zip code in the table matches a zip code in a gazetteer, the row’s state should match the corresponding state in the gazetteer).
NADEEF is a data-cleaning system that leverages user-specified cleaning rules [Dallachiesat2013]. NADEEF compiles users’ rules into a weighted MAX-SAT query and runs it through a solver, then uses the results to clean the data. User-specified rules can either be integrity constraints (as in HoloClean) or handcrafted rules. These handcrafted rules take the form of Java classes, in which users write a detect function that takes in a pair of tuples and outputs whether one or more violations have been detected, and if so, over which groupings of cells. The user can also optionally write a repair function that takes in those detected cells and returns a fix. That is, unlike in PClean, user-encoded knowledge explicitly describes how to both detect and repair violations.
To our knowledge, neither system comes with special logic for handling text fields, dates, etc. as distinct from general categorical data.
b.3 Settings for Data-Cleaning Systems
Below, we present the integrity constraints we encoded in both HoloClean and NADEEF, as well as the handcrafted Java rules for NADEEF. The integrity constraints are presented in the form “X determines Y”, which means that for two rows, if all columns in X match, one should expect all columns in Y to also match.
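To make the semantics of such a constraint concrete, the following sketch flags groups of rows that agree on the determining columns but disagree on the determined ones. It only illustrates what a violation is; it is not how HoloClean or NADEEF detect or repair violations, and the function name `fd_violations` and the toy rows are assumptions.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return groups of row indices that agree on every `lhs` column
    but disagree on at least one `rhs` column."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[c] for c in lhs)].append(i)
    violations = []
    for idxs in groups.values():
        rhs_values = {tuple(rows[i][c] for c in rhs) for i in idxs}
        if len(rhs_values) > 1:   # same lhs values, different rhs values
            violations.append(idxs)
    return violations

# Toy rows (made up): the second row has a typo in the city name.
rows = [
    {"zip": "35233", "city": "BIRMINGHAM", "state": "AL"},
    {"zip": "35233", "city": "BIRMINGHxM", "state": "AL"},
    {"zip": "36301", "city": "DOTHAN", "state": "AL"},
]
print(fd_violations(rows, ["zip"], ["city", "state"]))  # [[0, 1]]
```

A constraint of this kind identifies which rows conflict but not which cell is wrong; choosing the repair is left to the cleaning system or to a handcrafted rule.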
For each NADEEF Java rule, we describe the functionality and report the number of lines of code used to encode it (ignoring imports, boilerplate, and parentheses). All integrity constraints and Java rules can also be found in the supplementary code.
On encoding domain knowledge. Data cleaning is of course easier with accurate domain knowledge about the data and the likely errors. This is one reason we developed PClean: to enable generatively encoded domain knowledge to inform a data cleaning system. This does, however, raise the question of how to compare PClean fairly to other data-cleaning systems: if PClean is more accurate only because it encodes more domain knowledge, it would be misleading to claim that PClean is ‘better’ in some absolute sense than an existing system. Our evaluation in Section 4 specifically explains that this is not our intention: we mean only to situate PClean’s accuracy and runtime relative to other data-cleaning systems, using reasonable configurations for those systems.
That said, we tried our best to encode as much helpful domain knowledge as we could into the configurations for HoloClean and NADEEF. Some of the settings below were chosen in response to direct advice from authors of each system; others were based on existing scripts, written by the system authors, for cleaning these benchmark datasets (some of our benchmarks also appeared in the papers presenting these systems). In addition, we tried tweaking these configurations ourselves, and reported the best numbers we could.
It is likely that the approaches that NADEEF and HoloClean take, of using weighted logic and factor graphs, could in principle express richer domain knowledge than our configurations here encode. But to our knowledge, the current systems do not expose these capabilities in easy-to-exploit ways.
Hospital Name determines Phone Number, City, ZIP Code, State, Address, Provider Number, County Name, Hospital Type, and Hospital Owner.
Phone Number determines City, ZIP Code, State, Address1, Provider Number, County Name, Hospital Type, Hospital Owner.
ZIP Code determines City and State.
Measure Code determines Measure Name and Condition.
Measure Code and State together determine State Average.
The State Average field is a concatenation of the Measure Code and State fields. For any row, we raise a violation if the concatenation does not hold over those three cells. We do not provide a repair, since it’s unclear from that row alone which of the three cells is the incorrect one. This took 9 lines of Java code.
Flight number determines both the Scheduled Departure Time and the Actual Departure Time
Flight number determines both the Scheduled Arrival Time and the Actual Arrival Time
For a pair of rows, if both flights have the same flight number, a violation is already raised by the existing integrity constraints if the departure or arrival time does not match. The source corresponding to the flight’s airline tends to be more correct than third-party sources. Therefore, when applicable over a pair of rows, we provided the suggested repair of choosing the time from the website of the airline. This took 52 lines of Java code.
County determines State.
If a state was missing for a rental listing, we suggested that NADEEF choose the repair of the most common state corresponding to a given county (which it would not otherwise do), requiring 48 lines of Java.
Additionally, if a rent was below a certain fixed threshold, the program would flag it as a violation and multiply it by the correct factor for a unit conversion. This second rule required 12 lines of Java.
The National Provider Identifier (NPI) determines the PAC ID and vice versa.
The National Provider Identifier (NPI) determines First Name, Last Name, Medical School Name, and Graduation Year.
The Group Practice ID determines the Organization name.
The Zip Code determines the City and State.
b.4 PClean Programs
In this section, we present the PClean programs we used to clean each benchmark dataset. This is the closest analogue to a ‘configuration’ of an automated data-cleaning system. But rather than encode rules for detecting and repairing errors, PClean programs encode generative models of relational databases and of the process by which they are corrupted, filtered, and joined to yield flat, dirty, denormalized datasets.
The Hospital dataset is modeled with seven classes: Records reflect typo’d attributes of Hospitals and the Measures by which they are evaluated; Hospitals have HospitalTypes and are located in Places; Places belong to County objects; and each Measure is related to some Condition. Typos are modeled as independently introduced for each cell of the dataset. Some fields are modeled as draws from broad priors over strings, whereas others are modeled as categorical draws whose domain is the set of unique observed values in the relevant column (some of which are in fact typos).
Inference hints are used to focus proposals for string_prior choices on the set of strings that have actually been observed in a given column, and also to set a custom subproblem decomposition for the Record class (all other classes use the default decomposition).
The model for Flights uses three classes: each observed Record comes from a TrackingWebsite and is about a Flight:
In the parameter declaration for error_probs, we use the syntax error_probs[_] ~ beta(10, 50) to introduce a collection of parameters; the declared variable becomes a dictionary, and each time it is used with a new index, a new parameter is instantiated. We use this to learn a different error_prob parameter for each tracking website. We could alternatively declare error_prob as an attribute of the TrackingWebsite class. However, PClean’s inference engine uses smarter proposals for declared parameters (taking advantage of conjugacy relationships), so for our experiments, we use the parameter declaration instead. We hope to extend automatic conjugacy detection to all attributes, not just parameters, in the near future.
As in Hospital, we use observed_values to provide inference hints to the broad time_prior; this expresses a belief that the true timestamp for a certain field is likely one of the timestamps that has actually been observed, in the dirty dataset, with the given flight ID.
The program we use for Rents contains two classes: Listings are for apartments in some County:
We model the fact that the rent may be listed in thousands of dollars instead of dollars, as well as the fact that the county name may contain typos. We introduce an artificial field, block, consisting of the first and last letters of the observed (possibly erroneous) County field, and use it to inform an inference hint: we hint that posterior mass for a county’s name concentrates on those strings observed somewhere in the dataset that share a first and last letter with the observed county name for this row. Without this approximation, inference is much slower (but potentially more accurate).
The model for Physicians contains five classes: Records reference Practices and Physicians; each Physician attended some medical School; and each Practice is in a City:
Many columns are not modeled. Similar to Rents, we use a parameter in the Physician class for degree_probs, although it might seem more natural to use an attribute of the School class; the resulting model is the same, but using parameter allows PClean to exploit conjugacy.
b.5 Effect of Additional Domain Knowledge
The quality of PClean’s inference depends on the PClean program one uses to model the data. To demonstrate this, we apply four different PClean programs to Flights. In our baseline (16 lines of code), we assume all sources are equally reliable and achieve an F1 score of 0.56. By additionally modeling the timestamp format, we achieve an F1 of 0.60. If we program PClean to learn a per-source reliability (one extra line of code), F1 climbs to 0.69. Finally, if we tell our program that the airline’s own website is likely to be the most reliable source for a given flight (one additional line of code, for a total of 18), F1 jumps to 0.90. PClean is a language, not an automated cleaning system, and accuracy depends on encoding good domain knowledge into a reasonable generative model. Our experience modeling Flights and other datasets, however, suggests that the amount of domain knowledge necessary to improve results is reasonable and may not be too onerous to encode for many data cleaning problems.
We also implemented a user-defined cleaning rule in NADEEF, manually specifying a repair procedure for flight times that searched for a reported time from the flight’s airline, and used that if available. This rule enabled NADEEF to clean the Flights data, but required 52 lines of Java (beyond the boilerplate required for every NADEEF rule). Furthermore, as Table 1 of the paper shows, even encoding manual Java rules is, for some datasets, not enough to yield accurate cleaning.
Appendix C Additional Model Details
c.1 Discrete Random Measure representation
Our non-parametric structure prior is described in Section 2 of the paper in terms of the two-parameter Chinese Restaurant Process. It is also possible to represent the generative process encoded by a PClean program using the Pitman-Yor process:
We process classes one at a time, in topological order. For each latent class, we (1) generate class-wide hyperparameters from their corresponding hyperpriors, and (2) generate an infinite weighted collection of objects of that class. In this setting, an object of a class is an assignment of each of the class’s attributes to a value and of each of its reference slots to an object of the slot’s target class. An infinite collection of latent objects is generated via a Pitman-Yor Process [Teh2011], parameterized by the class’s strength and discount parameters and by a base distribution over objects.
The Pitman-Yor Process is a discrete random measure that generalizes the Dirichlet Process. It can be understood as first sampling an infinite vector of probabilities from a two-parameter GEM distribution, then forming the weighted collection that places the i-th probability on the i-th object, where each of the infinitely many objects is drawn independently from the base distribution. The base distribution is itself a distribution over objects, which first samples reference slots and then attributes.
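For reference, the standard stick-breaking form of this construction can be written as below. The symbols are notation chosen here for illustration and may differ from the paper’s: $s$ is the strength, $d$ the discount, $H$ the base distribution over objects, $\rho$ the weight vector, $r_k$ the $k$-th object, and $G$ the resulting random measure.

```latex
\begin{aligned}
V_k &\sim \mathrm{Beta}(1 - d,\; s + k d), \qquad k = 1, 2, \ldots \\
\rho_k &= V_k \prod_{j < k} (1 - V_j) \qquad \text{(so that } \rho \sim \mathrm{GEM}(s, d)\text{)} \\
r_k &\overset{\text{iid}}{\sim} H \\
G &= \sum_{k=1}^{\infty} \rho_k \, \delta_{r_k}
\end{aligned}
```

Setting $d = 0$ recovers the Dirichlet Process with concentration $s$.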
To generate the objects of the observation class, which will be translated by the program’s query into the flat dataset D, we sample objects from its prior distribution, then, for each sampled object, generate the corresponding observed entry of D.
c.2 Description of primitive distributions
Our models for particular datasets make use of PClean’s built-in probability distributions, which include not just the common distributions for categorical and numerical data, but also several domain-specific distributions useful for modeling strings and random errors. We briefly summarize several of PClean’s built-in distributions here, before showing how to compose them into short PClean programs:
string_prior(min, max) encodes a prior over strings between min and max characters long. The length is uniformly distributed within that range, and characters follow a Markov model based on relative character bigram frequencies in English.
typos(str) is a distribution over strings centered at str. The generative process it represents is to sample a number of typos from a negative binomial distribution whose number-of-trials parameter depends on the length of str. That many typos (random insertions, deletions, substitutions, or transpositions) are then performed. The likelihood is computed approximately using dynamic programming.
maybe_swap(x, ys, p) returns the true value x with probability 1 - p, but chooses a replacement uniformly from ys otherwise.
transformed_normal(mean, std, bijection) samples a real number from a Gaussian distribution with the given mean and standard deviation, but then applies a transformation (the bijection). We use this distribution to model unit errors.
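As a rough illustration of the sampling behavior of the last two distributions, the sketch below mirrors their descriptions in plain Python. These are not PClean’s implementations (which also compute likelihoods): the explicit `rng` argument is added here for reproducibility, and `p` is assumed to be the probability that a swap occurs.

```python
import random

def maybe_swap(x, ys, p, rng):
    """Return x with probability 1 - p; otherwise a uniform draw from ys."""
    return rng.choice(ys) if rng.random() < p else x

def transformed_normal(mean, std, bijection, rng):
    """Sample from Normal(mean, std), then apply the given transformation."""
    return bijection(rng.gauss(mean, std))

rng = random.Random(0)

# A reported time that is replaced by some other candidate 20% of the time.
reported = maybe_swap("10:15", ["10:45", "11:00"], 0.2, rng)

# A rent near $1500 that has been recorded in thousands of dollars.
rent_in_thousands = transformed_normal(1500.0, 150.0, lambda x: x / 1000, rng)
```

Modeling a unit error as a transformed draw, rather than as unrelated noise, lets inference recover the underlying dollar amount by inverting the transformation.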
c.3 Discussion of expressiveness of PClean
PClean imposes restrictions relative to universal PPLs, which helped us to develop an inference algorithm that, for many PClean programs, produces results quickly and scales to large datasets. In this section, we discuss these restrictions and their implications for cleaning dirty data using PClean.
Our non-parametric prior vs. explicit user-specified priors over number of objects and link structure. A primary difference between general-purpose open-universe languages, like BLOG, and PClean’s modeling language is that PClean does not give the user control over the prior distribution over the number of objects of each class, or which objects of particular classes are related to one another.
Of course, there are exceptions. As an interesting example, consider the Hospital dataset: if we knew the population of each city, we may have been able to specify accurate priors over the number of distinct hospitals in each city, allowing us to resolve co-reference questions differently in small cities (where it is more likely that two hospitals reported with similar names are in fact the same hospital) and large cities (where it may be more plausible that two hospitals exist with very similar names). However, this factor is likely to be decisive only in high-uncertainty regimes (where the data entries themselves do not help much to resolve the co-reference question), and it is unclear whether a data-cleaning system should trust such high-uncertainty answers (vs. reporting ‘I don’t know’—see Appendix D.5). If the use case is such that it is desirable to represent such priors, similar logic might be encoded in PClean by creating two different classes for hospitals in large and small cities, and allowing their strength and discount parameters to vary independently.
On schemas with cyclic vs. acyclic class dependency graphs. PClean requires that the schema of the latent database have an acyclic class dependency graph: there cannot be a chain of reference slots that leads from a class back to itself. Although, generally speaking, many relational modeling and inference tasks may be well-served by cyclic class dependencies, we found during literature review that none of the benchmark data-cleaning problems in [Abedjan2016, Dallachiesat2013, Rekatsinas2017, Heidari2019, Hu2012, Mahdavi2019] were naturally modeled using cyclic class dependencies. In addition, [Pasula2003, Milch2006], who use BLOG for deduplication, do not use its support for reference cycles. There are, of course, some tasks for which cyclic references may be a natural fit, e.g. denoising genealogical data, where we may want to model that people have parents, who are other people, with many attributes inherited from one’s ancestors. One could still model such datasets using coarser PClean models, e.g., by clustering people into families without modeling parent/child relationships explicitly. More generally, when we wish to model objects of the same class (e.g. Person) as related via some chain of reference slots, we can often instead introduce an additional class (e.g. Family), and model any related objects of the original class as referring to a shared object of the additional class.
Appendix D Additional Inference Details
d.1 Object-wise rejuvenation moves
In sequential Monte Carlo, rejuvenation moves are transition kernels that preserve the current target distribution, similar to the kernels used in Markov chain Monte Carlo algorithms. But we do not run them until convergence, instead using them to “rejuvenate” past decisions within SMC, in light of new data.
Any valid MCMC kernel for our model is also a valid rejuvenation kernel, and in particular, Gibbs kernels (which update a single variable in the latent state according to its full conditional distribution, keeping the rest of the state fixed) are a natural choice. However, variables in a model are often correlated, and it can be difficult to escape local modes by updating them one at a time. PClean uses object-wise blocked rejuvenation to address this challenge. Object-wise rejuvenation moves update all attributes and reference slots of a single object in the latent database instance. In doing so, these moves may also lead to the “garbage collection” of objects that are no longer connected to the observed dataset, or to the insertion of new objects as targets of the updated object’s reference slots.
Let $r$ be any object in a relational database instance $R$, and let $D$ be the observed dataset. Then we define $R_{-r}$, $D_{-r}$, $R_r$, $\Delta^R_r$, and $\Delta^D_r$ as follows:
$R_{-r}$ is the partial instance obtained by erasing from $R$: (1) all attribute values and reference slot assignments for the object $r$; (2) all attribute values of objects that depend on $r$; and (3) any objects only accessible from the observation objects via slot chains that pass through $r$;
$D_{-r}$ is the partial dataset obtained from $D$ by erasing any attribute values whose distributions depend on values no longer specified within $R_{-r}$;
$R_r$ is the partial instance specifying: (1) all attribute values and reference slot assignments for the object $r$; and (2) all objects not in $R_{-r}$ (accessible from the observation objects only via slot chains that pass through $r$), along with their attributes and reference slots;
$\Delta^R_r$ is the partial instance assigning values to all object attributes that depend on $r$’s attributes or reference slots as parents; and
$\Delta^D_r$ is the partial dataset assigning any attributes of observation objects that depend on $r$’s attributes or reference slots as parents.
The model density then factorizes as:
A blocked Gibbs sweep loops through each object and updates it:
Because resimulating an object may delete objects from classes that are reachable from its class via reference slots, we perform this sweep in reverse topological order, starting with the objects that have no reference slots, and working our way up to the observation objects. If computing the blocked Gibbs distribution is intractable, then we can further divide the update according to user-specified subproblem decompositions for the object’s class, as discussed in Section 3.3 of the paper. As the user subproblems get smaller in size, the algorithm approaches ordinary one-variable-at-a-time Gibbs sampling; thus, choosing subproblems is a simple way that users can trade off between runtime and accuracy, based both on the needs of their application and the specific properties of their models or datasets.
Our rejuvenation kernels are compiled using PClean’s proposal compiler, and as such, also benefit from (1) efficient enumeration strategies that take advantage of conditional independence in the variables being updated, and (2) user-specified ‘preferred values’ inference hints (see Section 3.3). The paper’s Algorithm 1 can be adapted for rejuvenation by adding observed variables to the Bayesian network for each attribute value that, given the current link structure, depends on a latent variable being updated. Some of these variables may be constrained by the observed dataset; this will depend on the patterns of missingness in the observations that, under the current link structure, are connected in some way to the object being updated. PClean recognizes when these patterns of missingness change (due to link structure changing), and compiles new proposals as necessary.
d.2 Continuous variables and parameters
PClean allows users to include continuous variables in their models, either as parameters or attributes in class declarations. To handle these, we augment the inference algorithm in three additional ways:
(1) Gibbs rejuvenation for parameter values. Continuous parameters are updated during SMC via separate Gibbs rejuvenation moves. PClean recognizes certain conjugate relationships between parameter hyperpriors and the attribute statements that use the parameters (e.g., Normal/Normal, Beta/Bernoulli, and Dirichlet/Categorical), and automatically exploits these for efficient rejuvenation moves informed by all the relevant data. The inference engine tracks the relevant sufficient statistics as inference progresses, so these updates need not perform costly counts or summations.
(2) Mixing with the prior for proposals of continuous attributes. Continuous attributes are handled as though they are discrete variables with ‘preferred values’ set to the empty set. The effect of this is that the locally optimal proposal for discrete variables is first derived without regard for the latent continuous attributes being proposed as part of the same subproblem (meaning that any likelihoods that depend on latent continuous attributes are not included during enumeration); then, once discrete values have been sampled, continuous values are sampled from their prior CPDs given any of their parent values (which may have been more intelligently proposed).
(3) Particle Gibbs object-wise rejuvenation. Because the proposals generated by technique (2) for continuous variables may be poor, Metropolis-Hastings may often reject. To improve chances of acceptance, users can enable Particle Gibbs rejuvenation, which, in order to propose an update to an object, runs conditional SMC on the sequence of user-defined subproblems within the object’s class. Using Particle Gibbs, PClean can compensate for poorer proposals by sampling many weighted particles for each subproblem, which are combined into a joint proposal for the object. Note that without continuous variables, Metropolis-Hastings is generally preferred.
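The conjugate parameter updates mentioned under Gibbs rejuvenation for parameter values can be illustrated with the Beta/Bernoulli case. The sketch below is an assumption-laden illustration rather than PClean’s implementation: the class name `BetaBernoulliParam` and its methods are hypothetical, and the Beta(10, 50) prior is borrowed from the Flights example in Appendix B.4.

```python
import random

class BetaBernoulliParam:
    """A Bernoulli probability with a Beta prior, tracked via sufficient statistics."""

    def __init__(self, alpha, beta):
        self.alpha, self.beta = alpha, beta
        self.successes = 0   # number of True outcomes currently attributed
        self.failures = 0    # number of False outcomes currently attributed

    def observe(self, outcome):
        # Called when inference attributes a new Bernoulli outcome to this parameter.
        if outcome:
            self.successes += 1
        else:
            self.failures += 1

    def unobserve(self, outcome):
        # Called when a rejuvenation move retracts a previously counted outcome.
        if outcome:
            self.successes -= 1
        else:
            self.failures -= 1

    def posterior_params(self):
        return (self.alpha + self.successes, self.beta + self.failures)

    def gibbs_update(self, rng):
        # Exact draw from the full conditional; no pass over the data is needed.
        a, b = self.posterior_params()
        return rng.betavariate(a, b)

error_prob = BetaBernoulliParam(10, 50)
for was_error in [True] * 3 + [False] * 7:
    error_prob.observe(was_error)
new_value = error_prob.gibbs_update(random.Random(0))
```

Keeping the counts up to date incrementally is what makes each update cost independent of the number of records that use the parameter.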
d.3 Optimality conditions for proposal compiler
The proposal compiler produces smart proposals by efficiently enumerating discrete variables (exploiting conditional independence) and computing only those likelihood terms that are necessary for a particular SMC or MCMC update. When all latent variables within a subproblem have finite discrete domains, and no variables have preferred values hints specified, the proposals PClean produces are locally optimal SMC proposals, as defined in [naesseth2019elements], or, for MCMC, exact blocked Gibbs rejuvenation kernels. However, introducing preferred-values hints that do not completely cover the posterior mass, or using continuous attributes within the subproblem, will lead to suboptimal (but faster-to-compute) proposals.
d.4 Observation hashing
Preferred values hints can help to limit the number of possibilities enumeration must consider for attribute values, but reference slots can also pose a problem: as the sequential Monte Carlo algorithm progresses, the latent database fills up with objects that could serve as possible targets, and considering each of them can be expensive.
In many models, however, the value of a reference slot is highly constrained by observations in . Consider an object of class with reference slot , and let be the set of slot chains connecting observation objects to objects of class . Given a query map , we can check if there exist any observed attributes that maps to a slot chain beginning . For each , let . Then the only objects of the target class that can possibly point to are
PClean can maintain, for each class, an index that maps values to the sets of objects that are constrained to have those values. PClean also maintains back-pointers from objects to the observation objects that reference them, and stores with each object the observed attribute values that constrain it. This allows PClean to compute the set of legal target objects for a given reference slot in time that, for many models, is constant in the number of latent objects. (Indexing does require additional memory. Users can optionally control which values are indexed on by including statements within class declarations.) Of course, in some models and datasets, the size of the computed set of possible target objects may still be large, necessitating enumeration. But in common cases where the vast majority of possible targets have zero likelihood, this indexing plays a key role in helping PClean to scale to large datasets.
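The indexing idea can be sketched with a dictionary from observed values to sets of object identifiers. This is a simplified illustration, not PClean’s data structure: the class name `ObjectIndex`, the use of integer object ids, and the choice of a single key value per object are assumptions.

```python
from collections import defaultdict

class ObjectIndex:
    """Maps an observed key value to the ids of latent objects constrained to it."""

    def __init__(self):
        self._by_value = defaultdict(set)

    def add(self, obj_id, value):
        self._by_value[value].add(obj_id)

    def remove(self, obj_id, value):
        # Used when an object is garbage-collected or its constraint changes.
        bucket = self._by_value.get(value)
        if bucket is not None:
            bucket.discard(obj_id)
            if not bucket:
                del self._by_value[value]

    def candidates(self, value):
        # Expected cost does not grow with the total number of indexed objects.
        return set(self._by_value.get(value, ()))

# Hypothetical example: latent Flight objects indexed by observed flight number.
index = ObjectIndex()
index.add(1, "AA-1007")
index.add(2, "UA-2312")
index.add(3, "AA-1007")
targets = index.candidates("AA-1007")   # only these need to be enumerated
```

When a lookup returns an empty set, the only remaining option for the reference slot is a newly created object, so no existing objects need to be scored at all.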
d.5 Quantified uncertainty
Because PClean is based on generative models, it is possible to quantify its uncertainty about particular cells. We ran an additional experiment to test the value of this on the Flights dataset. We performed ten independent runs of PClean, and changed a cell in the reconstructed flat table only if at least 70% of the runs agreed on its value. Below, we report accuracy under this metric, vs. the mean accuracy across the ten runs if they are trusted as 100% confident, at each iteration of rejuvenation sweeps following PClean’s SMC:
|Iterations||Uncertainty-Based Analysis F1/Rec./Prec.||Mean Individual F1/Rec./Prec.||# uncertain|
|0||No cells changed||0.001 / 0.002 / 0.001||9504|
|1||0.839 / 0.835 / 0.844||0.804 / 0.852 / 0.761||754|
|2||0.901 / 0.888 / 0.914||0.899 / 0.888 / 0.911||20|
|3||0.901 / 0.884 / 0.918||0.895 / 0.884 / 0.907||72|
|4||0.896 / 0.876 / 0.917||0.895 / 0.884 / 0.907||114|
|5||0.896 / 0.876 / 0.917||0.895 / 0.884 / 0.907||114|
|6||0.895 / 0.876 / 0.915||0.893 / 0.881 / 0.904||87|
|7||0.896 / 0.881 / 0.911||0.894 / 0.882 / 0.905||38|
|8||0.894 / 0.876 / 0.912||0.894 / 0.883 / 0.906||69|
|9||0.896 / 0.879 / 0.912||0.894 / 0.883 / 0.906||52|
We see that taking this uncertainty into account yielded higher precision, especially when inference was stopped early, without greatly compromising recall.
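The agreement rule used in this experiment can be sketched as follows. The function name `consensus_repairs`, the dictionary representation of cells, and the toy values are assumptions made for illustration; here a cell is counted as uncertain when no single value reaches the agreement threshold.

```python
from collections import Counter

def consensus_repairs(dirty, cleaned_runs, threshold=0.7):
    """dirty: dict mapping cell -> observed value.
    cleaned_runs: list of dicts mapping cell -> value proposed by one run.
    A cell takes a run-proposed value only if at least `threshold` of the
    runs agree on it; otherwise it keeps its original (dirty) value."""
    result, uncertain = dict(dirty), []
    n = len(cleaned_runs)
    for cell in dirty:
        value, count = Counter(run[cell] for run in cleaned_runs).most_common(1)[0]
        if count / n >= threshold:
            result[cell] = value
        else:
            uncertain.append(cell)
    return result, uncertain

# Toy example with ten runs over two cells (made-up values).
dirty = {("row1", "dep"): "10:45", ("row2", "dep"): "09:30"}
runs = ([{("row1", "dep"): "10:15", ("row2", "dep"): "09:00"}] * 6
        + [{("row1", "dep"): "10:15", ("row2", "dep"): "09:30"}] * 2
        + [{("row1", "dep"): "10:45", ("row2", "dep"): "09:30"}] * 2)
repaired, uncertain = consensus_repairs(dirty, runs)
```

In this example the first cell is changed (8 of 10 runs agree on "10:15"), while the second is left untouched and reported as uncertain (the most common value has only 6 of 10 votes).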
- For example, correspondence with some HoloClean authors yielded ways to improve HoloClean’s performance beyond previously published results, but did not yield ways for HoloClean to encode all forms of knowledge that PClean scripts can encode.
- See supplement for a discussion of this direction in the context of data cleaning; many datasets with cyclic links among classes (e.g. people who are friends with other people) can be modeled in PClean by introducing additional latent classes.
- However, note that BLOG also has limitations when it comes to expressing priors over link structure. It allows users to specify predicates that the targets of a reference slot must satisfy, and the choice is then assumed to be uniform among all objects satisfying the predicate. Thus, BLOG cannot express that certain objects are more “popular” targets of reference slots than others, an assumption that is built in to PClean’s Pitman-Yor-based model. We also note that by introducing additional classes, PClean can represent more interesting priors over link structure. For example, suppose two reference slots both point to objects of the same target class, and we wish each reference slot to be filled using a different distribution over the objects of that class. We can create a dummy class for each reference slot, each with a single reference slot to the target class. We then have the two original reference slots target their respective dummy classes, instead of directly targeting the original target class. This implements a hierarchical Pitman-Yor process; by analogy with the HDP-LDA topic model, objects of the two dummy classes play the role of words from two different documents, and objects of the target class are the topics.