d-blink: Distributed End-to-End Bayesian Entity Resolution


Neil G. Marchant^a, Rebecca C. Steorts^b, Andee Kaplan^c, Benjamin I. P. Rubinstein^a, Daniel N. Elazar^d
^a School of Computing and Information Systems, University of Melbourne
^b Department of Statistical Science, Duke University
^c Department of Statistics, Colorado State University
^d Methodology Division, Australian Bureau of Statistics
September 16, 2019
Abstract

Entity resolution (ER) (record linkage or de-duplication) is the process of merging together noisy databases, often in the absence of a unique identifier. A major advancement in ER methodology has been the application of Bayesian generative models. Such models provide a natural framework for clustering records to unobserved (latent) entities, while providing exact uncertainty quantification and tight performance bounds. Despite these advancements, existing models do not scale to realistically-sized databases (larger than 1000 records) and they do not incorporate probabilistic blocking. In this paper, we propose “distributed Bayesian linkage” or d-blink—the first scalable and distributed end-to-end Bayesian model for ER, which propagates uncertainty in blocking, matching and merging. We make several novel contributions, including: (i) incorporating probabilistic blocking directly into the model through auxiliary partitions; (ii) support for missing values; (iii) a partially-collapsed Gibbs sampler; and (iv) a novel perturbation sampling algorithm (leveraging the Vose-Alias method) that enables fast updates of the entity attributes. Finally, we conduct experiments on five data sets which show that d-blink can achieve significant efficiency gains—in excess of 300×—when compared to existing non-distributed methods.



Keywords: auxiliary variable, distributed computing, Markov chain Monte Carlo, partially-collapsed Gibbs sampling, record linkage

1 Introduction

Entity resolution (ER) (record linkage and de-duplication) is the process of identifying and merging records across data sources that refer to the same underlying entities. In the absence of unique identifiers (e.g. social security numbers), entities are resolved by searching for clusters of records that have sufficiently similar attributes. This can be a challenging task due to disparities between data sources, poor data quality and temporal variation.

Much of the ER literature advocates a pipeline approach (christen_data_2012; dong_big_2015) comprising four main stages. In the first stage, known as attribute/schema alignment, a set of common attributes are identified among the data sets. In the second stage, known as blocking or indexing, the set of all record pairs is pruned to yield a smaller set of candidate matches. This set is refined in the third stage, known as matching or scoring, typically by thresholding a weighted sum of pairwise attribute similarities. Finally, in the fourth stage, known as merging or fusion, the matching record pairs are mapped to consistent clusters, which are then merged to produce single representative records.

While a pipeline approach can be flexible and efficient at scale, it suffers from a major deficiency: uncertainties encountered at each stage are not propagated through the pipeline, nor are they included in the final output. As a result, an incorrect decision at an earlier stage (such as a poor blocking design) cannot be corrected at later stages. In short, information is lost and accuracy may suffer. Furthermore, analysts are unable to incorporate ER uncertainty in a post-ER analysis, e.g. an inferential or prediction task.

Bayesian modeling offers a promising solution to the problem of rigorous uncertainty propagation. Indeed, Bayesian modeling for ER has been an active area of research in recent years (copas_record_1990; belin_1995; larsen_2001; tancredi_hierarchical_2011; zhao_bayesian_2012; gutman_bayesian_2013; sadinle_detecting_2014; steorts_entity_2015; steorts_bayesian_2016; zanella_flexible_2016; sadinle_bayesian_2017). In addition to providing a principled treatment of uncertainty, the Bayesian paradigm allows for flexible modeling assumptions, including the incorporation of prior information. The latter can provide a regularizing effect which is useful when little or no training data is available. Recent work has also shown that theoretical performance bounds can be obtained for Bayesian ER models (steorts_performance_2017), whereas counterparts for rule-based and discriminative methods seem implausible.

Despite the successes of Bayesian models for ER, we are not aware of any models that scale to realistically-sized databases (larger than 1000 records). In conventional ER pipelines, scalability is addressed at the blocking/indexing stage, where a large fraction of the possible record comparisons are eliminated (christen_data_2012). “Traditional blocking” (steorts_comparison_2014) has been applied in an ad-hoc fashion to Bayesian ER models (steorts_bayesian_2016), however it does not preserve the posterior distribution, nor does it propagate uncertainty between the blocking and matching stages.

In this paper, we propose a scalable and distributed extension to the blink model for end-to-end Bayesian ER (steorts_entity_2015), which we call “distributed blink” or d-blink. Our key insight is an auxiliary variable representation that induces a partitioning over the entities and records, while preserving the marginal posterior. The partitions play a similar role as traditional blocks, however the assignment of entities and records to partitions is probabilistic and is inferred jointly with the other variables. This representation enables distributed inference at the partition level, as variables in distinct partitions become conditionally independent. The fact that this is possible gives the blink model a clear advantage in terms of scalability. It is unclear whether a similar approach can be adapted to other Bayesian ER models with more complex (e.g. non-parametric) linkage priors.

To construct the partitions, we propose a method based on k-d trees which co-partitions similar entities, while achieving well-balanced partitions. We make several other novel contributions including support for record values missing completely at random; a partially-collapsed Gibbs sampler that balances constraints imposed by distributed computing; and insights into improving computational efficiency. The latter insights include: (i) a sub-quadratic algorithm for updating links based on indexing; (ii) a truncation approximation for the attribute similarities; and (iii) a novel perturbation sampling algorithm for updating the entity attributes, which leverages the Vose-Alias method (vose_linear_1991).

We implement our proposed methodology as an open-source Apache Spark package and conduct empirical evaluations on two synthetic and three real data sets. One of the synthetic data sets—RLdata10000—mimics administrative data with distortion and duplication and has been commonly used as a benchmark in the record linkage literature. The second synthetic data set (ABSEmployee) is motivated by real linkages of employment survey data undertaken at the Australian Bureau of Statistics (ABS). It is not derived from real data due to strict confidentiality restrictions. We include this data set in the supplementary material so that others may use it for evaluation of data linkage methods.

Additionally, we apply our methodology to three real applications of entity resolution, each of which is an important and challenging problem. First we link two snapshots of the North Carolina Voter Registration (NCVR) data set. It is necessary to identify duplicates within and across snapshots before computing statistics or performing prediction or inferential tasks. The second application is from the medical field, where it is common for records to be duplicated across locally-managed databases. Specifically, we examine de-identified person data from a longitudinal study called the National Long Term Care Survey (NLTCS). There are duplicates across all three waves of the study, which must be removed before any analysis can be performed. The third data set is derived from the 2008 and 2010 waves of the Survey of Household Income and Wealth (SHIW). It is a de-identified categorical data set and duplicates must be removed to draw valid statistical conclusions. We choose to work with three very different real applications to understand the scalability and accuracy of our methods, in cases where ground truth is known.

Our computational and sampling improvements can yield efficiency gains in excess of 300× compared to blink, and we observe near-linear gains with increasing numbers of partitions. Crucially, our results indicate that these efficiency dividends do not come at the cost of accuracy.

2 Related work

The related work spans three key areas: ER methodology, inference for Bayesian ER models, and distributed MCMC.

Entity resolution methodology.

Matching pairs of tables was first addressed with a probabilistic approach by newcombe_automatic_1959. fellegi_theory_1969 later formalized this approach within the framework of decision theory. Many extensions to the Fellegi-Sunter (FS) model have been proposed (for surveys, see winkler_overview_2006; winkler2014matching), including a recent extension to multiple databases (sadinle_generalized_2013). However, the FS model has been criticized due to its lack of support for duplicates within databases; misspecified independence assumptions; and its dependence on subjective thresholds (tancredi_hierarchical_2011). These limitations have prompted development of more sophisticated Bayesian generative models, including models for bipartite matching (copas_record_1990; belin_1995; larsen_2001; tancredi_hierarchical_2011; gutman_bayesian_2013; sadinle_bayesian_2017), de-duplication (sadinle_detecting_2014; steorts_entity_2015; steorts_bayesian_2016) and matching across multiple databases (steorts_entity_2015; steorts_bayesian_2016). All of these models operate on structured data, and most (copas_record_1990; belin_1995; larsen_2001; tancredi_hierarchical_2011; gutman_bayesian_2013; steorts_bayesian_2016) rely solely on binary attribute-level comparisons to inform matching. Recent models (steorts_entity_2015; sadinle_detecting_2014; sadinle_bayesian_2017) additionally utilize attribute-level similarity scores.

Apart from these advances in Bayesian approaches to matching (largely undertaken in statistics), there have been an abundance of contributions from the database and machine learning communities (see surveys by getoor_entity_2012; christen_data_2012). Their focus has typically been on rule-based approaches (fan_reasoning_2009; singh_generating_2017), supervised learning approaches (mudgal_deep_2018), hybrid human-machine approaches (wang_crowder:_2012; gokhale_corleone:_2014), and scalability (papadakis_comparative_2016). Broadly speaking, all of these approaches rely on either humans in-the-loop or large amounts of labelled training data, which is not generally the case in the Bayesian setting.

Inference for Bayesian ER models.

Most prior work on Bayesian generative models for ER (e.g. tancredi_hierarchical_2011; gutman_bayesian_2013; steorts_entity_2015) has relied on Gibbs sampling for inference. Compared to other Markov chain Monte Carlo (MCMC) algorithms, Gibbs sampling is relatively easy to implement, however it may suffer from slow convergence and poor mixing owing to its highly local moves (liu_monte_2004).

In the broader context of clustering models, the split-merge algorithm (jain_split-merge_2004) has been proposed as an alternative to Gibbs sampling. It traverses the space of clusterings through proposals that split single clusters or merge pairs of clusters, and is thus less susceptible to becoming trapped in local modes. steorts_bayesian_2016 applied this algorithm to an ER model similar to blink. However, it remains unclear whether the split-merge algorithm is efficient for ER problems, where the number of clusters is expected to grow almost linearly in the number of records (zanella_flexible_2016).

More recently, zanella_flexible_2016 proposed the chaperones algorithm for inference in microclustering models. It is similar in spirit to the split-merge algorithm, however it is based on restricted Gibbs sampling rather than the Metropolis-Hastings framework. It “focuses on cluster reassignments with higher probabilities” by maintaining a biased distribution over the space of record pairs. However, its scalability is likely to be limited as a result.

In this work, we apply partially-collapsed Gibbs sampling (dyk_partially_2008), as we expect it will be more efficient for ER models than the aforementioned methods, while still yielding an improvement over regular Gibbs sampling.

Parallel/distributed MCMC.

Recent literature has focused on using parallel and distributed computing to scale up MCMC algorithms, where applications have included Bayesian topic models (newman_distributed_2009; smola_architecture_2010; ahn_distributed_2014) and mixture models (williamson_parallel_2013; chang_parallel_2013; lovell_clustercluster:_2013; ge_distributed_2015). We review the application to mixture models, as they are conceptually similar to ER models.

Existing work has concentrated on Dirichlet process (DP) mixture models and Hierarchical DP mixture models. The key to enabling distributed inference for these models is the realization that a DP mixture model can be reparameterized as a mixture of DPs. Put simply, the reparameterized model induces a partitioning of the clusters, such that clusters assigned to distinct partitions are conditionally independent. As a result, variables within partitions can be updated in parallel.

williamson_parallel_2013 exploited this idea at the thread level to parallelize inference for a DP mixture model. chang_parallel_2013 followed a similar approach, but included an additional level of parallelization within partitions using a parallelized version of the split-merge algorithm. Others (lovell_clustercluster:_2013; ge_distributed_2015) have developed distributed implementations in the Map-Reduce framework.

We do not consider DP mixture models in our work, as their behaviour is ill-suited for ER applications. (With a DP prior, the number of clusters grows logarithmically in the number of records, but empirical observations call for near-linear growth; zanella_flexible_2016.) However we do borrow the reparameterization idea, albeit with a more flexible partition specification which permits similar entities to be co-partitioned and facilitates load balancing. It would be interesting to see whether similar ideas can be applied to microclustering models (zanella_flexible_2016), however preserving the marginal posterior distribution seems challenging in this case.

3 A scalable model for Bayesian ER

We now present our extension to the blink model for Bayesian ER (steorts_entity_2015), which incorporates auxiliary partitions, support for missing values, and generic attribute similarity functions. We describe notation and assumptions in Section 3.1, before outlining the generative process in Section 3.2. Attribute similarity measures are defined in Section 3.3, including a truncation approximation which improves scalability. In Section 3.4, we demonstrate that the marginal posterior of d-blink is equivalent to blink under certain conditions. Finally, in Section 3.5 we provide intuition for the auxiliary partition representation.

3.1 Notation and assumptions

Consider a collection of tables (databases) indexed by , each with records (rows) indexed by and aligned attributes (columns) indexed by . (We define a table as an ordered, indexed collection of records, which may contain duplicates, i.e. records for which all attributes are identical.) We assume each record links to a single entity, denoted by , from a fixed population of entities of size indexed by . The value of the -th attribute for record is denoted by , and is assumed to be a noisy observation of the linked entity’s true attribute value . We allow for the fact that some attributes may be missing completely at random through a corresponding indicator variable  (little_statistical_2002, p. 12).

Table 1 summarizes our notation, including model-specific variables which will be introduced shortly. We adopt the following rules to compactly refer to sets of variables:

  • A boldface lower-case variable denotes the set of all attributes: e.g. .

  • A boldface capital variable denotes the set of all index combinations: e.g. .

We also define notation to separate the record attributes into an observed part (those ’s for which ) and a missing part (those ’s for which ).

After specifying a model, our goal is to perform end-to-end ER, by inferring the joint posterior distribution over the links and true entity attribute values . We condition only on the observed record attribute values . In other words, we operate in a fully unsupervised setting, assuming no ground truth data is available for the links or entities.

Note that inferring is equivalent to the matching stage of the ER pipeline, while inferring is equivalent to the merging stage. By inferring and jointly, we are able to propagate uncertainty between the two stages. We’ll later show how to incorporate probabilistic blocking, while allowing full propagation of uncertainty.

3.2 Model specification

We now describe the generative process for our proposed model d-blink. A plate diagram is depicted in Figure 1, with key differences from blink highlighted in a dashed blue line style.

Figure 1: Plate diagram for d-blink. Extensions to blink are highlighted in a dashed blue line style. Circular nodes represent random variables; square nodes represent deterministic variables; (un)shaded nodes represent (un)observed variables; arrows represent conditional dependence; and plates represent replication over an index.
Symbol Description Symbol Description
index over tables assigned partition for record in table
index over records in table assigned entity for record in table
index over entities prob. attribute in table is distorted
index over partitions distortion hyperparams. for attribute
index over attributes prob. attribute in table is observed
index over domain of attribute domain of attribute
total number of records distribution over domain of attribute
attribute for record in table similarity measure for attribute
distortion indicator for set of records assigned to entity
observed indicator for set of entities assigned to partition
attribute for entity partition assignment function
Table 1: Summary of notation.
Entities.

Each entity in the population (of fixed size ) is described by “true” attributes. The value of attribute for entity is denoted by and is assumed to be drawn independently from the domain of the attribute according to the empirical distribution (derived from the tables) :

(1)
Partitions.

d-blink deviates from blink by splitting the entities into partitions according to their attribute values. (In the database and distributed computing communities, a partition may refer to a part of a divided object; we use the term in this sense, which differs from the usage in set theory.) The assignment of entities to partitions is achieved through a deterministic partition function:

(2)

where is the Kronecker product. Further details are given in Section 4, along with an example based on -d trees. We also introduce the notation , which allows us to concisely refer to the set of entities assigned to partition .

Distortion.

Associated with each table and attribute is a distortion probability , with assumed prior distribution:

(3)

where and are hyperparameters. We provide recommendations for setting and in Appendix F. The distortion probabilities feed into the record-generation process below.

Records.

We assume the records are generated by selecting an entity uniformly at random and copying the entity’s attributes subject to distortion. The process for generating record in table is outlined below. Steps 1, 2 and 5 deviate from blink.

  1. Choose a partition assignment at random in proportion to the partition sizes:

    (4)
  2. Choose an entity assignment uniformly at random from partition :

    (5)
  3. For each attribute , draw a distortion indicator :

    (6)
  4. For each attribute , draw a record value :

    (7)

    where represents a point mass. If , is copied directly from the entity. Otherwise, is drawn from the domain according to the distortion distribution . In the literature, this is known as a hit-miss model (copas_record_1990).

  5. For each attribute , draw an observed indicator :

    (8)

    If , is observed, otherwise it is missing.

Detail on the distortion distribution.

chooses a distorted value for attribute conditional on the true value . In our parameterization of the model, it is defined as

(9)

where is a normalization constant and is the similarity measure for attribute  (see Section 3.3). Intuitively, this distribution chooses values in proportion to their empirical frequency, while placing more weight on those that are “similar” to . This reflects the notion that distorted values are likely to be close to the truth, as is the case when modeling typographical errors.
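To make the hit-miss generative process concrete, the following Python sketch simulates steps 3 to 5 for a single record given its linked entity. All names (generate_record, freqs, missing_prob, and so on) are illustrative, and the multiplicative weighting "empirical frequency times (1 + similarity)" in the distortion distribution is an assumed stand-in for the exact form in Equation (9); the actual implementation is part of the Apache Spark package described in Section 7.

```python
import random

def generate_record(entity, theta, domains, freqs, sim, missing_prob):
    """Hit-miss generation of one record from its linked entity (steps 3-5).

    theta[a]: distortion probability for attribute a; freqs[a]: empirical
    distribution over domains[a]; sim[a]: attribute similarity measure;
    missing_prob[a]: probability the attribute is unobserved. The distortion
    distribution below uses an assumed weighting freq * (1 + similarity);
    the paper's exact form is Equation (9).
    """
    record = {}
    for a, true_value in entity.items():
        distorted = random.random() < theta[a]                 # distortion indicator
        if not distorted:
            value = true_value                                  # copy from the entity
        else:
            weights = [freqs[a][v] * (1.0 + sim[a](v, true_value))
                       for v in domains[a]]
            value = random.choices(domains[a], weights=weights)[0]
        observed = random.random() < 1.0 - missing_prob[a]      # observed indicator
        record[a] = value if observed else None                 # None marks a missing value
    return record

# toy usage
domains = {"surname": ["smith", "smyth", "jones"]}
freqs = {"surname": {"smith": 0.5, "smyth": 0.1, "jones": 0.4}}
sim = {"surname": lambda v, w: 1.0 if v == w else 0.0}          # stand-in similarity
print(generate_record({"surname": "smith"}, {"surname": 0.05},
                      domains, freqs, sim, {"surname": 0.1}))
```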

3.3 Attribute similarity measures

We now discuss the attribute similarity measures that appear in the distortion distribution of Equation (9). The purpose of these measures is to quantify the propensity that some value in the attribute domain is chosen as a distorted alternative to the true value .

Definition (Attribute similarity measure).

Let be the domain of an attribute. An attribute similarity measure on is a function that satisfies and for all .

Note that our parameterization in terms of attribute similarity measures differs from blink, which uses distance measures. This allows us to make use of a more efficient sampling method, as described in Section 6.3. The next proposition states that the two parameterizations are in fact equivalent, so long as the distance measure is bounded and symmetric (a proof is provided in Appendix B.1).

Proposition 1.

Let be the attribute distance measure that appears in blink, and assume that and for all . Define the corresponding attribute similarity measure for d-blink as

(10)

Then the parameterization of used in d-blink is equivalent to blink.

In this paper, we restrict our attention to the following similarity measures for simplicity:

  • Constant similarity measure. This measure is appropriate for categorical attributes, where there is no reason to believe one value is more likely than any other as a distortion to the true value . Without loss of generality, it may be defined as for all .

  • Normalized edit similarity measure. This measure is based on the edit distance metric, and is suitable for modeling distortion in generic string attributes. Following yujian_normalized_2007, we define a normalized edit distance metric,

where denotes the regular edit distance and denotes the length of string . Note that alternative definitions of the normalized edit distance could be used (see references in yujian_normalized_2007), however the above definition is unique in that it yields a proper metric. Since the normalized edit distance is bounded on the interval [0, 1], we can define a corresponding normalized edit similarity measure:

    (11)

Ideally, one should select attribute similarity measures based on the data at hand. There are many possibilities to consider, such as Jaccard similarity, numeric similarity measures (lesot_similarity_2008) and other domain-specific measures (bilenko_adaptive_2003).
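As an illustration, the sketch below implements the two measures above in Python: a constant similarity, and a normalized edit similarity built on the Yujian-Bo normalized edit distance 2d/(|s| + |t| + d), which is our reading of the elided definition. The scaling constant used to obtain the similarity is an assumption standing in for Equation (11); the function names are ours, not the package's API.

```python
def constant_similarity(v, w):
    """Constant similarity: every value is equally plausible as a distorted
    alternative to the true value (suited to categorical attributes)."""
    return 1.0

def edit_distance(s, t):
    """Plain Levenshtein distance via two-row dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # delete cs
                            curr[j - 1] + 1,             # insert ct
                            prev[j - 1] + (cs != ct)))   # substitute
        prev = curr
    return prev[-1]

def normalized_edit_distance(s, t):
    """Yujian-Bo normalized edit distance, bounded on [0, 1]."""
    if not s and not t:
        return 0.0
    d = edit_distance(s, t)
    return 2.0 * d / (len(s) + len(t) + d)

def normalized_edit_similarity(s, t, max_sim=1.0):
    """Similarity derived from the normalized edit distance; the scaling
    constant max_sim is an assumed free parameter (cf. Equation 11)."""
    return max_sim * (1.0 - normalized_edit_distance(s, t))

print(normalized_edit_similarity("smith", "smyth"))   # higher similarity
print(normalized_edit_similarity("smith", "jones"))   # lower similarity
```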

3.4 Posterior distribution and equivalence

Having described the d-blink model, we now consider the joint posterior distribution over the latent (unobserved) parameters. By reading the conditional dependence structure from the plate diagram (Figure 1) and marginalizing over the missing record attributes , one can show that the posterior distribution is of the following form:

(12)

For further detail on the derivation and an expanded form of the posterior, we refer the reader to Appendix A.

The structure of the posterior for d-blink is very similar to the one for blink, apart from the factors for and and the introduction of the “observed” indicator variables (as highlighted in Figure 1). In fact, one can show that d-blink is equivalent to blink in the following sense:

Proposition 2.

Suppose the conditions of Proposition 1 hold and that and for all . Assume furthermore that all record attributes are observed, i.e.  for all . Then the marginal posterior of , , and for d-blink (i.e. marginalized over ) is identical to the posterior for blink.

A proof is provided in Appendix B.2.

3.5 Rationale for introducing partitions

We now briefly explain the role of the auxiliary partitions in d-blink. Firstly, we note that without the partitions (), the Markov blanket for includes the attribute values for all of the entities . This presents a major obstacle when it comes to distributing the inference on a compute cluster, as the data is not separable. By incorporating partitions, we restrict the Markov blanket for to include only a subset of the entity attribute values—those in the same partition as record . As a result, it becomes natural to distribute the inference so that each compute node is responsible for a single partition (see Section 5.2 for details). Secondly, we can interpret the partitions as performing a probabilistic blocking in the context of MCMC sampling (introduced in Section 5), which improves computational efficiency. In a given iteration, the possible links for a record are restricted to the entities residing in the same partition. However, unlike conventional blocking, the partition assignments are not fixed—between iterations the entities and linked records may move between partitions.

4 Partition functions

In Section 3.2 we defined a generic partition function (Equation 2) that is responsible for assigning entities to partitions. This function may be regarded as a free parameter, since it has no bearing on the model equivalence established by Proposition 2. However, from a practical perspective it ought to be chosen carefully, as it can impact the efficiency of the inference. We suggest some guidelines for choosing a partition function in Section 4.1, before presenting an example based on k-d trees in Section 4.2.

4.1 Interpretation and guidelines

Recall that the partition function assigns an entity to a partition according to its attributes . Since is unobserved, it must be treated as a random variable over the space of possible attributes . This means the partition function should not be interpreted as partitioning the entities directly. Rather, it should be interpreted as partitioning the space in which the entities reside, while taking the distribution over into account. With this interpretation in mind, we argue that the partition function should ideally satisfy the following properties:

  1. Balanced weight. The partitions should have equal weight (probability mass) under the distribution over . This ensures the expected number of entities assigned to each partition is equal, which results in proper load balancing.

  2. Entity separation. A pair of entities drawn at random from the same partition should have a high degree of similarity, while entities drawn from different partitions should have a low degree of similarity. This improves the likelihood that similar records will end up in the same partition, and allows them to more readily form likely entities.

These properties need not be satisfied strictly: the extent to which they are satisfied is merely expected to improve the efficiency of the inference. For example, satisfying the first property requires knowledge of the marginal posterior distribution over , which is infeasible to calculate. We note that there is likely to be tension between the two properties, so that a balance must be struck between them.

4.2 Example: k-d tree partition function

We now describe a partition function based on k-d trees, which is used in our experiments in Section 7.

Background.

A k-d tree is a binary tree that recursively partitions a k-dimensional affine space (bentley_multidimensional_1975; friedman_algorithm_1977). In the standard set-up, each node of the tree is associated with a data point that implicitly splits the input space into two half-spaces along a particular dimension. Owing to its ability to hierarchically group nearby points, it is commonly used to speed up nearest-neighbor search. This makes a k-d tree a good candidate for a partition function, since it can be balanced while grouping similar points.

Setup.

Our setup differs from a standard k-d tree in several aspects. First, we consider a discrete space (not an affine space), where the “dimensions” are the attributes. Second, we do not store data points in the tree. We only require that the tree implicitly stores the boundaries of the partitions, so that it can assign an arbitrary  to the correct partition (a leaf node). Finally, since we are working in a discrete space, the input space to a node is a countable set. The node must split the input set into two parts based on the values of one of the attributes.

Fitting the tree.

Since it is infeasible to calculate the marginal posterior distribution over exactly, we use the empirical distribution from the tables as an approximation. As a result, we treat the records (tables) as a sample from the distribution over , and fit the tree so that it remains balanced with respect to this sample. The depth of the tree determines the number of partitions ().

Achieving balanced splits.

When fitting the tree, each node receives an input set of samples and a rule must be found that splits the set into two roughly equal (balanced) parts based on an attribute. We consider two types of splitting rules: the ordered median and the reference set (see Appendix C). We allow the practitioner to specify an ordered list of attributes to be used for splitting. To ensure balanced splits, we recommend selecting attributes with a large domain. If possible, we recommend preferencing attributes which are known a priori to be reliable (low distortion), as this will reduce the shuffling of entities/records between partitions. In principle, it is possible to automate the process of fitting a tree: one could grow several trees with randomly-selected splits and use the one that is most balanced. We examine balance empirically in Appendix H.
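The sketch below illustrates a k-d-tree-style partition function in Python using only the ordered-median splitting rule (the reference-set rule of Appendix C is omitted). The empirical sample is split on the median of each splitting attribute in turn, and an arbitrary attribute vector is then routed to a leaf, encoded as a partition index. All identifiers are illustrative; the production implementation lives in the Spark package.

```python
def fit_kd_partitioner(sample, split_attrs, depth):
    """Fit a k-d-tree-style partition function with (up to) 2**depth leaves.

    sample: list of attribute dicts (empirical approximation to the entity
    attribute distribution); split_attrs: ordered list of attributes to split
    on, cycled by level. Each node uses an ordered-median rule: values <= the
    median go left, the rest go right.
    """
    if depth == 0 or not sample:
        return None                                    # leaf node
    attr = split_attrs[0]
    values = sorted(r[attr] for r in sample)
    median = values[len(values) // 2]
    left = [r for r in sample if r[attr] <= median]
    right = [r for r in sample if r[attr] > median]
    rest = split_attrs[1:] + split_attrs[:1]           # cycle the splitting attributes
    return {"attr": attr, "median": median,
            "left": fit_kd_partitioner(left, rest, depth - 1),
            "right": fit_kd_partitioner(right, rest, depth - 1)}

def assign_partition(tree, attributes):
    """Route an attribute vector to a leaf, encoded as an integer partition index."""
    index = 0
    while tree is not None:
        go_right = attributes[tree["attr"]] > tree["median"]
        index = 2 * index + int(go_right)
        tree = tree["right"] if go_right else tree["left"]
    return index

# toy usage: two partitions split on surname
sample = [{"surname": s} for s in ["adams", "brown", "chen", "davis", "evans", "ford"]]
tree = fit_kd_partitioner(sample, ["surname"], depth=1)
print(assign_partition(tree, {"surname": "baker"}))    # 0 (lexicographically <= median)
```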

5 Inference

We now turn to approximating the full joint posterior distribution over the unobserved variables , , , and , as given in Equation (12). Since it is infeasible to sample from this distribution directly, we use an MCMC algorithm called partially-collapsed Gibbs (PCG) sampling (dyk_partially_2008). In addition, we show how to exploit the conditional independence induced by the partitions to distribute the PCG sampling across multiple threads or machines.

5.1 Partially-collapsed Gibbs sampling

Following the original blink paper (steorts_entity_2015), we initially experimented with regular Gibbs sampling. However, the resulting Markov chains exhibited slow convergence and poor mixing. This is a known shortcoming of Gibbs sampling which may be remedied by collapsing variables and/or updating correlated variables in groups (liu_monte_2004). These ideas form the basis for a framework called partially-collapsed Gibbs (PCG) sampling—a generalization of Gibbs sampling that has “much better convergence properties” (dyk_partially_2008).

Under the PCG framework, variables are updated in groups by sampling from their conditional distributions. These conditional distributions may be taken with respect to the joint posterior (like regular Gibbs), or with respect to marginal distributions of the joint posterior (unique to PCG). The latter case is called trimming and must be handled with care so as not to alter the stationary distribution of the Markov chain.

In applying PCG sampling to d-blink, we must decide how to apply the three tools: marginalization (equivalent to grouping), permutation (changing the order of the updates) and trimming (removing marginalized variables). In theory, the convergence rate should improve with more marginalization and trimming, however this must be balanced with the following: (i) whether the resulting conditionals can be sampled from efficiently, and (ii) whether the resulting dependence structure is compatible with our distributed set-up (see Section 5.2). We consider two samplers, PCG-I and PCG-II, described below. Of the two, we recommend PCG-I as it is more efficient in our empirical evaluations (see Section 7.1). We also include the PCG-II sampler because one would expect it to mix better than PCG-I; however, once computational cost is taken into account, its overall performance is worse (see Figure 6).

[Figure 2 schematic. Step 1: update the distortion probabilities on the manager and broadcast them to the workers. Step 2: update the links on the workers; records may only link to entities within their assigned partitions. Step 3: update the entity attributes and partition assignments on the workers, then move the entities and records to their newly-assigned partitions. Step 4: update the distortion indicators, then calculate summary statistics on the workers and send them to the manager.]
Figure 2: Schematic depicting a single iteration of distributed PCG sampling. The entity attributes (—circular nodes), record attributes and their distortion indicators (, —square nodes), and links from records to entities (—node connectors) are distributed across the workers (blue rectangular plates) according to their assigned partitions. The distortion probabilities () reside on the manager (green rounded-rectangular plate).

5.1.1 PCG-I sampler

The PCG-I sampler uses regular Gibbs updates for , and for all , and . The conditional distributions for these updates are listed in Appendix D. When updating the entity attributes and the record partition assignments , marginalization and trimming are used. Specifically, we apply marginalization by jointly updating and (the set of ’s and ’s for records linked to entity ). We then trim (analytically integrate over) .

We shall now derive this update. Referring to Equation (12), the joint posterior of , conditioned on the other parameters has the form

where superscript denotes exclusion of any records (those currently linked to entity ). Substituting the distributions and trimming yields

(13)

where

and

Note that the update for is deterministic, conditional on and .

Since we have applied trimming, we must permute the updates so that the trimmed variables are not conditioned on in later updates. This means the updates for and must come after the updates for and , but before the updates for .

5.1.2 PCG-II sampler

The PCG-II sampler is identical to PCG-I, except that it replaces the regular Gibbs update for with an update that marginalizes and trims . To derive the distribution for this update, we first consider the joint posterior of and conditioned on the other parameters:

where superscript denotes exclusion of record . Substituting the distributions and trimming yields

(14)
Update variables Dependencies
, , , ,
, , ,
, , ,
Table 2: Dependencies for the conditional updates used in PCG-I.

5.2 Distributing the sampling

By examining the conditional distributions derived in the previous section and those listed in Appendix D, one can show that the updates for the variables associated with entities and records (, , and ) only depend on variables associated with entities and records assigned to the same partition (excluding ). These dependencies are summarized in Table 2 for the PCG-I sampler. The distortion probability is an exception—it is not associated with any partition and may depend on ’s across all partitions.

This dependence structure—in particular, the conditional independence of entities and records across partitions—makes the PCG sampling amenable to distributed computing. As such, we propose a manager-worker architecture where:

  • the manager is responsible for storing and updating variables not associated with any partition (i.e. ); and

  • each worker represents a partition, and is responsible for storing and updating variables associated with the entities and records assigned to it.

The manager/workers may be processes running on a single machine or on machines in a cluster. If using a cluster, we recommend that the nodes be tightly coupled, as frequent communication between them is required.

Figure 2 depicts a single iteration of PCG sampling using our proposed manager-worker architecture. Of the four steps depicted, steps 2 and 3 are the most computationally intensive, so we can potentially achieve a significant speed-up by distributing them across the workers. The communication cost is likely to be greatest in step 3, where the entities and linked records are shuffled to their newly-assigned partitions. A well-chosen partition function can minimize this cost. If similar records/entities are co-partitioned, there should be little movement between partitions. It is important to note that Figure 2 constitutes a single iteration of PCG sampling, or more precisely, a single application of the Markov transition operator. To generate a Markov chain of length , we begin with some initial state and iteratively apply the transition operator times.
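The following Python sketch mimics the manager-worker pattern of Figure 2 using a thread pool, with one worker per partition. The statistical updates are deliberately replaced by stubs; only the communication pattern (broadcast the global parameters, update partitions in parallel, shuffle, collect summary statistics) reflects the architecture described above. The real implementation uses Spark executors rather than threads, and all names here are illustrative.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def worker_iteration(partition, theta):
    """Steps 2-4 of Figure 2 for one partition (statistical updates stubbed out).

    A real worker would update the links, entity attributes, partition
    assignments and distortion indicators held in `partition`; this stub only
    gathers summary statistics. Returns (cross-partition moves, summary stats).
    """
    moves = []                                         # nothing moves in this stub
    stats = {"n_distorted": sum(r["z"] for r in partition["records"])}
    return moves, stats

def manager_iteration(all_stats):
    """Step 1 of Figure 2: update the global distortion parameters (stubbed)."""
    total = sum(s["n_distorted"] for s in all_stats)
    return {"aggregate_distortion": total}             # placeholder for a Beta draw

# toy state: two partitions, each holding records with a distortion indicator z
partitions = [{"records": [{"z": random.randint(0, 1)} for _ in range(5)]}
              for _ in range(2)]
theta = {"aggregate_distortion": 0}

with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    for _ in range(3):                                 # three (stubbed) PCG iterations
        results = list(pool.map(lambda p: worker_iteration(p, theta), partitions))
        # shuffle step (omitted here): apply any cross-partition moves from the workers
        theta = manager_iteration([stats for _, stats in results])
print(theta)
```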

6 Computational efficiency considerations

6.1 Efficient pruning of candidate links

In this section, we describe a trick that is aimed at improving the computational efficiency of the Gibbs update for (used in the Gibbs and PCG-I samplers). This particular trick does not apply to the joint PCG update for and (used in the PCG-II sampler).

Consider the conditional distribution for the update in Equation (S5) of Appendix D:

(15)

The support of this distribution is the set of candidate links for record , which we denote by . Looking at the first indicator function above, we see that , i.e. the candidate links are restricted to the entities in the same partition as record . Thus, a naïve sampling approach for this distribution takes time.

We can improve upon the naïve approach by exploiting the fact that is often considerably smaller than . To see why this is the case, note that the second indicator function in Equation (15) further restricts if any of the distortion indicators for the observed record attributes are zero. Specifically, if and , cannot contain any entity whose -th attribute does not match the record’s -th attribute . This implies is likely to be small in the case of low distortion.

Putting aside the computation of for the moment, this means we can reduce the time required to update to . To compute efficiently, we propose maintaining an inverted index over the entity attributes within each partition. Specifically, the index for the -th attribute in partition  should accept a query value and return the set of entities that match on :

(16)

Once the index is constructed, we can efficiently retrieve the set of candidate links for record by computing a multiple set intersection:

(17)

This assumes at least one of the observed record attributes is not distorted. Otherwise .

Since the sizes of the sets are likely to vary significantly, we advise computing the intersection iteratively in increasing order of size. That is, we begin with the smallest set and retain the elements that are also in the next largest set, and so on. With a hash-based set implementation, this scales linearly in the size of the first (smallest) set.
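A minimal Python sketch of the inverted index and the iterative, smallest-first set intersection is shown below; identifiers such as build_inverted_index and candidate_links are illustrative, not the package's API.

```python
from collections import defaultdict

def build_inverted_index(entities):
    """Inverted index within one partition: (attribute, value) -> set of entity ids."""
    index = defaultdict(set)
    for eid, attrs in entities.items():
        for a, v in attrs.items():
            index[(a, v)].add(eid)
    return index

def candidate_links(record, undistorted_attrs, index):
    """Intersect the posting sets for the record's observed, undistorted
    attributes, smallest set first (as recommended in Section 6.1)."""
    postings = [index.get((a, record[a]), set()) for a in undistorted_attrs]
    if not postings:
        return set()   # caller should fall back to all entities in the partition
    postings.sort(key=len)
    candidates = set(postings[0])
    for s in postings[1:]:
        candidates &= s          # hash-based intersection, linear in the smaller set
        if not candidates:
            break
    return candidates

# toy usage
entities = {1: {"surname": "smith", "zip": "3000"},
            2: {"surname": "smith", "zip": "3053"},
            3: {"surname": "jones", "zip": "3000"}}
index = build_inverted_index(entities)
print(candidate_links({"surname": "smith", "zip": "3000"}, ["surname", "zip"], index))
# {1}
```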

6.2 Caching and truncation of attribute similarities

We have not yet emphasized that the updates for , and depend on the attribute similarities between pairs of values in the attribute domains. Specifically, for each attribute , we need to access the indexed set . These similarities may be expensive to evaluate on-the-fly, so we cache the results in memory on the workers.

To manage the quadratic scaling of , and in anticipation of another trick introduced in Section 6.3, we transform the similarities so that those below a cut-off are regarded as completely disagreeing. We achieve this by applying the following truncation transformation to the raw attribute similarity :

(18)

as illustrated in Figure 3.

Figure 3: Transformation from a raw similarity function to a truncated similarity function.

Whenever a raw attribute similarity is called for, we replace it with this truncated version. Only pairs of values with positive truncated similarity are stored in the cache—those not stored in the cache have a truncated similarity of zero by default. Note that attributes with a constant similarity function are treated specially—there is no need to cache the index set of similarities, since they are all identical.
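The sketch below illustrates one way to combine truncation with caching in Python. The shifted form max(0, sim − cutoff) is an assumed stand-in for the exact transformation in Equation (18); only pairs with positive truncated similarity are stored, and missing pairs default to zero, as described above.

```python
def truncated_similarity(sim_fn, cutoff):
    """Wrap a raw similarity so that values at or below `cutoff` become zero.
    The shifted form used here is an assumed stand-in for Equation (18)."""
    def trunc(v, w):
        return max(0.0, sim_fn(v, w) - cutoff)
    return trunc

def build_similarity_cache(domain, sim_fn, cutoff):
    """Cache only value pairs with positive truncated similarity; pairs absent
    from the cache are treated as having similarity zero. The double loop
    reflects the quadratic scaling noted above (similarity assumed symmetric)."""
    trunc = truncated_similarity(sim_fn, cutoff)
    cache = {}
    for i, v in enumerate(domain):
        for w in domain[i:]:
            s = trunc(v, w)
            if s > 0.0:
                cache[(v, w)] = cache[(w, v)] = s
    return cache

# toy usage with an ad-hoc similarity function
domain = ["smith", "smyth", "jones"]
sim = lambda v, w: 1.0 if v == w else (0.8 if v[0] == w[0] else 0.1)
print(build_similarity_cache(domain, sim, cutoff=0.5))
```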

It is important to acknowledge that the truncated similarities are an approximation to the original model. We claim that the approximation is reasonable on the following grounds:

  • Low loss of information. Below a certain cut-off, the attribute similarity function is unlikely to encode much useful information for modeling the distortion process. For example, the fact that “Chiu” has a marginally higher similarity to “Smith” than “Chen” does, doesn’t necessarily suggest that “Chiu” is more likely than “Chen” as a distorted alternative to “Smith”.

  • Precedent. In the record linkage literature, value pairs with similarities below a cut-off are regarded as completely disagreeing (winkler_methods_2002; enamorado_using_2017).

  • Efficiency gains. As we shall soon see in Section 6.3, we can perform the combined , , update more efficiently by eliminating pairs below the cut-off from consideration.

6.3 Fast updates of entity attributes using perturbation sampling

We now present a novel sampling algorithm that allows us to efficiently perform the PCG update for and . The algorithm relies on the observation that the conditional distribution for can be expressed as a mixture over two components:

  1. a base distribution over which is ideally constant for all entities; and

  2. a perturbation distribution which varies for each entity, but has a much smaller support than .

With this representation, we can avoid computing and sampling from the full distribution over , which varies for each update. Rather, we only need to compute the perturbation distribution over a much smaller support, and then sample from the mixture, which can be done efficiently using the Vose-Alias method (vose_linear_1991). We refer to this algorithm as perturbation sampling.

6.3.1 Perturbation sampling

Although we’re interested in applying perturbation sampling to a specific conditional distribution, we describe the idea in generality below.

Consider a target probability mass function (pmf) with finite support , which varies as a function of parameters . In general, one must recompute the probability tables to draw a new variate whenever changes—a computation that takes time. However, if the dependence on is of a certain restricted form, we show that it is possible to achieve better scalability by expressing the target as a mixture. This is made precise in the following result.

Proposition 3.

Let be a pmf with finite support , which depends on parameters . Suppose there exists a “base” pmf over which is independent of and a non-negative bounded perturbation term , such that can be factorized as . Then can be expressed as a mixture over the base pmf and a “perturbation” pmf over as follows:

(19)

where .

Proof.

The result is straightforward to verify by substitution. ∎

Algorithm S1 (in Appendix E) shows how to apply this result to draw random variates from a target pmf. Briefly, it consists of three steps: (i) the perturbation pmf and its normalization constant are computed; (ii) a biased coin is tossed to choose between the base pmf and the perturbation pmf ; and (iii) a random variate is drawn from the selected pmf. If is selected, a pre-initialized Alias sampler is used to draw the random variate (reused for all ). Otherwise if is selected, a new Alias sampler is instantiated. The result below states the time complexity of this algorithm (see Appendix E for a proof).

Proposition 4.

Algorithm S1 (in Appendix E) returns a random variate from the target pmf for any in time.

This is a promising result, since the size of the perturbation support is typically of order 10 for our application, while the size of the full support may be as large as . Hence, we expect a significant speed-up over the naïve approach.
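The following Python sketch implements perturbation sampling for a generic target, assuming the multiplicative factorization p(v) ∝ p0(v)(1 + g(v)) as one reading of the elided statement of Proposition 3. The base pmf is sampled via a pre-built Vose alias table in O(1), while a small alias table over the perturbation support is built for each draw; all names are illustrative.

```python
import random

class AliasSampler:
    """Vose's alias method: O(n) construction, O(1) draws from a discrete pmf."""
    def __init__(self, weights):
        n = len(weights)
        total = float(sum(weights))
        scaled = [w * n / total for w in weights]
        self.n, self.alias, self.threshold = n, [0] * n, [0.0] * n
        small = [i for i, p in enumerate(scaled) if p < 1.0]
        large = [i for i, p in enumerate(scaled) if p >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            self.threshold[s], self.alias[s] = scaled[s], l
            scaled[l] -= 1.0 - scaled[s]
            (small if scaled[l] < 1.0 else large).append(l)
        for i in small + large:
            self.threshold[i] = 1.0

    def sample(self):
        i = random.randrange(self.n)
        return i if random.random() < self.threshold[i] else self.alias[i]

def perturbation_sample(support, base_pmf, base_sampler, perturbation):
    """Draw from p(v) proportional to base_pmf(v) * (1 + g(v)), where the
    non-negative perturbation g (the dict `perturbation`) is non-zero only on
    a small subset of the support."""
    weights = {v: base_pmf(v) * g for v, g in perturbation.items() if g > 0.0}
    G = sum(weights.values())                      # unnormalized perturbation mass
    if G == 0.0 or random.random() < 1.0 / (1.0 + G):
        return support[base_sampler.sample()]      # reuse the pre-built base sampler
    values = list(weights)
    alias = AliasSampler([weights[v] for v in values])  # small support, cheap to build
    return values[alias.sample()]

# toy usage: base = empirical distribution over a name domain,
# perturbation concentrated on values similar to the linked records
support = ["smith", "smyth", "jones", "brown"]
base_weights = [0.4, 0.1, 0.3, 0.2]
base_sampler = AliasSampler(base_weights)          # built once, reused for every update
base_pmf = dict(zip(support, base_weights)).get
perturbation = {"smith": 5.0, "smyth": 3.0}        # only "similar" values perturbed
print(perturbation_sample(support, base_pmf, base_sampler, perturbation))
```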

6.3.2 Application of perturbation sampling

We now return to our original objective: performing the joint PCG update for and . Referring to Equation (13), we can express the conditional distribution for (i.e. the target distribution) as

(20)

The base distribution is given by

(21)

where is the number of records linked to entity with observed values for attribute ; and the perturbation term is given by

(22)

The full support of the target pmf is , while the perturbation support is given by

In words, this set consists of the observed values for attribute  in the records linked to entity , plus any sufficiently similar values from the attribute domain (for which the truncated similarity is non-zero). The size of the perturbation set will vary depending on the cut-off used for the truncation transformation—the higher the cut-off, the smaller the set. This implies that there is a trade-off between efficiency (small perturbation set) and accuracy (lower cut-off).

Remark.

The astute reader may have noticed that the base distribution given in Equation (21) is not completely independent of the conditioned parameters, as is required by Proposition 3. In particular, depends on —roughly the size of entity . Fortunately, we expect the range of regularly encountered entity sizes to be small, so we sacrifice some memory by instantiating multiple Alias samplers for each in some expected range. In the worst case, when a value is encountered outside the expected range and the base distribution is required (unlikely since the weight on the base component is typically small), we instantiate the base distribution on-the-fly (same asymptotic cost as the naïve approach).

7 Empirical evaluation

Data set       # records   # files   # entities   # categorical attributes   # string attributes
ABSEmployee    600,000     3         400,000      4                          0
NCVR           448,134     2         296,433      3                          3
NLTCS          57,077      3         34,945       6                          0
SHIW0810       39,743      2         28,584       8                          0
RLdata10000    10,000      1         9,000        2                          3
Table 3: Summary of data sets. The ABSEmployee and RLdata10000 data sets are synthetic.

We evaluated d-blink using two synthetic and three real data sets, as summarized in Table 3. All results presented here were obtained using our local server in pseudocluster mode, however some were replicated on a cluster in the Amazon public cloud (see Appendix G) to test the effect of higher communication costs. Further details about the data sets, hardware, implementation and parameter settings are provided in Appendix F.

7.1 Computational and sampling efficiency

Following turek_efficient_2016, we measured the efficiency of d-blink using the rate of effective samples produced per unit time (ESS rate), which balances sampling efficiency (related to mixing/autocorrelation) and computational efficiency. We used the mcmcse R package (mcmcse) to compute the effective sample size (ESS), which implements a multivariate method proposed by vats_multivariate_2015.

Since the number of unobserved variables in the model is unwieldy, we computed the ESS for the following summary statistics (a computation sketch follows the list):

  • the number of observed entities (scalar);

  • the aggregate distortion for each attribute (vector); and

  • the cluster size distribution (vector containing frequency of 0-clusters, 1-clusters, 2-clusters, etc.).
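For concreteness, the following Python sketch computes these summary statistics from a single posterior sample. The input format (a record-to-entity map and per-attribute distortion indicators) is hypothetical, and counting 0-clusters additionally requires the fixed population size, which is omitted here.

```python
from collections import Counter

def summary_statistics(links, distortions):
    """Summary statistics used for the ESS calculations.

    links: maps each record id to its linked entity id for one posterior sample.
    distortions: maps each attribute to a list of 0/1 distortion indicators
    across all records (hypothetical input format).
    """
    cluster_sizes = Counter(links.values())             # entity id -> number of records
    n_observed_entities = len(cluster_sizes)            # entities with at least one record
    aggregate_distortion = {a: sum(z) for a, z in distortions.items()}
    size_distribution = Counter(cluster_sizes.values()) # frequency of 1-, 2-, ... clusters
    return n_observed_entities, aggregate_distortion, size_distribution

# toy example: four records linked to three entities
links = {("t1", 1): "e7", ("t1", 2): "e7", ("t2", 1): "e3", ("t2", 2): "e9"}
distortions = {"name": [0, 1, 0, 0], "zip": [0, 0, 1, 0]}
print(summary_statistics(links, distortions))
```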

d-blink versus blink.

We compared d-blink (using the PCG-I sampler) to our own implementation of blink (i.e. a Gibbs sampler without any of the tricks described in Section 6). For a fair comparison, we switched off partitioning in d-blink. We used the relatively small RLdata10000 data set, as blink cannot cope with larger data sets. Figure 4 contains trace plots for two summary statistics as a function of running time. It is evident that blink has not converged to the equilibrium distribution within the allotted time of 11 hours, while d-blink converges to equilibrium in 100 seconds. Looking solely at the time per iteration, d-blink is at least 200× faster than blink.

Figure 4: Comparison of convergence rates for d-blink and blink. The summary statistics for d-blink (number of observed entities on the left and attribute distortions on the right) rapidly converge to equilibrium, while those for blink fail to converge within 11 hours.
Partitioning and efficiency.

We tested the effect of varying the number of partitions on the efficiency of d-blink. For each number of partitions considered, we computed the ESS rate averaged over 3000 iterations. We used the NLTCS data set and the PCG-I sampler. Figure 5 presents the results in terms of the speed-up relative to the ESS rate without partitioning. We observe a near-linear speed-up in the number of partitions, with one exception. The speed-up is expected to taper off with increasing numbers of partitions, as parallel gains in efficiency are overcome by losses due to communication costs and/or poorer mixing. This tipping point seems difficult to predict for a given setup, as it depends on complex factors such as the data distribution, the splitting rules used, and the hardware characteristics.

Figure 5: Efficiency of d-blink as a function of the number of partitions and summary statistic of interest (larger is better). The speed-up measures the ESS rate relative to the ESS rate with no partitioning, for the NLTCS data set.
Sampling methods and efficiency.

We evaluated the efficiency of the three samplers introduced in Section 5.1 (Gibbs, PCG-I and PCG-II). As above, we computed the ESS rate as an average over 3000 iterations. We set  and used the NLTCS data set. The results, shown in Figure 6, indicate that the PCG-I sampler is considerably more efficient (by a factor of 1.5–2) than the baseline Gibbs sampler for this data set. We also observe that the PCG-II sampler performs quite poorly in comparison: 20–30× slower than the Gibbs sampler. This is because the marginalization and trimming for the update for PCG-II prevent us from applying the trick described in Section 6.1. Thus, although PCG-II is expected to be more efficient in terms of reducing autocorrelation, it is less efficient overall as each iteration is too computationally expensive.

Figure 6: Efficiency of d-blink as a function of the sampler and summary statistic of interest (larger is better). All measurements are for the NLTCS data set with .

7.2 Linkage quality

Though not our primary focus, we assessed the performance of d-blink in terms of its predictions for the linkage structure (the matching step) for the data sets in Table 3. This was not previously possible with blink, as it could only scale to small data sets of around 1000 records (steorts_entity_2015).

Point estimate methodology.

To evaluate the matching performance of d-blink with respect to the ground truth, we extracted a point estimate of the linkage structure from the posterior using the shared most probable maximal matching sets (sMPMMS) method (steorts_bayesian_2016). This method circumvents the problem of label switching (jasra_markov_2005)—where the identities of the entities do not remain constant along the Markov chain.

The sMPMMS method involves two main steps. In the first step, the most probable entity cluster is computed for each record based on the posterior samples. In general, these entity clusters will conflict with one another—e.g. the most probable entity cluster for one record may include a second record whose own most probable entity cluster is different. The second step resolves these conflicts by assigning precedence to links between records and their most probable entity clusters. The result is a globally-consistent estimate of the linkage structure—i.e. it satisfies transitivity.

We distributed the computation of the sMPMMS method in the Spark framework. We used approximate posterior samples which were derived from a Markov chain by discarding an initial burn-in period (60k iterations for NCVR, as it was slow to converge) and applying a thinning interval of 10. These parameters were chosen by inspection of trace plots, some of which are reported in Appendix K. In contrast to the point estimates reported here, we also examined full posterior estimation in Appendix K.

Baseline methods.

We compared the linkage quality of d-blink with three baseline methods as described below. We focused on scalable methods as we assumed very little to no training data was available.

Data set       Method                 Precision   Recall   F1-score   ARI      Err. # clust.
ABSEmployee d-blink 0.9763 0.8530 0.9105 0.9105 +1.667%
Fellegi-Sunter (10) 0.9963 0.8346 0.9083
Fellegi-Sunter (100) 0.9963 0.8346 0.9083
Near Matching 0.0378 0.9930 0.0728
Exact Matching 0.9939 0.8346 0.9074 0.9074 +9.661%
NCVR d-blink 0.9145 0.9653 0.9392 0.9392 –3.562%
Fellegi-Sunter (10) 0.9868 0.7874 0.9083
Fellegi-Sunter (100) 0.9868 0.7874 0.9083
Near Matching 0.9899 0.7443 0.8497
Exact Matching 0.9925 0.0017 0.0034 0.0034 +51.09%
NLTCS d-blink 0.8319 0.9103 0.8693 0.8693 –22.09%
Fellegi-Sunter (10) 0.9094 0.9087 0.9090
Fellegi-Sunter (100) 0.9094 0.9087 0.9090
Near Matching 0.0600 0.9563 0.1129
Exact Matching 0.8995 0.9087 0.9040 0.9040 +2.026%
SHIW0810 d-blink 0.2514 0.5396 0.3430 0.3429 –37.65%
Fellegi-Sunter (10) 0.0028 0.9050 0.0056
Fellegi-Sunter (100) 0.0025 0.9161 0.0050
Near Matching 0.0043 0.9111 0.0086
Exact Matching 0.1263 0.7608 0.2166 0.2166 –37.40%
RLdata10000 d-blink 0.6334 0.9970 0.7747 0.7747 –10.97%
Fellegi-Sunter (10) 0.9957 0.6174 0.7622
Fellegi-Sunter (100) 0.9364 0.8734 0.9038
Near Matching 0.9176 0.9690 0.9426
Exact Matching 1.0000 0.0080 0.0159 0.0159 +11.02%
Table 4: Comparison of matching quality. “ARI” stands for adjusted Rand index and “Err. # clust.” is the percentage error in the number of clusters.
  • Exact Matching. Links records that match on all attributes. It is unsupervised and ensures transitivity.

  • Near Matching. Links records that match on at least attributes. It is unsupervised, but does not guarantee transitivity.

  • Fellegi-Sunter. Links records according to a pairwise match score that is a weighted sum of attribute-level dis/agreements. The weights are specified by the Fellegi-Sunter model (fellegi_theory_1969) and were estimated using the expectation-maximization algorithm, as implemented in the RecordLinkage R package (sariyar_recordlinkage_2010). We chose the threshold on the match score to optimize the F1-score using a small amount of training data (size 10 and 100). This makes the method semi-supervised. Note that the training data was sampled in a biased manner to deal with the imbalance between the matches and non-matches (half with match scores above zero and half below). The method does not guarantee transitivity.

Results.

Table 4 presents performance measures categorized by data set and method. The pairwise performance measures (precision, recall and F1-score) are provided for all methods, however the cluster performance measures (adjusted Rand Index, see vinh_information_2010, and percentage error in the number of clusters) are only valid for methods that guarantee transitivity (d-blink and Exact Matching). Despite being fully unsupervised, d-blink achieves competitive performance when compared to the semi-supervised Fellegi-Sunter method. The two simple baselines, Near Matching and Exact Matching, are acceptable for data sets with low noise but perform poorly otherwise (e.g. NCVR and RLdata10000). We conducted an empirical sensitivity analysis for d-blink with respect to variations in the hyperparameters. The results for RLdata10000 (included in Appendix J) show that d-blink is somewhat sensitive to all of the hyperparameters tested, however sensitivity is in general predictable, following clear and intuitive trends. One interesting observation is the fact that d-blink tends to overestimate the amount of distortion. This is perhaps not surprising given the absence of ground truth.

8 Concluding remarks

We have proposed d-blink: a distributed, scalable extension of the blink model for unsupervised Bayesian ER. Our key insight was an auxiliary variable representation of the model that induces a partitioning over the entities and records. This integrates a kind of “blocking step” into the model, which improves scalability and enables distributed inference, while allowing blocking and ER errors to propagate. We also contributed several ideas for improving the efficiency of the inference, including the use of partially-collapsed Gibbs sampling, inverted indices for updating links and a novel perturbation sampling algorithm for updating entity attributes. Our experiments show that d-blink can achieve significant efficiency gains compared to blink. With additional computing resources, we expect d-blink could be scaled to databases with 1–10 million records.

SUPPLEMENTAL MATERIALS

Code:

An implementation of d-blink built on the Apache Spark distributed computing framework (Zip file). The source code is also hosted at https://github.com/ngmarchant/dblink.

Data:

An archive containing data sets that we have permission to redistribute (Zip file).


ACKNOWLEDGEMENTS

N. Marchant acknowledges the support of an Australian Government Research Training Program Scholarship and the AMSIIntern program hosted by the Australian Bureau of Statistics. R. C. Steorts and A. Kaplan acknowledge the support of NSF SES-1534412 and CAREER-1652431. B. Rubinstein acknowledges the support of Australian Research Council grant DP150103710. N. Marchant and B. Rubinstein also acknowledge support of Australian Bureau of Statistics project ABS2018.363.

References

Appendices for “d-blink: Distributed End-to-End Bayesian Entity Resolution”

Appendix A Derivation of the posterior distribution

Here we sketch the derivation of the joint posterior distribution over the unobserved variables conditioned on the observed record attributes , which is given in Equation (12) of the paper. First we read the factorization off the plate diagram in Figure 1, together with the conditional dependence assumptions detailed in Section 3.2 of the paper. We obtain the following expression, up to a normalization constant:

Ideally, we’d like to marginalize out all variables except and (the variables of interest), however this is not tractable analytically. Fortunately, we can marginalize out the missing record attributes which yields Equation (12) from the paper:

We can expand this further by substituting the conditional distributions given in Section 3.2 of the paper. This yields:

(S1)

Appendix B Equivalence of d-blink and blink

In this section, we present proofs of Propositions 1 and 2, which show that the inferences we obtain from d-blink are equivalent to those we would obtain from blink under certain conditions.

B.1 Proof of Proposition 1: equivalence of distance/similarity representations

It is straightforward to show that as defined in Equation (10) of the paper satisfies the requirements of Definition 3.3. All that remains is to show that the two parameterizations of the distortion distribution are equivalent. Beginning with as parameterized in blink, we substitute Equation (10) and observe that

This is identical to our parameterization in Equation (9). ∎

B.2 Proof of Proposition 2: equivalence of d-blink and blink

Given that

  • Proposition 1 holds,

  • the distortion hyperparameters are the same for all attributes, and

  • all record attributes are observed,

the only factor in the posterior that differs from blink is:

(S2)

Substituting the density for the conditional distributions for a single factor yields:

Putting this in Equation (S2) and marginalizing over we obtain: