dblink: Distributed End-to-End Bayesian Entity Resolution
Department of Statistical Science, Duke University
Department of Statistics, Colorado State University
Methodology Division, Australian Bureau of Statistics
September 16, 2019
Abstract
Entity resolution (ER) (record linkage or deduplication) is the process of merging together noisy databases, often in the absence of a unique identifier. A major advancement in ER methodology has been the application of Bayesian generative models. Such models provide a natural framework for clustering records to unobserved (latent) entities, while providing exact uncertainty quantification and tight performance bounds. Despite these advancements, existing models do not scale to realistically-sized databases (larger than 1000 records) and they do not incorporate probabilistic blocking. In this paper, we propose "distributed Bayesian linkage" or dblink, the first scalable and distributed end-to-end Bayesian model for ER, which propagates uncertainty in blocking, matching and merging. We make several novel contributions, including: (i) incorporating probabilistic blocking directly into the model through auxiliary partitions; (ii) support for missing values; (iii) a partially-collapsed Gibbs sampler; and (iv) a novel perturbation sampling algorithm (leveraging the Vose-Alias method) that enables fast updates of the entity attributes. Finally, we conduct experiments on five data sets which show that dblink can achieve significant efficiency gains, in excess of 300x, when compared to existing non-distributed methods.
Keywords: auxiliary variable, distributed computing, Markov chain Monte Carlo, partially-collapsed Gibbs sampling, record linkage
1 Introduction
Entity resolution (ER) (record linkage and deduplication) is the process of identifying and merging records across data sources that refer to the same underlying entities. In the absence of unique identifiers (e.g. social security numbers), entities are resolved by searching for clusters of records that have sufficiently similar attributes. This can be a challenging task due to disparities between data sources, poor data quality and temporal variation.
Much of the ER literature advocates a pipeline approach (christen_data_2012; dong_big_2015) comprising four main stages. In the first stage, known as attribute/schema alignment, a set of common attributes are identified among the data sets. In the second stage, known as blocking or indexing, the set of all record pairs is pruned to yield a smaller set of candidate matches. This set is refined in the third stage, known as matching or scoring, typically by thresholding a weighted sum of pairwise attribute similarities. Finally, in the fourth stage, known as merging or fusion, the matching record pairs are mapped to consistent clusters, which are then merged to produce single representative records.
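The four stages described above can be sketched in code. The blocking key, match weights and threshold below are illustrative choices of ours, not prescriptions from the literature; merging is done with a simple union-find over matching pairs:

```python
from itertools import combinations

def block_key(record):
    # Stage 2 (blocking): an illustrative key -- first letter of surname plus birth year.
    return (record["surname"][:1], record["byear"])

def match_score(r1, r2):
    # Stage 3 (matching): an illustrative weighted sum of exact-match comparisons.
    weights = {"surname": 0.5, "given": 0.3, "byear": 0.2}
    return sum(w for a, w in weights.items() if r1[a] == r2[a])

def resolve(records, threshold=0.7):
    # Only pairs sharing a block key are compared.
    blocks = {}
    for i, r in enumerate(records):
        blocks.setdefault(block_key(r), []).append(i)
    # Union-find structure for stage 4 (merging matched pairs into clusters).
    parent = list(range(len(records)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for ids in blocks.values():
        for i, j in combinations(ids, 2):
            if match_score(records[i], records[j]) >= threshold:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

records = [
    {"surname": "smith", "given": "john", "byear": 1970},
    {"surname": "smith", "given": "john", "byear": 1970},
    {"surname": "jones", "given": "mary", "byear": 1985},
]
clusters = resolve(records)
```

Note that any record pair eliminated at the blocking stage can never be matched later, which is precisely the loss of information discussed next.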
While a pipeline approach can be flexible and efficient at scale, it suffers from a major deficiency: uncertainties encountered at each stage are not propagated through the pipeline, nor are they included in the final output. As a result, an incorrect decision at an earlier stage (such as a poor blocking design) cannot be corrected at later stages. In short, information is lost and accuracy may suffer. Furthermore, analysts are unable to incorporate ER uncertainty in a post-ER analysis, e.g. an inferential or prediction task.
Bayesian modeling offers a promising solution to the problem of rigorous uncertainty propagation. Indeed, Bayesian modeling for ER has been an active area of research in recent years (copas_record_1990; belin_1995; larsen_2001; tancredi_hierarchical_2011; zhao_bayesian_2012; gutman_bayesian_2013; sadinle_detecting_2014; steorts_entity_2015; steorts_bayesian_2016; zanella_flexible_2016; sadinle_bayesian_2017). In addition to providing a principled treatment of uncertainty, the Bayesian paradigm allows for flexible modeling assumptions, including the incorporation of prior information. The latter can provide a regularizing effect which is useful when little or no training data is available. Recent work has also shown that theoretical performance bounds can be obtained for Bayesian ER models (steorts_performance_2017), whereas counterparts for rule-based and discriminative methods seem implausible.
Despite the successes of Bayesian models for ER, we are not aware of any models that scale to realistically-sized databases (larger than 1000 records). In conventional ER pipelines, scalability is addressed at the blocking/indexing stage, where a large fraction of the possible record comparisons are eliminated (christen_data_2012). "Traditional blocking" (steorts_comparison_2014) has been applied in an ad-hoc fashion to Bayesian ER models (steorts_bayesian_2016), however it does not preserve the posterior distribution, nor does it propagate uncertainty between the blocking and matching stages.
In this paper, we propose a scalable and distributed extension to the blink model for end-to-end Bayesian ER (steorts_entity_2015), which we call "distributed blink" or dblink. Our key insight is an auxiliary variable representation that induces a partitioning over the entities and records, while preserving the marginal posterior. The partitions play a similar role as traditional blocks, however the assignment of entities and records to partitions is probabilistic and is inferred jointly with the other variables. This representation enables distributed inference at the partition level, as variables in distinct partitions become conditionally independent. The fact that this is possible gives the blink model a clear advantage in terms of scalability. It is unclear whether a similar approach can be adapted to other Bayesian ER models with more complex (e.g. nonparametric) linkage priors.
To construct the partitions, we propose a method based on k-d trees which co-partitions similar entities, while achieving well-balanced partitions. We make several other novel contributions including support for record values missing completely at random; a partially-collapsed Gibbs sampler that balances constraints imposed by distributed computing; and insights into improving computational efficiency. The latter insights include: (i) a sub-quadratic algorithm for updating links based on indexing; (ii) a truncation approximation for the attribute similarities; and (iii) a novel perturbation sampling algorithm for updating the entity attributes, which leverages the Vose-Alias method (vose_linear_1991).
We implement our proposed methodology as an open-source Apache Spark package and conduct empirical evaluations on two synthetic and three real data sets. One of the synthetic data sets, RLdata10000, mimics administrative data with distortion and duplication and has been commonly used as a benchmark in the record linkage literature. The second synthetic data set (ABSEmployee) is motivated by real linkages of employment survey data undertaken at the Australian Bureau of Statistics (ABS). It is not derived from real data due to strict confidentiality restrictions. We include this data set in the supplementary material so that others may use it for evaluation of data linkage methods.
Additionally, we apply our methodology to three real applications of entity resolution, each of which is an important and challenging problem. First we link two snapshots of the North Carolina Voter Registration (NCVR) data set. It is necessary to identify duplicates within and across snapshots before computing statistics or performing prediction or inferential tasks. The second application is from the medical field, where it is common for records to be duplicated across locally-managed databases. Specifically, we examine de-identified person data from a longitudinal study called the National Long Term Care Survey (NLTCS). There are duplicates across all three waves of the study, which must be removed before any analysis can be performed. The third data set is derived from the 2008 and 2010 waves of the Survey of Household Income and Wealth (SHIW). It is a de-identified categorical data set and duplicates must be removed to draw valid statistical conclusions. We choose to work with three very different real applications to understand the scalability and accuracy of our methods, in cases where ground truth is known.
Our computational and sampling improvements can yield efficiency gains in excess of 300x compared to blink, and we observe near-linear gains with increasing numbers of partitions. Crucially, our results indicate that these efficiency dividends do not come at the cost of accuracy.
2 Related work
The related work spans three key areas: ER methodology, inference for Bayesian ER models, and distributed MCMC.
Entity resolution methodology.
Matching pairs of tables was first addressed with a probabilistic approach by newcombe_automatic_1959. fellegi_theory_1969 later formalized this approach within the framework of decision theory. Many extensions to the Fellegi-Sunter (FS) model have been proposed (for surveys, see winkler_overview_2006; winkler2014matching), including a recent extension to multiple databases (sadinle_generalized_2013). However, the FS model has been criticized due to its lack of support for duplicates within databases; misspecified independence assumptions; and its dependence on subjective thresholds (tancredi_hierarchical_2011). These limitations have prompted development of more sophisticated Bayesian generative models, including models for bipartite matching (copas_record_1990; belin_1995; larsen_2001; tancredi_hierarchical_2011; gutman_bayesian_2013; sadinle_bayesian_2017), deduplication (sadinle_detecting_2014; steorts_entity_2015; steorts_bayesian_2016) and matching across multiple databases (steorts_entity_2015; steorts_bayesian_2016). All of these models operate on structured data, and most (copas_record_1990; belin_1995; larsen_2001; tancredi_hierarchical_2011; gutman_bayesian_2013; steorts_bayesian_2016) rely solely on binary attribute-level comparisons to inform matching. Recent models (steorts_entity_2015; sadinle_detecting_2014; sadinle_bayesian_2017) additionally utilize attribute-level similarity scores.
Apart from these advances in Bayesian approaches to matching (largely undertaken in statistics), there have been an abundance of contributions from the database and machine learning communities (see surveys by getoor_entity_2012; christen_data_2012). Their focus has typically been on rule-based approaches (fan_reasoning_2009; singh_generating_2017), supervised learning approaches (mudgal_deep_2018), hybrid human-machine approaches (wang_crowder:_2012; gokhale_corleone:_2014), and scalability (papadakis_comparative_2016). Broadly speaking, all of these approaches rely on either humans in the loop or large amounts of labelled training data, which is not generally the case in the Bayesian setting.
Inference for Bayesian ER models.
Most prior work on Bayesian generative models for ER (e.g. tancredi_hierarchical_2011; gutman_bayesian_2013; steorts_entity_2015) has relied on Gibbs sampling for inference. Compared to other Markov chain Monte Carlo (MCMC) algorithms, Gibbs sampling is relatively easy to implement, however it may suffer from slow convergence and poor mixing owing to its highly local moves (liu_monte_2004).
In the broader context of clustering models, the split-merge algorithm (jain_splitmerge_2004) has been proposed as an alternative to Gibbs sampling. It traverses the space of clusterings through proposals that split single clusters or merge pairs of clusters, and is thus less susceptible to becoming trapped in local modes. steorts_bayesian_2016 applied this algorithm to an ER model similar to blink. However, it remains unclear whether the split-merge algorithm is efficient for ER problems, where the number of clusters is expected to grow almost linearly in the number of records (zanella_flexible_2016).
More recently, zanella_flexible_2016 proposed the chaperones algorithm for inference in microclustering models. It is similar in spirit to the split-merge algorithm, however it is based on restricted Gibbs sampling rather than the Metropolis-Hastings framework. It "focuses on cluster reassignments with higher probabilities" by maintaining a biased distribution over the space of record pairs. However, its scalability is likely to be limited as a result.
In this work, we apply partially-collapsed Gibbs sampling (dyk_partially_2008), as we expect it will be more efficient for ER models than the aforementioned methods, while still yielding an improvement over regular Gibbs sampling.
Parallel/distributed MCMC.
Recent literature has focused on using parallel and distributed computing to scale up MCMC algorithms, where applications have included Bayesian topic models (newman_distributed_2009; smola_architecture_2010; ahn_distributed_2014) and mixture models (williamson_parallel_2013; chang_parallel_2013; lovell_clustercluster:_2013; ge_distributed_2015). We review the application to mixture models, as they are conceptually similar to ER models.
Existing work has concentrated on Dirichlet process (DP) mixture models and Hierarchical DP mixture models. The key to enabling distributed inference for these models is the realization that a DP mixture model can be reparameterized as a mixture of DPs. Put simply, the reparameterized model induces a partitioning of the clusters, such that clusters assigned to distinct partitions are conditionally independent. As a result, variables within partitions can be updated in parallel.
williamson_parallel_2013 exploited this idea at the thread level to parallelize inference for a DP mixture model. chang_parallel_2013 followed a similar approach, but included an additional level of parallelization within partitions using a parallelized version of the split-merge algorithm. Others (lovell_clustercluster:_2013; ge_distributed_2015) have developed distributed implementations in the MapReduce framework.
We do not consider DP mixture models in our work, as their behaviour is ill-suited for ER applications: with a DP prior, the number of clusters grows logarithmically in the number of records, but empirical observations call for near-linear growth (zanella_flexible_2016). However, we do borrow the reparameterization idea, albeit with a more flexible partition specification which permits similar entities to be co-partitioned and facilitates load balancing. It would be interesting to see whether similar ideas can be applied to microclustering models (zanella_flexible_2016), however preserving the marginal posterior distribution seems challenging in this case.
3 A scalable model for Bayesian ER
We now present our extension to the blink model for Bayesian ER (steorts_entity_2015), which incorporates auxiliary partitions, support for missing values, and generic attribute similarity functions. We describe notation and assumptions in Section 3.1, before outlining the generative process in Section 3.2. Attribute similarity measures are defined in Section 3.3, including a truncation approximation which improves scalability. In Section 3.4, we demonstrate that the marginal posterior of dblink is equivalent to blink under certain conditions. Finally, in Section 3.5 we provide intuition for the auxiliary partition representation.
3.1 Notation and assumptions
Consider a collection of tables (databases) indexed by $t = 1, \dots, T$, each with records (rows) indexed by $r = 1, \dots, R_t$ and aligned attributes (columns) indexed by $a = 1, \dots, A$. (We define a table as an ordered, indexed collection of records, which may contain duplicates: records for which all attributes are identical.) We assume each record links to a single entity, denoted by $\lambda_{tr}$, from a fixed population of entities of size $E$ indexed by $e$. The value of the $a$-th attribute for record $(t, r)$ is denoted by $x_{tra}$, and is assumed to be a noisy observation of the linked entity's true attribute value $y_{\lambda_{tr} a}$. We allow for the fact that some attributes may be missing completely at random through a corresponding indicator variable $o_{tra}$ (little_statistical_2002, p. 12).
Table 1 summarizes our notation, including model-specific variables which will be introduced shortly. We adopt the following rules to compactly refer to sets of variables:

A boldface lowercase variable denotes the set over all attributes: e.g. $\boldsymbol{y}_e = \{y_{ea}\}_{a=1}^{A}$.

A boldface capital variable denotes the set over all index combinations: e.g. $\boldsymbol{Y} = \{\boldsymbol{y}_e\}_{e=1}^{E}$.

We also define notation to separate the record attributes into an observed part $\boldsymbol{X}^{\mathrm{obs}}$ (those $x_{tra}$'s for which $o_{tra} = 1$) and a missing part $\boldsymbol{X}^{\mathrm{miss}}$ (those $x_{tra}$'s for which $o_{tra} = 0$).
After specifying a model, our goal is to perform end-to-end ER by inferring the joint posterior distribution over the links $\boldsymbol{\Lambda}$ and the true entity attribute values $\boldsymbol{Y}$. We condition only on the observed record attribute values $\boldsymbol{X}^{\mathrm{obs}}$. In other words, we operate in a fully unsupervised setting, assuming no ground truth data is available for the links or entities.
Note that inferring $\boldsymbol{\Lambda}$ is equivalent to the matching stage of the ER pipeline, while inferring $\boldsymbol{Y}$ is equivalent to the merging stage. By inferring $\boldsymbol{\Lambda}$ and $\boldsymbol{Y}$ jointly, we are able to propagate uncertainty between the two stages. We will later show how to incorporate probabilistic blocking, while allowing full propagation of uncertainty.
3.2 Model specification
We now describe the generative process for our proposed model dblink. A plate diagram is depicted in Figure 1, with key differences from blink highlighted in a dashed blue line style.
| Symbol | Description | Symbol | Description |
| $t$ | index over tables | $\gamma_{tr}$ | assigned partition for record $r$ in table $t$ |
| $r$ | index over records in table $t$ | $\lambda_{tr}$ | assigned entity for record $r$ in table $t$ |
| $e$ | index over entities | $\theta_{ta}$ | prob. attribute $a$ in table $t$ is distorted |
| $b$ | index over partitions | $(\alpha_a, \beta_a)$ | distortion hyperparams. for attribute $a$ |
| $a$ | index over attributes | $\eta_{ta}$ | prob. attribute $a$ in table $t$ is observed |
| $v$ | index over domain of attribute $a$ | $\mathcal{D}_a$ | domain of attribute $a$ |
| $R$ | total number of records | $\phi_a$ | distribution over domain of attribute $a$ |
| $x_{tra}$ | attribute $a$ for record $r$ in table $t$ | $\mathrm{sim}_a$ | similarity measure for attribute $a$ |
| $z_{tra}$ | distortion indicator for $x_{tra}$ | $R(e)$ | set of records assigned to entity $e$ |
| $o_{tra}$ | observed indicator for $x_{tra}$ | $E(b)$ | set of entities assigned to partition $b$ |
| $y_{ea}$ | attribute $a$ for entity $e$ | $\mathrm{PartFn}$ | partition assignment function |
Entities.
Each entity in the population (of fixed size $E$) is described by $A$ "true" attributes. The value of attribute $a$ for entity $e$ is denoted by $y_{ea}$ and is assumed to be drawn independently from the domain $\mathcal{D}_a$ of the attribute according to the empirical distribution $\phi_a$ (derived from the tables):

(1)  $y_{ea} \overset{\mathrm{ind}}{\sim} \phi_a.$
Partitions.
dblink deviates from blink by splitting the entities into $B$ partitions according to their attribute values. (In the database and distributed computing communities, a partition refers to a part of a divided object. We use the term in this sense, which differs from the usage in set theory.) The assignment of entities to partitions is achieved through a deterministic partition function:

(2)  $\mathrm{PartFn} : \bigotimes_{a=1}^{A} \mathcal{D}_a \to \{1, \dots, B\},$

where $\bigotimes$ is the Kronecker product. Further details are given in Section 4, along with an example based on k-d trees. We also introduce the notation $E(b) = \{e : \mathrm{PartFn}(\boldsymbol{y}_e) = b\}$, which allows us to concisely refer to the set of entities assigned to partition $b$.
Distortion.
Associated with each table $t$ and attribute $a$ is a distortion probability $\theta_{ta}$, with assumed prior distribution:

(3)  $\theta_{ta} \sim \mathrm{Beta}(\alpha_a, \beta_a),$

where $\alpha_a$ and $\beta_a$ are hyperparameters. We provide recommendations for setting $\alpha_a$ and $\beta_a$ in Appendix F. The distortion probabilities feed into the record-generation process below.
Records.
We assume the records are generated by selecting an entity uniformly at random and copying the entity's attributes subject to distortion. The process for generating record $r$ in table $t$ is outlined below. Steps (i), (ii), and (v) deviate from blink.

(i) Choose a partition assignment $\gamma_{tr}$ at random in proportion to the partition sizes:

(4)  $\Pr(\gamma_{tr} = b) = |E(b)| / E.$

(ii) Choose an entity assignment $\lambda_{tr}$ uniformly at random from partition $\gamma_{tr}$:

(5)  $\lambda_{tr} \mid \gamma_{tr} \sim \mathrm{Uniform}(E(\gamma_{tr})).$

(iii) For each attribute $a$, draw a distortion indicator $z_{tra}$:

(6)  $z_{tra} \mid \theta_{ta} \sim \mathrm{Bernoulli}(\theta_{ta}).$

(iv) For each attribute $a$, draw a record value $x_{tra}$:

(7)  $x_{tra} \mid \lambda_{tr}, z_{tra}, \boldsymbol{y}_{\lambda_{tr}} \sim (1 - z_{tra})\,\delta(y_{\lambda_{tr} a}) + z_{tra}\,\tilde{\phi}_a(\cdot \mid y_{\lambda_{tr} a}),$

where $\delta(\cdot)$ represents a point mass. If $z_{tra} = 0$, $x_{tra}$ is copied directly from the entity. Otherwise, $x_{tra}$ is drawn from the domain $\mathcal{D}_a$ according to the distortion distribution $\tilde{\phi}_a$. In the literature, this is known as a hit-miss model (copas_record_1990).

(v) For each attribute $a$, draw an observed indicator $o_{tra}$:

(8)  $o_{tra} \sim \mathrm{Bernoulli}(\eta_{ta}).$

If $o_{tra} = 1$, $x_{tra}$ is observed, otherwise it is missing.
Detail on the distortion distribution.
The distortion distribution $\tilde{\phi}_a(\cdot \mid y)$ chooses a distorted value for attribute $a$ conditional on the true value $y$. In our parameterization of the model, it is defined as

(9)  $\tilde{\phi}_a(x \mid y) = \frac{\phi_a(x) \exp\{\mathrm{sim}_a(x, y)\}}{Z_a(y)},$

where $Z_a(y) = \sum_{x' \in \mathcal{D}_a} \phi_a(x') \exp\{\mathrm{sim}_a(x', y)\}$ is a normalization constant and $\mathrm{sim}_a$ is the similarity measure for attribute $a$ (see Section 3.3). Intuitively, this distribution chooses values in proportion to their empirical frequency, while placing more weight on those that are "similar" to $y$. This reflects the notion that distorted values are likely to be close to the truth, as is the case when modeling typographical errors.
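As a concrete illustration, the record-generation steps can be simulated as below. The exponentiated-similarity form of the distortion distribution is our reading of Equation (9), and all helper names, the toy domain and the toy similarity function are ours:

```python
import math
import random

def distorted_value(domain, phi, sim, true_value, rng):
    # Hit-miss distortion: candidate values are weighted by empirical
    # frequency phi(v), up-weighted by similarity to the true value
    # (assumed form: p(v | w) proportional to phi(v) * exp(sim(v, w))).
    weights = [phi[v] * math.exp(sim(v, true_value)) for v in domain]
    return rng.choices(domain, weights=weights, k=1)[0]

def generate_record(entity, theta, eta, domains, phi, sims, rng):
    # entity: true attribute values; theta[a] / eta[a]: distortion and
    # observation probabilities for attribute a.
    record = {}
    for a, w in entity.items():
        z = rng.random() < theta[a]   # distortion indicator
        v = distorted_value(domains[a], phi[a], sims[a], w, rng) if z else w
        o = rng.random() < eta[a]     # observed indicator
        record[a] = v if o else None  # None marks a missing value
    return record

# Toy example (illustrative values).
rng = random.Random(0)
domains = {"name": ["ann", "anna", "bob"]}
phi = {"name": {"ann": 0.4, "anna": 0.3, "bob": 0.3}}
sims = {"name": lambda v, w: 1.0 if v == w else 0.0}
entity = {"name": "ann"}
record = generate_record(entity, {"name": 0.2}, {"name": 0.9},
                         domains, phi, sims, rng)
```

With the distortion probability set to zero and the observation probability set to one, the record reproduces the entity exactly, matching steps (iii)-(v).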
3.3 Attribute similarity measures
We now discuss the attribute similarity measures that appear in the distortion distribution of Equation (9). The purpose of these measures is to quantify the propensity that some value $x$ in the attribute domain is chosen as a distorted alternative to the true value $y$.
Definition (Attribute similarity measure).
Let $\mathcal{D}$ be the domain of an attribute. An attribute similarity measure on $\mathcal{D}$ is a function $\mathrm{sim} : \mathcal{D} \times \mathcal{D} \to [0, s^{\max}]$ that satisfies $\mathrm{sim}(x, y) = \mathrm{sim}(y, x)$ and $\mathrm{sim}(x, x) = s^{\max}$ for all $x, y \in \mathcal{D}$.
Note that our parameterization in terms of attribute similarity measures differs from blink, which uses distance measures. This allows us to make use of a more efficient sampling method, as described in Section 6.3. The next proposition states that the two parameterizations are in fact equivalent, so long as the distance measure is bounded and symmetric (a proof is provided in Appendix B.1).
Proposition 1.
Let $\mathrm{dist}_a$ be the attribute distance measure that appears in blink, and assume that $0 \le \mathrm{dist}_a(x, y) \le d^{\max}_a$ and $\mathrm{dist}_a(x, y) = \mathrm{dist}_a(y, x)$ for all $x, y \in \mathcal{D}_a$. Define the corresponding attribute similarity measure for dblink as

(10)  $\mathrm{sim}_a(x, y) = c_a \left( d^{\max}_a - \mathrm{dist}_a(x, y) \right).$

Then the parameterization of $\tilde{\phi}_a$ used in dblink is equivalent to blink.
In this paper, we restrict our attention to the following similarity measures for simplicity:

Constant similarity measure. This measure is appropriate for categorical attributes, where there is no reason to believe one value $x$ is more likely than any other as a distortion to the true value $y$. Without loss of generality, it may be defined as $\mathrm{sim}_a(x, y) = 0$ for all $x \neq y$.

Normalized edit similarity measure. This measure is based on the edit distance metric, and is suitable for modeling distortion in generic string attributes. Following yujian_normalized_2007, we define a normalized edit distance metric,

$\mathrm{dist}^{\mathrm{NED}}(x, y) = \frac{2\,\mathrm{dist}^{\mathrm{ED}}(x, y)}{|x| + |y| + \mathrm{dist}^{\mathrm{ED}}(x, y)},$

where $\mathrm{dist}^{\mathrm{ED}}$ denotes the regular edit distance and $|x|$ denotes the length of string $x$. Note that alternative definitions of the normalized edit distance could be used (see references in yujian_normalized_2007), however the above definition is unique in that it yields a proper metric. Since the normalized edit distance is bounded on the interval $[0, 1]$, we can define a corresponding normalized edit similarity measure:

(11)  $\mathrm{sim}_a(x, y) = c_a \left( 1 - \mathrm{dist}^{\mathrm{NED}}(x, y) \right).$
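A minimal implementation of the normalized edit distance and the induced similarity might look as follows (the scale constant, here `c`, is set to 1 for illustration):

```python
def edit_distance(s, t):
    # Standard Levenshtein distance via dynamic programming (two rows).
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(s, t):
    # Yujian & Bo normalization: 2*d / (|s| + |t| + d), a proper metric on [0, 1).
    if not s and not t:
        return 0.0
    d = edit_distance(s, t)
    return 2.0 * d / (len(s) + len(t) + d)

def edit_similarity(s, t, c=1.0):
    # Corresponding similarity measure; c is an illustrative scale constant.
    return c * (1.0 - normalized_edit_distance(s, t))
```

For example, "kitten" vs. "sitting" has edit distance 3 and normalized edit distance 2*3/(6+7+3) = 0.375.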
Ideally, one should select attribute similarity measures based on the data at hand. There are many possibilities to consider, such as Jaccard similarity, numeric similarity measures (lesot_similarity_2008) and other domainspecific measures (bilenko_adaptive_2003).
3.4 Posterior distribution and equivalence
Having described the dblink model, we now consider the joint posterior distribution over the latent (unobserved) parameters. By reading the conditional dependence structure from the plate diagram (Figure 1) and marginalizing over the missing record attributes $\boldsymbol{X}^{\mathrm{miss}}$, one can show that the posterior distribution is of the following form:

(12)  $p(\boldsymbol{\Gamma}, \boldsymbol{\Lambda}, \boldsymbol{Z}, \boldsymbol{\theta}, \boldsymbol{Y} \mid \boldsymbol{X}^{\mathrm{obs}}, \boldsymbol{O}) \propto \prod_{e, a} \phi_a(y_{ea}) \prod_{t, a} p(\theta_{ta}) \prod_{t, r} \frac{\mathbb{1}[\lambda_{tr} \in E(\gamma_{tr})]}{E} \prod_{t, r, a} \theta_{ta}^{z_{tra}} (1 - \theta_{ta})^{1 - z_{tra}} \left[ (1 - z_{tra})\, \mathbb{1}[x_{tra} = y_{\lambda_{tr} a}] + z_{tra}\, \tilde{\phi}_a(x_{tra} \mid y_{\lambda_{tr} a}) \right]^{o_{tra}}.$
For further detail on the derivation and an expanded form of the posterior, we refer the reader to Appendix A.
The structure of the posterior for dblink is very similar to the one for blink, apart from the factors for $\gamma_{tr}$ and $\lambda_{tr}$ and the introduction of the "observed" indicator variables $o_{tra}$ (as highlighted in Figure 1). In fact, one can show that dblink is equivalent to blink in the following sense:
Proposition 2.
Suppose the conditions of Proposition 1 hold and that the prior hyperparameters $\alpha_a$ and $\beta_a$ match those of blink for all $a$. Assume furthermore that all record attributes are observed, i.e. $o_{tra} = 1$ for all $t$, $r$, $a$. Then the marginal posterior of $\boldsymbol{\Lambda}$, $\boldsymbol{Z}$, $\boldsymbol{\theta}$ and $\boldsymbol{Y}$ for dblink (i.e. marginalized over $\boldsymbol{\Gamma}$) is identical to the posterior for blink.
A proof is provided in Appendix B.2.
3.5 Rationale for introducing partitions
We now briefly explain the role of the auxiliary partitions in dblink. Firstly, we note that without the partitions (i.e. with $B = 1$), the Markov blanket for $\lambda_{tr}$ includes the attribute values for all of the entities. This presents a major obstacle when it comes to distributing the inference on a compute cluster, as the data is not separable. By incorporating partitions, we restrict the Markov blanket for $\lambda_{tr}$ to include only a subset of the entity attribute values: those in the same partition as record $(t, r)$. As a result, it becomes natural to distribute the inference so that each compute node is responsible for a single partition (see Section 5.2 for details). Secondly, we can interpret the partitions as performing a probabilistic blocking in the context of MCMC sampling (introduced in Section 5), which improves computational efficiency. In a given iteration, the possible links for a record are restricted to the entities residing in the same partition. However, unlike conventional blocking, the partition assignments are not fixed: between iterations the entities and linked records may move between partitions.
4 Partition functions
In Section 3.2 we defined a generic partition function (Equation 2) that is responsible for assigning entities to partitions. This function may be regarded as a free parameter, since it has no bearing on the model equivalence of Proposition 2. However, from a practical perspective it ought to be chosen carefully, as it can impact the efficiency of the inference. We suggest some guidelines for choosing a partition function in Section 4.1, before presenting an example based on k-d trees in Section 4.2.
4.1 Interpretation and guidelines
Recall that the partition function assigns an entity $e$ to a partition according to its attributes $\boldsymbol{y}_e$. Since $\boldsymbol{y}_e$ is unobserved, it must be treated as a random variable over the space of possible attributes $\bigotimes_{a=1}^{A} \mathcal{D}_a$. This means the partition function should not be interpreted as partitioning the entities directly. Rather, it should be interpreted as partitioning the space in which the entities reside, while taking the distribution over $\boldsymbol{y}_e$ into account. With this interpretation in mind, we argue that the partition function should ideally satisfy the following properties:

Balanced weight. The partitions should have equal weight (probability mass) under the distribution over $\boldsymbol{y}_e$. This ensures the expected number of entities assigned to each partition is equal, which results in proper load balancing.

Entity separation. A pair of entities drawn at random from the same partition should have a high degree of similarity, while entities drawn from different partitions should have a low degree of similarity. This improves the likelihood that similar records will end up in the same partition, and allows them to more readily form likely entities.
These properties need not be satisfied strictly: the extent to which they are satisfied is merely expected to improve the efficiency of the inference. For example, satisfying the first property requires knowledge of the marginal posterior distribution over $\boldsymbol{y}_e$, which is infeasible to calculate. We note that there is likely to be tension between the two properties, so that a balance must be struck between them.
4.2 Example: k-d tree partition function
We now describe a partition function based on k-d trees, which is used in our experiments in Section 7.
Background.
A k-d tree is a binary tree that recursively partitions a $k$-dimensional affine space (bentley_multidimensional_1975; friedman_algorithm_1977). In the standard setup, each node of the tree is associated with a data point that implicitly splits the input space into two half-spaces along a particular dimension. Owing to its ability to hierarchically group nearby points, it is commonly used to speed up nearest-neighbor search. This makes a k-d tree a good candidate for a partition function, since it can be balanced while grouping similar points.
Setup.
Our setup differs from a standard k-d tree in several aspects. First, we consider a discrete space (not an affine space), where the $k$ "dimensions" are the attributes. Second, we do not store data points in the tree. We only require that the tree implicitly stores the boundaries of the partitions, so that it can assign an arbitrary $\boldsymbol{y}_e$ to the correct partition (a leaf node). Finally, since we are working in a discrete space, the input space to a node is a countable set. The node must split the input set into two parts based on the values of one of the attributes.
Fitting the tree.
Since it is infeasible to calculate the marginal posterior distribution over $\boldsymbol{y}_e$ exactly, we use the empirical distribution from the tables as an approximation. As a result, we treat the records (tables) as a sample from the distribution over $\boldsymbol{y}_e$, and fit the tree so that it remains balanced with respect to this sample. The depth of the tree determines the number of partitions ($B = 2^{\mathrm{depth}}$).
Achieving balanced splits.
When fitting the tree, each node receives an input set of samples and a rule must be found that splits the set into two roughly equal (balanced) parts based on an attribute. We consider two types of splitting rules: the ordered median and the reference set (see Appendix C). We allow the practitioner to specify an ordered list of attributes to be used for splitting. To ensure balanced splits, we recommend selecting attributes with a large domain. If possible, we recommend preferencing attributes which are known a priori to be reliable (low distortion), as this will reduce the shuffling of entities/records between partitions. In principle, it is possible to automate the process of fitting a tree: one could grow several trees with randomly-selected splits and use the one that is most balanced. We examine balance empirically in Appendix H.
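A stripped-down sketch of the ordered-median variant of this construction is given below. The function names and the dictionary-based tree representation are ours, and only ordinal attributes are handled; the leaf index plays the role of the partition assignment:

```python
def fit_partition_tree(records, attrs, depth):
    # Recursively split a sample of records on the median of successive
    # attributes, so each leaf (partition) holds roughly equal mass.
    if depth == 0 or not attrs or not records:
        return None  # leaf
    a = attrs[0]
    values = sorted(r[a] for r in records)
    split = values[len(values) // 2]
    left = [r for r in records if r[a] < split]
    right = [r for r in records if r[a] >= split]
    return {
        "attr": a, "split": split,
        "left": fit_partition_tree(left, attrs[1:], depth - 1),
        "right": fit_partition_tree(right, attrs[1:], depth - 1),
    }

def assign_partition(tree, entity):
    # Walk from root to leaf, accumulating one bit per level;
    # the resulting leaf index stands in for PartFn(y_e).
    b = 0
    while tree is not None:
        go_right = entity[tree["attr"]] >= tree["split"]
        b = 2 * b + int(go_right)
        tree = tree["right"] if go_right else tree["left"]
    return b
```

On a balanced sample, a depth-1 tree over one attribute sends the lower half of the domain to partition 0 and the upper half to partition 1.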
5 Inference
We now turn to approximating the full joint posterior distribution over the unobserved variables $\boldsymbol{\Gamma}$, $\boldsymbol{\Lambda}$, $\boldsymbol{Z}$, $\boldsymbol{\theta}$ and $\boldsymbol{Y}$, as given in Equation (12). Since it is infeasible to sample from this distribution directly, we use an MCMC algorithm called partially-collapsed Gibbs (PCG) sampling (dyk_partially_2008). In addition, we show how to exploit the conditional independence induced by the partitions to distribute the PCG sampling across multiple threads or machines.
5.1 Partially-collapsed Gibbs sampling
Following the original blink paper (steorts_entity_2015), we initially experimented with regular Gibbs sampling. However, the resulting Markov chains exhibited slow convergence and poor mixing. This is a known shortcoming of Gibbs sampling which may be remedied by collapsing variables and/or updating correlated variables in groups (liu_monte_2004). These ideas form the basis for a framework called partially-collapsed Gibbs (PCG) sampling, a generalization of Gibbs sampling that has "much better convergence properties" (dyk_partially_2008).
Under the PCG framework, variables are updated in groups by sampling from their conditional distributions. These conditional distributions may be taken with respect to the joint posterior (like regular Gibbs), or with respect to marginal distributions of the joint posterior (unique to PCG). The latter case is called trimming and must be handled with care so as not to alter the stationary distribution of the Markov chain.
In applying PCG sampling to dblink, we must decide how to apply the three tools: marginalization (equivalent to grouping), permutation (changing the order of the updates) and trimming (removing marginalized variables). In theory, the convergence rate should improve with more marginalization and trimming, however this must be balanced with the following: (i) whether the resulting conditionals can be sampled from efficiently, and (ii) whether the resulting dependence structure is compatible with our distributed setup (see Section 5.2). We consider two samplers, PCG-I and PCG-II, described below. Of the two, we recommend PCG-I as it is more efficient in our empirical evaluations (see Section 7.1). We nevertheless include the PCG-II sampler because one would expect it to mix better per iteration; however, once computational cost is taken into account, its performance is worse (see Figure 6).
5.1.1 PCG-I sampler
The PCG-I sampler uses regular Gibbs updates for , and for all , and . The conditional distributions for these updates are listed in Appendix D. When updating the entity attributes and the record partition assignments , marginalization and trimming are used. Specifically, we apply marginalization by jointly updating and (the set of ’s and ’s for records linked to entity ). We then trim (analytically integrate over) .
We shall now derive this update. Referring to Equation (12), the joint posterior of , conditioned on the other parameters has the form
where superscript denotes exclusion of any records (those currently linked to entity ). Substituting the distributions and trimming yields
(13) 
where
and
Note that the update for is deterministic, conditional on and .
Since we have applied trimming, we must permute the updates so that the trimmed variables are not conditioned on in later updates. This means the updates for and must come after the updates for and , but before the updates for .
5.1.2 PCG-II sampler
The PCG-II sampler is identical to PCG-I, except that it replaces the regular Gibbs update for with an update that marginalizes and trims . To derive the distribution for this update, we first consider the joint posterior of and conditioned on the other parameters:
where superscript denotes exclusion of record . Substituting the distributions and trimming yields
(14) 
Table 2: Update groups and their dependencies for the PCG-I sampler (summarizing the conditional distributions derived above and in Appendix D).
5.2 Distributing the sampling
By examining the conditional distributions derived in the previous section and those listed in Appendix D, one can show that the updates for the variables associated with entities and records (, , and ) only depend on variables associated with entities and records assigned to the same partition (excluding ). These dependencies are summarized in Table 2 for the PCG-I sampler. The distortion probability is an exception—it is not associated with any partition and may depend on ’s across all partitions.
This dependence structure—in particular, the conditional independence of entities and records across partitions—makes the PCG sampling amenable to distributed computing. As such, we propose a manager-worker architecture where:

the manager is responsible for storing and updating variables not associated with any partition (i.e. ); and

each worker represents a partition, and is responsible for storing and updating variables associated with the entities and records assigned to it.
The manager/workers may be processes running on a single machine or on machines in a cluster. If using a cluster, we recommend that the nodes be tightly coupled, as frequent communication between them is required.
Figure 2 depicts a single iteration of PCG sampling using our proposed manager-worker architecture. Of the four steps depicted, steps 2 and 3 are the most computationally intensive, so we can potentially achieve a significant speedup by distributing them across the workers. The communication cost is likely to be greatest in step 3, where the entities and linked records are shuffled to their newly-assigned partitions. A well-chosen partition function can minimize this cost: if similar records/entities are co-partitioned, there should be little movement between partitions. It is important to note that Figure 2 constitutes a single iteration of PCG sampling, or more precisely, a single application of the Markov transition operator. To generate a Markov chain of length , we begin with some initial state and iteratively apply the transition operator times.
missingFNUM@SECTION Computational efficiency considerations
6.1 Efficient pruning of candidate links
In this section, we describe a trick aimed at improving the computational efficiency of the Gibbs update for (used in the Gibbs and PCG-I samplers). This particular trick does not apply to the joint PCG update for and (used in the PCG-II sampler).
Consider the conditional distribution for the update in Equation (S5) of Appendix D:
(15) 
The support of this distribution is the set of candidate links for record , which we denote by . Looking at the first indicator function above, we see that , i.e. the candidate links are restricted to the entities in the same partition as record . Thus, a naïve sampling approach for this distribution takes time.
We can improve upon the naïve approach by exploiting the fact that is often considerably smaller than . To see why this is the case, note that the second indicator function in Equation (15) further restricts if any of the distortion indicators for the observed record attributes are zero. Specifically, if and , cannot contain any entity whose th attribute does not match the record’s th attribute . This implies is likely to be small in the case of low distortion.
Putting aside the computation of for the moment, this means we can reduce the time required to update to . To compute efficiently, we propose maintaining an inverted index over the entity attributes within each partition. Specifically, the index for the th attribute in partition should accept a query value and return the set of entities that match on :
(16) 
Once the index is constructed, we can efficiently retrieve the set of candidate links for record by computing a multiple set intersection:
(17) 
This assumes at least one of the observed record attributes is not distorted. Otherwise .
Since the sizes of the sets are likely to vary significantly, we advise computing the intersection iteratively in increasing order of size. That is, we begin with the smallest set and retain the elements that are also in the next-largest set, and so on. With a hash-based set implementation, this scales linearly in the size of the first (smallest) set.
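The inverted index of Equation (16) and the smallest-first multiple set intersection of Equation (17) can be sketched in Python as follows. This is a minimal illustration, not the dblink implementation; the function names `build_inverted_index` and `candidate_links` are our own, and `None` stands in for a missing attribute value.

```python
from collections import defaultdict

def build_inverted_index(entities):
    """Build one inverted index per attribute: value -> set of entity ids.

    `entities` maps an entity id to its list of attribute values
    (None denotes a missing value).
    """
    num_attrs = len(next(iter(entities.values())))
    index = [defaultdict(set) for _ in range(num_attrs)]
    for eid, values in entities.items():
        for a, v in enumerate(values):
            if v is not None:
                index[a][v].add(eid)
    return index

def candidate_links(record, index):
    """Retrieve candidate entities by intersecting posting sets smallest-first."""
    postings = [index[a].get(v, set())
                for a, v in enumerate(record) if v is not None]
    if not postings:
        return set()
    postings.sort(key=len)          # start from the smallest posting set
    result = set(postings[0])
    for s in postings[1:]:
        result &= s                 # hash-based intersection: O(|result|)
        if not result:
            break                   # early exit once empty
    return result
```

The early exit mirrors the observation in the text: under low distortion, the smallest posting set is tiny, so the whole intersection costs little.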
6.2 Caching and truncation of attribute similarities
We have not yet emphasized that the updates for , and depend on the attribute similarities between pairs of values in the attribute domains. Specifically, for each attribute , we need to access the indexed set . These similarities may be expensive to evaluate on the fly, so we cache the results in memory on the workers.
To manage the quadratic scaling of , and in anticipation of another trick introduced in Section 6.3, we transform the similarities so that those below a cutoff are regarded as completely disagreeing. We achieve this by applying the following truncation transformation to the raw attribute similarity :
(18) 
as illustrated in Figure 3.
Whenever a raw attribute similarity is called for, we replace it with this truncated version. Only pairs of values with positive truncated similarity are stored in the cache—those not stored in the cache have a truncated similarity of zero by default. Note that attributes with a constant similarity function are treated specially—there is no need to cache the index set of similarities, since they are all identical.
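A sketch of this scheme follows. The precise form of the truncation transformation is given in Equation (18); here we assume, for illustration only, that it zeroes similarities at or below the cutoff and linearly rescales the remainder, and we store only the positive truncated values, as described above. All names are ours.

```python
def truncate_similarity(sim, cutoff, max_sim=1.0):
    """Truncation transformation (assumed form): zero at or below `cutoff`,
    linear rescaling of the remainder onto (0, max_sim]."""
    if sim <= cutoff:
        return 0.0
    return max_sim * (sim - cutoff) / (max_sim - cutoff)

def build_similarity_cache(domain, raw_sim, cutoff):
    """Cache only value pairs with positive truncated similarity.

    Pairs absent from the cache have truncated similarity zero by default,
    so cache size stays far below the quadratic worst case.
    """
    cache = {}
    for i, u in enumerate(domain):
        for v in domain[i + 1:]:
            t = truncate_similarity(raw_sim(u, v), cutoff)
            if t > 0.0:
                cache[(u, v)] = cache[(v, u)] = t
    return cache
```

Lookups then reduce to `cache.get((u, v), 0.0)`, which is what makes the perturbation-support trick of Section 6.3 cheap.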
It is important to acknowledge that the truncated similarities are an approximation to the original model. We claim that the approximation is reasonable on the following grounds:

Low loss of information. Below a certain cutoff, the attribute similarity function is unlikely to encode much useful information for modeling the distortion process. For example, the fact that whereas , doesn’t necessarily suggest that “Chiu” is more likely than “Chen” as a distorted alternative to “Smith”.

Precedent. In the record linkage literature, value pairs with similarities below a cutoff are regarded as completely disagreeing (winkler_methods_2002; enamorado_using_2017).

Efficiency gains. As we shall soon see in Section 6.3, we can perform the combined , , update more efficiently by eliminating pairs below the cutoff from consideration.
6.3 Fast updates of entity attributes using perturbation sampling
We now present a novel sampling algorithm that allows us to efficiently perform the PCG update for and . The algorithm relies on the observation that the conditional distribution for can be expressed as a mixture over two components:

a base distribution over which is ideally constant for all entities; and

a perturbation distribution which varies for each entity, but has a much smaller support than .
With this representation, we can avoid computing and sampling from the full distribution over , which varies for each update. Rather, we only need to compute the perturbation distribution over a much smaller support, and then sample from the mixture, which can be done efficiently using the Vose-Alias method (vose_linear_1991). We refer to this algorithm as perturbation sampling.
6.3.1 Perturbation sampling
Although we’re interested in applying perturbation sampling to a specific conditional distribution, we describe the idea in generality below.
Consider a target probability mass function (pmf) with finite support , which varies as a function of parameters . In general, one must recompute the probability tables to draw a new variate whenever changes—a computation that takes time. However, if the dependence on is of a certain restricted form, we show that it is possible to achieve better scalability by expressing the target as a mixture. This is made precise in the following result.
Proposition 3.
Let be a pmf with finite support , which depends on parameters . Suppose there exists a “base” pmf over which is independent of and a nonnegative bounded perturbation term , such that can be factorized as . Then can be expressed as a mixture over the base pmf and a “perturbation” pmf over as follows:
(19) 
where .
Proof.
The result is straightforward to verify by substitution. ∎
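The substitution can be written out explicitly. Using our own notation (target $f$, base $g$, perturbation term $\varepsilon$, support $V$, parameters $\theta$), since the original symbols do not survive above:

```latex
\begin{align*}
f(v \mid \theta) &= \frac{g(v)\,\bigl(1 + \varepsilon(v;\theta)\bigr)}{Z(\theta)},
\qquad
Z(\theta) = \sum_{v' \in V} g(v')\,\bigl(1 + \varepsilon(v';\theta)\bigr)
          = 1 + c(\theta),\\
c(\theta) &= \sum_{v' \in V} g(v')\,\varepsilon(v';\theta),
\qquad\text{hence}\qquad
f(v \mid \theta) = \frac{1}{1 + c(\theta)}\, g(v)
  \;+\; \frac{c(\theta)}{1 + c(\theta)} \cdot
        \frac{g(v)\,\varepsilon(v;\theta)}{c(\theta)}.
\end{align*}
```

The second term is the perturbation pmf, normalized by construction, and the mixture weights $1/(1+c)$ and $c/(1+c)$ sum to one.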
Algorithm S1 (in Appendix E) shows how to apply this result to draw random variates from a target pmf. Briefly, it consists of three steps: (i) the perturbation pmf and its normalization constant are computed; (ii) a biased coin is tossed to choose between the base pmf and the perturbation pmf ; and (iii) a random variate is drawn from the selected pmf. If is selected, a pre-initialized Alias sampler is used to draw the random variate (reused for all ). Otherwise, if is selected, a new Alias sampler is instantiated. The result below states the time complexity of this algorithm (see Appendix E for a proof).
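A minimal Python sketch of this procedure is given below. It is illustrative only: the class and function names are ours, and for the (small) perturbation component we draw with `random.choices` rather than instantiating a fresh Alias sampler, which does not change the asymptotics since the perturbation support is small.

```python
import random

class AliasSampler:
    """Vose's alias method: O(n) setup, O(1) per draw from a fixed discrete pmf."""
    def __init__(self, weights, rng=random):
        n = len(weights)
        total = sum(weights)
        scaled = [w * n / total for w in weights]
        self.prob, self.alias, self.rng = [0.0] * n, [0] * n, rng
        small = [i for i, p in enumerate(scaled) if p < 1.0]
        large = [i for i, p in enumerate(scaled) if p >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            self.prob[s], self.alias[s] = scaled[s], l
            scaled[l] -= 1.0 - scaled[s]
            (small if scaled[l] < 1.0 else large).append(l)
        for i in small + large:
            self.prob[i] = 1.0

    def sample(self):
        i = self.rng.randrange(len(self.prob))
        return i if self.rng.random() < self.prob[i] else self.alias[i]

def perturbation_sample(base_sampler, base_pmf, epsilon, rng=random):
    """Draw from p(v) proportional to base_pmf(v) * (1 + eps(v)), where eps is
    nonzero only on the small support of the dict `epsilon`.

    The base component reuses the pre-initialized `base_sampler` (O(1) per
    draw); only the perturbation weights are recomputed per call.
    """
    pert_weights = {v: base_pmf(v) * e for v, e in epsilon.items()}
    c = sum(pert_weights.values())       # normalization constant of perturbation pmf
    if rng.random() < c / (1.0 + c):     # coin toss: weight c/(1+c) on perturbation
        values = list(pert_weights)
        return rng.choices(values, weights=[pert_weights[v] for v in values])[0]
    return base_sampler.sample()
```

For example, with a uniform base over three values and a perturbation of 3.0 on value 0, the target places probability 2/3 on value 0 and 1/6 on each of the others.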
Proposition 4.
This is a promising result, since the size of the perturbation support is typically of order 10 for our application, while the size of the full support may be as large as . Hence, we expect a significant speedup over the naïve approach.
6.3.2 Application of perturbation sampling
We now return to our original objective: performing the joint PCG update for and . Referring to Equation (13), we can express the conditional distribution for (i.e. the target distribution) as
(20) 
The base distribution is given by
(21) 
where is the number of records linked to entity with observed values for attribute ; and the perturbation term is given by
(22) 
The full support of the target pmf is , while the perturbation support is given by
In words, this set consists of the observed values for attribute in the records linked to entity , plus any sufficiently similar values from the attribute domain (those with nonzero truncated similarity). The size of the perturbation set depends on the cutoff used in the truncation transformation: the higher the cutoff, the smaller the set. There is thus a trade-off between efficiency (which favors a higher cutoff and a smaller perturbation set) and accuracy (which favors a lower cutoff).
Remark.
The astute reader may have noticed that the base distribution given in Equation (21) is not completely independent of the conditioned parameters, as required by Proposition 3. In particular, depends on —roughly the size of entity . Fortunately, we expect the range of regularly-encountered entity sizes to be small, so we sacrifice some memory by instantiating multiple Alias samplers, one for each in some expected range. In the worst case, when a value is encountered outside the expected range and the base distribution is required (unlikely, since the weight on the base component is typically small), we instantiate the base distribution on the fly (at the same asymptotic cost as the naïve approach).
missingFNUM@SECTION Empirical evaluation
Data set      # records  # files  # entities  # attributes (categorical, string)
ABSEmployee  600,000  3  400,000  4  0 
NCVR  448,134  2  296,433  3  3 
NLTCS  57,077  3  34,945  6  0 
SHIW0810  39,743  2  28,584  8  0 
RLdata10000  10,000  1  9,000  2  3 
We evaluated dblink using two synthetic and three real data sets, as summarized in Table 3. All results presented here were obtained using our local server in pseudo-cluster mode, however some were replicated on a cluster in the Amazon public cloud (see Appendix G) to test the effect of higher communication costs. Further details about the data sets, hardware, implementation and parameter settings are provided in Appendix F.
7.1 Computational and sampling efficiency
Following turek_efficient_2016, we measured the efficiency of dblink using the rate of effective samples produced per unit time (ESS rate), which balances sampling efficiency (related to mixing/autocorrelation) and computational efficiency. We used the mcmcse R package (mcmcse) to compute the effective sample size (ESS), which implements a multivariate method proposed by vats_multivariate_2015.
Since the number of variables in the model is unwieldy (there are at least unobserved variables), we computed the ESS for the following summary statistics:

the number of observed entities (scalar);

the aggregate distortion for each attribute (vector); and

the cluster size distribution (a vector containing the frequencies of 0-clusters, 1-clusters, 2-clusters, etc.).
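The paper computes ESS with the multivariate method of vats_multivariate_2015 via the mcmcse R package; for intuition only, the sketch below computes a univariate ESS for one such summary statistic using the simpler batch-means estimator of the asymptotic variance, and the corresponding ESS rate. This is our own simplification, not the method used in the experiments.

```python
import math
import random  # used only in the demo below

def ess_batch_means(chain, batch_size=None):
    """Univariate effective sample size via batch means.

    ESS = n * lambda^2 / sigma^2, where lambda^2 is the sample variance and
    sigma^2 is the batch-means estimate of the long-run (asymptotic) variance.
    """
    n = len(chain)
    b = batch_size or int(math.sqrt(n))   # common default: sqrt(n) batches
    m = n // b
    mu = sum(chain) / n
    lam2 = sum((x - mu) ** 2 for x in chain) / (n - 1)
    batch_means = [sum(chain[i * b:(i + 1) * b]) / b for i in range(m)]
    sigma2 = b * sum((bm - mu) ** 2 for bm in batch_means) / (m - 1)
    return n * lam2 / sigma2

def ess_rate(chain, wall_clock_seconds):
    """Effective samples produced per second, the efficiency measure used here."""
    return ess_batch_means(chain) / wall_clock_seconds
```

For an i.i.d. chain the estimate is close to the nominal sample size; autocorrelation drives it down, which is exactly what the ESS rate penalizes.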
dblink versus blink.
We compared dblink (using the PCG-I sampler) to our own implementation of blink (i.e. a Gibbs sampler without any of the tricks described in Section 6). For a fair comparison, we switched off partitioning in dblink. We used the relatively small RLdata10000 data set, as blink cannot cope with larger data sets. Figure 4 contains trace plots for two summary statistics as a function of running time. It is evident that blink has not converged to the equilibrium distribution within the allotted time of 11 hours, while dblink converges to equilibrium in 100 seconds. Looking solely at the time per iteration, dblink is at least 200× faster than blink.
Partitioning and efficiency.
We tested the effect of varying the number of partitions on the efficiency of dblink. For each value of , we computed the ESS rate averaged over 3000 iterations. We used the NLTCS data set and the PCG-I sampler. Figure 5 presents the results in terms of the speedup relative to the ESS rate for . We observe a near-linear speedup in , with the exception of . The speedup is expected to taper off with increasing numbers of partitions, as parallel gains in efficiency are overcome by losses due to communication costs and/or poorer mixing. This tipping point seems difficult to predict for a given setup, as it depends on complex factors such as the data distribution, the splitting rules used, and the hardware characteristics.
Sampling methods and efficiency.
We evaluated the efficiency of the three samplers introduced in Section 5.1 (Gibbs, PCG-I and PCG-II). As above, we computed the ESS rate as an average over 3000 iterations. We set and used the NLTCS data set. The results, shown in Figure 6, indicate that the PCG-I sampler is considerably more efficient (by a factor of 1.5–2) than the baseline Gibbs sampler for this data set. We also observe that the PCG-II sampler performs quite poorly in comparison: between 20–30× slower than the Gibbs sampler. This is because the marginalization and trimming in the update for PCG-II prevent us from applying the trick described in Section 6.1. Thus, although PCG-II is expected to be more efficient in terms of reducing autocorrelation, it is less efficient overall, as each iteration is too computationally expensive.
7.2 Linkage quality
Though not our primary focus, we assessed the performance of dblink in terms of its predictions for the linkage structure (the matching step) for the data sets in Table 3. This was not previously possible with blink, as it could only scale to small data sets of around 1000 records (steorts_entity_2015).
Point estimate methodology.
To evaluate the matching performance of dblink with respect to the ground truth, we extracted a point estimate of the linkage structure from the posterior using the shared most probable maximal matching sets (sMPMMS) method (steorts_bayesian_2016). This method circumvents the problem of label switching (jasra_markov_2005)—where the identities of the entities do not remain constant along the Markov chain.
The sMPMMS method involves two main steps. In the first step, the most probable entity cluster is computed for each record based on the posterior samples. In general, these entity clusters will conflict with one another—e.g. the most probable entity cluster for might be while for it is . The second step resolves these conflicts by assigning precedence to links between records and their most probable entity clusters. The result is a globallyconsistent estimate of the linkage structure—i.e. it satisfies transitivity.
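A simplified sketch of these two steps follows. This is our own reading of the sMPMMS procedure, not the implementation of steorts_bayesian_2016: step 1 tallies, for each record, its most frequent cluster (as a set of records) across posterior samples; step 2 here simply links records whose most probable clusters coincide, which yields a transitive partition. The real method resolves conflicts more carefully, by precedence.

```python
from collections import Counter

def smpmms_point_estimate(posterior_samples):
    """Point estimate of the linkage structure from posterior clusterings.

    `posterior_samples` is a list of dicts, each mapping record id -> entity
    label for one MCMC sample (labels need not be consistent across samples,
    which is the label-switching problem this method sidesteps).
    """
    records = posterior_samples[0].keys()
    # Step 1: most probable entity cluster (as a frozenset of records) per record.
    cluster_counts = {r: Counter() for r in records}
    for clustering in posterior_samples:
        members = {}
        for r, e in clustering.items():
            members.setdefault(e, set()).add(r)
        for r, e in clustering.items():
            cluster_counts[r][frozenset(members[e])] += 1
    most_probable = {r: cluster_counts[r].most_common(1)[0][0] for r in records}
    # Step 2 (simplified): link records sharing an identical most probable
    # cluster; the result is a partition, hence transitive by construction.
    estimate = {}
    for r, cluster in most_probable.items():
        estimate.setdefault(cluster, set()).add(r)
    return [sorted(c) for c in estimate.values()]
```

Because the output is a partition of the records, it is globally consistent in the sense described above.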
We distributed the computation of the sMPMMS method in the Spark framework. We used approximate posterior samples derived from a Markov chain of length by discarding the first iterations as burn-in (for NCVR, we applied a burn-in of 60k iterations, as it was slow to converge) and applying a thinning interval of 10. These parameters were chosen by inspection of trace plots, some of which are reported in Appendix K. In addition to the point estimates reported here, we also examine full posterior estimation in Appendix K.
Baseline methods.
We compared the linkage quality of dblink with three baseline methods, as described below. We focused on scalable methods, as we assumed little to no training data was available.
Data set     Method                Precision  Recall  F1-score  ARI     Err. # clust.
ABSEmployee  dblink                0.9763     0.8530  0.9105    0.9105  +1.667%
             Fellegi-Sunter (10)   0.9963     0.8346  0.9083    —       —
             Fellegi-Sunter (100)  0.9963     0.8346  0.9083    —       —
             Near Matching         0.0378     0.9930  0.0728    —       —
             Exact Matching        0.9939     0.8346  0.9074    0.9074  +9.661%
NCVR         dblink                0.9145     0.9653  0.9392    0.9392  –3.562%
             Fellegi-Sunter (10)   0.9868     0.7874  0.9083    —       —
             Fellegi-Sunter (100)  0.9868     0.7874  0.9083    —       —
             Near Matching         0.9899     0.7443  0.8497    —       —
             Exact Matching        0.9925     0.0017  0.0034    0.0034  +51.09%
NLTCS        dblink                0.8319     0.9103  0.8693    0.8693  –22.09%
             Fellegi-Sunter (10)   0.9094     0.9087  0.9090    —       —
             Fellegi-Sunter (100)  0.9094     0.9087  0.9090    —       —
             Near Matching         0.0600     0.9563  0.1129    —       —
             Exact Matching        0.8995     0.9087  0.9040    0.9040  +2.026%
SHIW0810     dblink                0.2514     0.5396  0.3430    0.3429  –37.65%
             Fellegi-Sunter (10)   0.0028     0.9050  0.0056    —       —
             Fellegi-Sunter (100)  0.0025     0.9161  0.0050    —       —
             Near Matching         0.0043     0.9111  0.0086    —       —
             Exact Matching        0.1263     0.7608  0.2166    0.2166  –37.40%
RLdata10000  dblink                0.6334     0.9970  0.7747    0.7747  –10.97%
             Fellegi-Sunter (10)   0.9957     0.6174  0.7622    —       —
             Fellegi-Sunter (100)  0.9364     0.8734  0.9038    —       —
             Near Matching         0.9176     0.9690  0.9426    —       —
             Exact Matching        1.0000     0.0080  0.0159    0.0159  +11.02%

Exact Matching. Links records that match on all attributes. It is unsupervised and ensures transitivity.

Near Matching. Links records that match on at least attributes. It is unsupervised, but does not guarantee transitivity.

Fellegi-Sunter. Links records according to a pairwise match score that is a weighted sum of attribute-level dis/agreements. The weights are specified by the Fellegi-Sunter model (fellegi_theory_1969) and were estimated using the expectation-maximization algorithm, as implemented in the RecordLinkage R package (sariyar_recordlinkage_2010). We chose the threshold on the match score to optimize the F1-score using a small amount of training data (of size 10 or 100), which makes the method semi-supervised. Note that the training data was sampled in a biased manner to deal with the imbalance between matches and non-matches (half with match scores above zero and half below). The method does not guarantee transitivity.
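The two unsupervised baselines are simple enough to state in a few lines of Python. The sketch below is ours (the threshold parameter `min_agree` for Near Matching is a stand-in for the exact agreement count used in the experiments, which is elided above); note that only Exact Matching yields a transitive result.

```python
def exact_matching(records):
    """Link records agreeing on every attribute (transitive by construction)."""
    groups = {}
    for rid, attrs in records.items():
        groups.setdefault(tuple(attrs), set()).add(rid)
    return [sorted(g) for g in groups.values()]

def near_matching_pairs(records, min_agree):
    """Link record pairs agreeing on at least `min_agree` attributes.

    Pairwise only: the result need not be transitive.
    """
    ids = sorted(records)
    pairs = []
    for i, r in enumerate(ids):
        for s in ids[i + 1:]:
            agree = sum(a == b for a, b in zip(records[r], records[s]))
            if agree >= min_agree:
                pairs.append((r, s))
    return pairs
```

Exact Matching returns a partition (so cluster measures apply), whereas Near Matching returns raw pairs, which is why only pairwise measures are reported for it in Table 4.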
Results.
Table 4 presents performance measures categorized by data set and method. The pairwise performance measures (precision, recall and F1-score) are provided for all methods; however, the cluster performance measures (adjusted Rand index, see vinh_information_2010, and percentage error in the number of clusters) are only valid for methods that guarantee transitivity (dblink and Exact Matching). Despite being fully unsupervised, dblink achieves competitive performance when compared to the semi-supervised Fellegi-Sunter method. The two simple baselines, Near Matching and Exact Matching, are acceptable for data sets with low noise but perform poorly otherwise (e.g. NCVR and RLdata10000). We conducted an empirical sensitivity analysis for dblink with respect to variations in the hyperparameters. The results for RLdata10000 (included in Appendix J) show that dblink is somewhat sensitive to all of the hyperparameters tested; however, the sensitivity is in general predictable, following clear and intuitive trends. One interesting observation is that dblink tends to overestimate the amount of distortion. This is perhaps not surprising given the absence of ground truth.
missingFNUM@SECTION Concluding remarks
We have proposed dblink: a distributed, scalable extension of the blink model for unsupervised Bayesian ER. Our key insight was an auxiliary variable representation of the model that induces a partitioning over the entities and records. This integrates a kind of “blocking step” into the model, which improves scalability and enables distributed inference, while allowing blocking and ER errors to propagate. We also contributed several ideas for improving the efficiency of inference, including the use of partially-collapsed Gibbs sampling, inverted indices for updating links, and a novel perturbation sampling algorithm for updating entity attributes. Our experiments show that dblink can achieve significant efficiency gains compared to blink. With additional computing resources, we expect dblink could be scaled to databases with 1–10 million records.
SUPPLEMENTAL MATERIALS
 Code:

An implementation of dblink built on the Apache Spark distributed computing framework (Zip file). The source code is also hosted at https://github.com/ngmarchant/dblink.
 Data:

An archive containing data sets that we have permission to redistribute (Zip file).
ACKNOWLEDGEMENTS
N. Marchant acknowledges the support of an Australian Government Research Training Program Scholarship and the AMSI Intern program hosted by the Australian Bureau of Statistics. R. C. Steorts and A. Kaplan acknowledge the support of NSF grants SES-1534412 and CAREER-1652431. B. Rubinstein acknowledges the support of Australian Research Council grant DP150103710. N. Marchant and B. Rubinstein also acknowledge the support of Australian Bureau of Statistics project ABS2018.363.
References
Appendices for “dblink: Distributed End-to-End Bayesian Entity Resolution”
Neil G. Marchant^{a} Rebecca C. Steorts^{b} Andee Kaplan^{b} Benjamin I. P. Rubinstein^{a} Daniel N. Elazar^{c}
^{a}School of Computing and Information Systems, University
of Melbourne
^{b}Department of Statistical Science, Duke University
^{c}Methodology Division, Australian Bureau of Statistics
September 16, 2019
Appendix A Derivation of the posterior distribution
Here we sketch the derivation of the joint posterior distribution over the unobserved variables conditioned on the observed record attributes , which is given in Equation (12) of the paper. First we read the factorization off the plate diagram in Figure 1, together with the conditional dependence assumptions detailed in Section 3.2 of the paper. We obtain the following expression, up to a normalisation constant:
Ideally, we would like to marginalize out all variables except and (the variables of interest); however, this is not analytically tractable. Fortunately, we can marginalize out the missing record attributes , which yields Equation (12) from the paper:
We can expand this further by substituting the conditional distributions given in Section 3.2 of the paper. This yields:
(S1) 
Appendix B Equivalence of dblink and blink
In this section, we present proofs of Propositions 1 and 2, which show that the inferences we obtain from dblink are equivalent to those we would obtain from blink under certain conditions.
B.1 Proof of Proposition 1: equivalence of distance/similarity representations
It is straightforward to show that as defined in Equation (10) of the paper satisfies the requirements of Definition 3.3. All that remains is to show that the two parameterizations of the distortion distribution are equivalent. Beginning with as parameterized in blink, we substitute Equation (10) and observe that
This is identical to our parameterization in Equation (9). ∎
B.2 Proof of Proposition 2: equivalence of dblink and blink
Given that

Proposition 1 holds,

the distortion hyperparameters are the same for all attributes, and

all record attributes are observed,
the only factor in the posterior that differs from blink is:
(S2) 
Substituting the density for the conditional distributions for a single factor yields:
Putting this in Equation (S2) and marginalizing over we obtain: