Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases
Accurate and efficient entity resolution is an open challenge of particular relevance to intelligence organisations that collect large datasets from disparate sources with differing levels of quality and standard. Starting from a first-principles formulation of entity resolution, this paper presents a novel Entity Resolution algorithm that introduces a data-driven blocking and record-linkage technique based on the probabilistic identification of entity signatures in data. The scalability and accuracy of the proposed algorithm are evaluated using benchmark datasets and shown to achieve state-of-the-art results. The proposed algorithm can be implemented simply on modern parallel databases, which allows it to be deployed with relative ease in large industrial applications.
Kee Siong Ng
AUSTRAC / ANU
Australian National University
Entity resolution (ER) is the process of identifying records that refer to the same real-world entity. Accurate and efficient ER is needed in various data-intensive applications, including but not limited to health studies, fraud detection, and national censuses [?]. More specifically, ER plays a pivotal role in the context of Australia’s whole-of-government approach to tackle our most pressing social issues—including terrorism and welfare fraud—by combining and analysing datasets from multiple government agencies.
Two typical challenges in entity resolution are imperfect data quality and large data size. Common data quality issues that can introduce ambiguity in the ER process include:
Incompleteness: Records with incomplete attribute values or even missing attribute values.
Incompatible formats: The formats of names, addresses, dates, numbers, etc., can be different between countries, regions, and languages.
Errors: Records containing wrong information due to either user or system errors, or deliberate attempts at obfuscation are widely seen in databases.
Timeliness: Another very common source of error is records that have become outdated due to poor maintenance or data refresh practices, such as people changing their name or address.
In databases containing upwards of tens to hundreds of millions of records, ER can also be challenging because exhaustively comparing records in a pairwise manner is computationally infeasible [?]. In fact, any ER algorithm having a time complexity worse than linear is prohibitive on large databases.
In this paper, we present a simple and scalable ER algorithm that addresses the challenges of performing ER on poor quality and high volume data. The key ideas behind our proposed approach are described next.
Using Redundancy to Overcome Data Quality Issues. The most common way to tackle data quality issues is to standardise and cleanse raw data before the linking operation [?]. Standardisation and cleansing are umbrella terms covering operations which can fill in incomplete data, unify inconsistent formats, and remove errors in data, i.e., operations addressing all kinds of data quality issues.
The problem with standardisation and cleansing is that it is in itself a challenging problem. For example, 01/02/2000 can be parsed as either 1st of Feb 2000 or 2nd of Jan 2000. St can mean either Street or Saint in addresses. If a mistake is made during standardisation and cleansing, it is usually difficult to recover from it to perform linkage correctly.
Instead of standardising and cleansing data into canonical forms, we rely on redundancy in data to overcome data quality issues. We say a record contains redundancy if one of its subrecords can uniquely identify the same entity. For example, if there is only one John Smith living in Elizabeth Street, then John Smith, 45 Elizabeth Street as a record of a person contains redundancy, because specifying street number 45 is not really necessary.
Redundancy exists widely in data. Not every country has a city named Canberra. Not every bank has a branch in Bungendore. As an extreme case, three numbers 23 24 5600 might be sufficient to specify an address globally, if there is only one address in the world containing these three numbers at the same time. In this case, we do not even need to know if 23 is a unit number or the first part of a street number. Perhaps counter-intuitively, such seemingly extreme examples are actually quite common in practice. For example, among the Australian addresses stored in the Open Address database (https://openaddresses.io/), of them can be uniquely identified by three numbers in them.
Redundancy simplifies ER. If two records share a common subrecord that can be used to uniquely identify an entity, then these two records can be linked no matter what data quality issues they each have. We call such a subrecord a signature of its entity. Probabilistic identification of signatures in data and linking records using such probabilistic signatures is the first key idea of our algorithm.
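The three-number address example can be made concrete with a minimal Python sketch. It assumes, purely for illustration, that the ordered tuple of numbers in an address acts as its signature whenever at least three numbers are present; the sample addresses, the function names `number_key` and `link_by_number_key`, and the `min_numbers` threshold are hypothetical, and the sketch does not reproduce the probabilistic identification of signatures developed later in the paper.

```python
import re
from collections import defaultdict


def number_key(address):
    """Return the numbers in an address, in order, as a tuple of strings."""
    return tuple(re.findall(r"\d+", address))


def link_by_number_key(addresses, min_numbers=3):
    """Group record ids by their number key, skipping keys with too few numbers."""
    groups = defaultdict(list)
    for rid, address in enumerate(addresses):
        key = number_key(address)
        if len(key) >= min_numbers:  # assumed rule: short keys are not distinctive
            groups[key].append(rid)
    return dict(groups)


# Hypothetical records: the first two describe the same address in
# different formats, the third is a different address.
addresses = [
    "Unit 23, 24 Elizabeth Street, 5600",
    "23/24 Elizabeth St 5600",
    "7 Elizabeth Street 5600",
]
groups = link_by_number_key(addresses)
print(groups)
```

The first two records are grouped together although one abbreviates Street and the other spells out the unit, because neither difference touches the shared subrecord. No standardisation step was needed to decide what each number means.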
Data-Driven Blocking using Signatures. Blocking is a widely used technique to improve ER efficiency [?]. Naïvely, linking two databases containing $m$ and $n$ records respectively requires $m \times n$ record pair comparisons. Most of these comparisons lead to non-matches, i.e. they correspond to two records that refer to different entities. To reject these non-matches at a lower cost, one may first partition the raw records according to criteria selected by a user. These criteria are called blocking keys [?]. Examples of blocking keys include attributes such as first and last name, postcode, and so on. During linkage, comparison is only carried out between records that fall into the same partition, based on the assumption that records sharing no blocking keys do not match with each other.
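Standard key-based blocking as just described can be sketched in a few lines of Python. The sketch assumes a single dataset that is deduplicated against itself rather than two separate databases, and uses postcode as the user-selected blocking key; the records and the helper names `block_records` and `candidate_pairs` are illustrative only.

```python
from collections import defaultdict
from itertools import combinations


def block_records(records, blocking_key):
    """Partition record ids by the value of a user-chosen blocking key."""
    blocks = defaultdict(list)
    for rid, record in enumerate(records):
        blocks[blocking_key(record)].append(rid)
    return blocks


def candidate_pairs(blocks):
    """Yield only the record pairs that fall into the same block."""
    for ids in blocks.values():
        yield from combinations(ids, 2)


# Hypothetical records for illustration.
records = [
    {"name": "John Smith", "postcode": "2600"},
    {"name": "Jon Smith", "postcode": "2600"},
    {"name": "Mary Jones", "postcode": "2621"},
    {"name": "M. Jones", "postcode": "2621"},
]
blocks = block_records(records, lambda r: r["postcode"])
pairs = sorted(candidate_pairs(blocks))
all_pairs = len(records) * (len(records) - 1) // 2
print(pairs, "instead of", all_pairs, "exhaustive comparisons")
```

Here two within-block comparisons replace the six exhaustive ones. If a postcode in one of the records were wrong, however, the corresponding true match would never be compared.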
The efficiency and completeness of ER are largely determined by blocking-key selection, which again is challenging in itself. If the keys are not distinctive between disparate entities, many irrelevant records will be placed into the same block, which yields little improvement in efficiency. If the keys are not invariant with respect to records of the same entities, records of the same entity will be inserted into different blocks and many true matching record pairs will be missed. If the key values do not distribute evenly among the records, the largest few blocks will form the bottleneck of ER efficiency. When dealing with a large dataset, it is challenging to balance all these concerns. Moreover, the performance of blocking keys also depends on the accuracy of any data standardisation and cleansing performed [?].
In an ideal world, we would like to use signatures as the blocking key and place only records of the same entity into the same block. In practice, we do not know which subrecords are signatures but we can still approximate the strategy by blocking on probabilistically identified signatures. These probabilistic signatures tend to be empirically distinctive and exhibit low frequency in the database, which allows small and accurate blocks to be constructed. The only risk is that these blocking keys may not be invariant with respect to records of the same entities. To address this, we introduce an inter-block connected component algorithm, which is explained next.
Connected Components for Transitive Linkage. As discussed above, the blocking-by-probabilistic-signature technique leads to quite targeted blocking of records, with high precision but possibly low recall. This is in contrast to standard blocking techniques that tend to have low precision but high recall [?]. To compensate for the loss in recall, we allow each record to be inserted into multiple blocks, using the fact that each record may contain multiple distinct signatures. Moreover, to link records of the same entity that do not share the same signature, we allow two records in two different blocks to be linked if they are linked to the same third record in their own blocks. To implement such an indirect (transitive) link, we run a connected component algorithm to assign records connected directly or indirectly with the same label (entity identifier).
A particular challenge in our context is the size of the graphs we have to deal with. There are as many nodes as the number of records. Such a graph can be too large to fit into main memory. Random access to nodes in the graph, which is required by traditional depth/breadth-first search algorithms, might therefore not be feasible. To address this, we propose a connected-component labelling algorithm that fits large graphs that are stored in a distributed database. The algorithm uses standard relational database operations, such as grouping and join, in an iterative way and converges in linear time. This connected component operation allows us not only to use small-sized data blocks, but also to link highly inconsistent records of the same entity transitively.
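To illustrate how connected components can be labelled with join-like and group-by-like passes instead of graph traversal, the following Python sketch uses plain minimum-label propagation. This is a stand-in technique chosen for brevity, not the algorithm proposed in this paper: it may need as many passes as the longest path in a component, and no claim about its convergence rate is made. The function name `label_components` and the sample edges are hypothetical.

```python
def label_components(nodes, edges):
    """Label each node with the smallest node id in its connected component.

    Each pass mimics a join of the edge table with the label table followed
    by a group-by that keeps the minimum label per node; passes repeat until
    no label changes.
    """
    labels = {n: n for n in nodes}
    changed = True
    while changed:
        changed = False
        proposals = {}
        for a, b in edges:
            # "Join": look up the current labels of both endpoints.
            low = min(labels[a], labels[b])
            for n in (a, b):
                # "Group by node, take min": keep the smallest proposal.
                if low < proposals.get(n, labels[n]):
                    proposals[n] = low
        for n, low in proposals.items():
            if low < labels[n]:
                labels[n] = low
                changed = True
    return labels


# Hypothetical links: records 1-2 and 2-3 were linked in different blocks,
# so 1 and 3 are linked transitively; record 6 is linked to nothing.
edges = [(1, 2), (2, 3), (4, 5)]
labels = label_components(range(1, 7), edges)
print(labels)
```

Records 1, 2 and 3 end up with the same label even though records 1 and 3 never shared a block, which is the transitive linkage described above.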
Implementation on Parallel Databases. Massively parallel processing databases like Teradata and Greenplum have long supported parallelised SQL that scales to large datasets. Recent advances in large-scale in-database analytics platforms [?, ?] have shown how sophisticated machine learning algorithms can be implemented on top of a declarative language like SQL or MapReduce to scale to petabyte-sized datasets on cluster computing.
One merit of our proposed method is that it can be implemented on parallelised SQL using around ten SQL statements. As our experiments presented in Section LABEL:sec-experiments show, our algorithm can link datasets containing thousands of records in seconds, millions of records in minutes, and billions of records in hours on medium-sized clusters built using inexpensive commodity hardware.
Paper Contributions. The contribution of this paper is a novel ER algorithm that
introduces a probabilistic technique to identify, from unlabelled data, entity signatures derived from a first-principles formulation of the ER problem;
introduces a new and effective data-driven blocking technique based on the occurrence of common probabilistic signatures in two records;
incorporates a scalable connected-component labelling algorithm that uses inverted-index data structures and parallel databases to compute transitive linkages in large graphs (tens to hundreds of millions of nodes);
is simple and scalable, allowing the whole algorithm to be written in 10 standard SQL statements on modern parallel data platforms like Greenplum and Spark;
achieves state-of-the-art performance on several benchmark datasets and pushes the scalability boundary of existing ER algorithms.
Our paper also provides a positive answer to an open research problem raised by Papadakis and Palpanas [?] about the existence of scalable and accurate data-driven blocking algorithms.
The remainder of the paper is structured as follows. In the following section we review research related to our work. We then formulate the ER problem we aim to tackle, and in Section LABEL:sec-signatures we describe how we identify the signatures of entities in a probabilistic way. In Section LABEL:sec:cc we propose a scalable graph-labelling algorithm which we use to efficiently identify transitive links, and in Section LABEL:sec:signature-er we present the overall algorithm for signature-based ER. Experimental results are presented in Section LABEL:sec-experiments, followed by a general discussion in Section LABEL:sec:discussion and conclusion in Section LABEL:sec:conclusion.
Entity resolution (ER), also known as record linkage and data matching [?], has a long history, with the first computer-based techniques being developed over five decades ago [?, ?]. The major challenges of linkage quality and scalability have been ongoing as databases continue to grow in size and complexity, and more diverse databases have to be linked [?]. ER is a topic of research in a variety of domains, ranging from computer science [?, ?] and statistics [?] to the health and social sciences [?].
The ER process generally consists of three major steps [?]: blocking/indexing, record comparison, and classification. In the first step, as discussed earlier, the databases are split into blocks (or clusters), and in the second step pairs of records within the same blocks are compared with each other. Even after data cleansing and standardisation of the input databases (if applied) there can still be variations of and errors in the attribute values to be compared, and therefore approximate string comparison functions (such as edit distance, the Jaro-Winkler comparator, or Jaccard similarity [?]) are employed to compare pairs of records. Each compared record pair results in a vector of similarities (one similarity per attribute compared), and these similarity vectors are then used to classify record pairs into matches (where it is assumed the two records in a pair correspond to the same entity) and non-matches (where the records are assumed to correspond to two different entities). Various classification methods have been employed in ER [?, ?], ranging from simple threshold-based to sophisticated clustering, supervised classification techniques, as well as active learning approaches.
Traditional blocking [?] uses one or more attributes as a blocking key to insert records that share the same value in their blocking key into the same block. Only records within the same block are then compared with each other. To overcome variations and misspellings, the attribute values used in blocking keys are often phonetically encoded using functions such as Soundex or Double-Metaphone [?], which convert a string into a code according to how the string is pronounced. The same code is assigned to similar sounding names (such as ‘Dickson’ and ‘Dixon’). Multiple blocking keys may also be used to deal with the problem of missing attribute values [?].
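As an illustration of phonetic encoding, the following Python sketch implements the commonly described American Soundex rules (keep the first letter, map the remaining consonants to digits, collapse adjacent equal codes, pad to four characters). It is written from those general rules and is not taken from any system evaluated in this paper; the names `CODES` and `soundex` are our own.

```python
# Consonant groups and their Soundex digits; vowels, H, W and Y have no code.
CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the four-character American Soundex code of a name."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        # H and W do not separate letters with the same code; vowels do.
        if ch not in "HW":
            prev = digit
    return (code + "000")[:4]


print(soundex("Dickson"), soundex("Dixon"))
```

Both names map to the same code, so a blocking key built on the encoded surname would place them in the same block despite the spelling difference.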
An alternative to traditional blocking is the sorted neighbourhood method [?, ?], where the databases to be linked are sorted according to a sorting key (usually a concatenation of the values from several attributes), and a sliding window is moved over the databases. Only records within the window are then compared with each other. Another way to block databases is using canopy clustering [?], where a computationally efficient similarity measure (such as Jaccard similarity based on character q-grams as generated from attribute values) is used to insert records into one or more overlapping clusters, and records that are in the same cluster (block) are then compared with each other.
While these existing blocking techniques are schema-based and require a user to decide which attribute(s) to use for blocking, sorting or clustering, more recent work has investigated schema-agnostic approaches that generate some form of signature for each record automatically from all attribute values [?, ?]. While schema-agnostic approaches, such as ours, can be attractive as they do not require manual selection of blocking or sorting keys by domain experts, they can lead to sub-optimal blocking performance and might require additional meta-blocking steps [?] to achieve both high effectiveness and efficiency by, for example, removing blocks that are too large or that have a high overlap with other blocks.
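The sorted neighbourhood method can be sketched as follows in Python. The sketch assumes a single dataset, a sorting key formed by concatenating surname and first name in lower case, and a fixed window size; the records and the function name `sorted_neighbourhood_pairs` are hypothetical.

```python
def sorted_neighbourhood_pairs(records, sorting_key, window=3):
    """Return the set of record-id pairs that co-occur in a sliding window."""
    order = sorted(range(len(records)),
                   key=lambda rid: sorting_key(records[rid]))
    pairs = set()
    for i, rid in enumerate(order):
        # Pair each record with the next (window - 1) records in sort order.
        for other in order[i + 1:i + window]:
            pairs.add((min(rid, other), max(rid, other)))
    return pairs


# Hypothetical (surname, first name) records.
records = [
    ("Smith", "John"),
    ("Jones", "Mary"),
    ("Smyth", "John"),
    ("Smith", "Jon"),
]
pairs = sorted_neighbourhood_pairs(
    records, lambda r: (r[0] + r[1]).lower(), window=2)
print(sorted(pairs))
```

With a window of two, only neighbouring records in the sorted order are compared. The pair Smith/Smyth with first name John is missed here because another record sorts between them, which shows how sensitive the method is to the choice of sorting key and window size.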
One schema-agnostic approach to blocking is Locality Sensitive Hashing (LSH), as originally developed for efficient nearest-neighbour search in high-dimensional spaces [?]. LSH has been employed for blocking in ER by hashing attribute values multiple times and comparing records that share some hash values. One ER approach based on MinHash [?] and LSH is HARRA [?], which iteratively blocks, compares, and then merges records, where merged records are re-hashed to improve the overall ER quality. However, as a recent evaluation of blocking techniques has found [?], blocking based on LSH needs to be carefully tuned to a specific database in order to achieve both high effectiveness and efficiency. This requires high quality training data which is not available in many real-world ER applications.
Compared to existing approaches to ER, the distinguishing feature of our ER algorithm is a data-driven blocking-by-signature technique that deliberately trades off recall in favour of high precision. This is in contrast to the standard practice of trading off precision in favour of high recall in most existing blocking algorithms. To compensate for potential low recall resulting from our blocking technique, we introduce an additional Global Connected Component step into the ER process, which turns out to be efficiently computable. As we shall see, this slightly unusual combination of ideas yielded a new, simple algorithm that achieves state-of-the-art ER results on a range of datasets, both in terms of accuracy and scalability.
The ER problem is usually loosely defined as the problem of determining which records in a database refer to the same entities. This informal definition can hide many assumptions, especially on the meaning of the term “same entities”. To avoid confusion, we now define our ER setting in a more precise manner.
Definition. A possible world is a tuple $(W, R, E, D)$, where $W$ denotes a set of words; $R$ denotes the set of all records, where a record $r \in R$ is a sequence of words from $W$ (i.e. order matters); $E$ denotes a set of entity identifiers; and $D \subseteq E \times R$ is a subset of the Cartesian product between $E$ and $R$.
We say record $r \in R$ refers to entity $e \in E$, if $(e, r) \in D$. Note that an entity may be referred to by multiple (possibly inconsistent) records, and each record may refer to multiple entities, i.e., there are ambiguous records. Some records may belong to no entities in $E$. For example, John Smith, Sydney is likely referring to several individuals named John Smith who live in Sydney, and therefore this record is ambiguous as it can refer to any of them. On the other hand, in real-world databases there are often records that contain randomly generated, faked, or corrupted values, such as those used to test a system or that were intentionally modified (for example John Doe or (123) 456-7890) by a user who does not want to provide their actual personal details [?].
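A toy instance of the setting may help fix ideas. The Python sketch below represents the relation between entity identifiers and records as a set of (entity, record) pairs, with each record a tuple of words; the ordering of the pair, the identifiers `e1` and `e2`, and all sample records are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical relation: (entity identifier, record) pairs.
relation = {
    ("e1", ("John", "Smith", "Sydney")),
    ("e2", ("John", "Smith", "Sydney")),
    ("e1", ("John", "Smith", "45", "Elizabeth", "Street")),
}
# All records under consideration; ("John", "Doe") refers to no entity.
all_records = {record for _, record in relation} | {("John", "Doe")}


def entities_of(relation):
    """Map each record to the set of entities it refers to."""
    refs = defaultdict(set)
    for entity, record in relation:
        refs[record].add(entity)
    return refs


refs = entities_of(relation)
# Ambiguous records refer to more than one entity.
ambiguous = {record for record, entities in refs.items() if len(entities) > 1}
# Unreferenced records belong to no entity at all.
unreferenced = {record for record in all_records if record not in refs}
print(ambiguous, unreferenced)
```

In this toy world the record John Smith, Sydney is ambiguous, entity `e1` is referred to by two inconsistent records, and John Doe belongs to no entity.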
In practice, a possible world is only ‘knowable’ through a (finite) set of observations sampled from it.