Improved suffix blocking for record linkage and entity resolution

Amin Allam, Spiros Skiadopoulos, Panos Kalnis

Research output: Contribution to journalArticlepeer-review

15 Scopus citations

Abstract

Record linkage is the problem that identifies the different records that represent the same real-world object. Entity resolution is the problem that ensures that a real-world object is represented by a single record. The incremental versions of record linkage and entity resolution address the respective problems after the insertion of a new record in the dataset. Record linkage, entity resolution and their incremental versions are of paramount importance and arise in several contexts such as data warehouses, heterogeneous databases and data analysis. Blocking techniques are usually utilized to address these problems in order to avoid comparing all record pairs. Suffix blocking is one of the most efficient and accurate blocking techniques. In this paper, we consider the non-incremental variation of record linkage and present a method that is more than five times faster and achieves similar accuracy to the current state-of-the-art suffix-based blocking method. Then, we consider the incremental variation of record linkage and propose a novel incremental suffix-based blocking mechanism that outperforms existing incremental blocking methods in terms of blocking accuracy and efficiency. Finally, we consider incremental entity resolution and present two novel techniques based on suffix blocking that are able to handle the tested dataset in a few seconds (while a current state-of-the-art technique requires more than eight hours). Our second technique proposes a novel method that keeps a history of the deleted records and the merging process. Thus, we are able to discover alternative matches for the inserted record that are not possible for existing methods and improve the accuracy of the algorithm. We have implemented and extensively experimentally evaluated all our methods. We offer two implementations of our proposals. The first one is memory-based and offers the best efficiency while the second one is disk-based and scales seamlessly to very large datasets.
Original languageEnglish (US)
Pages (from-to)98-113
Number of pages16
JournalData & Knowledge Engineering
Volume117
DOIs
StatePublished - Jul 30 2018

Bibliographical note

KAUST Repository Item: Exported on 2020-10-01
Acknowledgements: This work was supported by King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia.

Fingerprint

Dive into the research topics of 'Improved suffix blocking for record linkage and entity resolution'. Together they form a unique fingerprint.

Cite this