Benchmarking Filtering Techniques for Entity Resolution

Research output: Chapter in book/report/conference proceeding; Conference contribution; Research; peer review

Authors

  • George Papadakis
  • Marco Fisichella
  • Franziska Schoger
  • George Mandilaras
  • Nikolaus Augsten
  • Wolfgang Nejdl

External Research Organisations

  • University of Athens
  • University of Salzburg

Details

Original language: English
Title of host publication: 2023 IEEE 39th International Conference on Data Engineering
Subtitle of host publication: ICDE 2023
Publisher: IEEE Computer Society
Pages: 653-666
Number of pages: 14
ISBN (electronic): 9798350322279
Publication status: Published - 2023
Event: 39th IEEE International Conference on Data Engineering, ICDE 2023 - Anaheim, United States
Duration: 3 Apr 2023 to 7 Apr 2023

Publication series

Name: Proceedings - International Conference on Data Engineering
Volume: 2023-April
ISSN (Print): 1084-4627

Abstract

Entity Resolution is the task of identifying pairs of entity profiles that represent the same real-world object. To avoid checking a quadratic number of entity pairs, various filtering techniques have been proposed that fall into two main categories: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor methods convert all entity profiles into vectors and identify the closest ones to every query entity. Unfortunately, the main techniques from these two categories have rarely been compared in the literature and, thus, their relative performance is unknown. We perform the first systematic experimental study that investigates the relative performance of the main representatives per category over numerous established datasets. Comparing techniques from different categories turns out to be a non-trivial task due to the various configuration parameters that are hard to fine-tune, but have a significant impact on performance. We consider a plethora of parameter configurations, optimizing each technique with respect to recall and precision targets. Both schema-agnostic and schema-based settings are evaluated. The experimental results provide novel insights into the effectiveness, the time efficiency and the scalability of the considered techniques.
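
The two filtering categories named in the abstract can be illustrated with a minimal Python sketch. It is not the implementation evaluated in the paper: the toy profiles, the ground truth, and the function names (`token_blocking`, `knn_filtering`, `recall_precision`) are hypothetical, and Jaccard similarity over token sets stands in for the vector representations that nearest-neighbor methods would typically use. Recall and precision are computed over the resulting candidate pairs.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy profiles: id -> concatenated attribute values
# (a schema-agnostic view of each entity profile).
PROFILES = {
    0: "apple iphone 12 black 64gb",
    1: "iphone 12 64gb apple",
    2: "samsung galaxy s21 black",
    3: "galaxy s21 samsung 128gb",
    4: "nokia 3310 classic",
}
# Hypothetical ground truth: pairs that describe the same real-world object.
TRUE_MATCHES = {(0, 1), (2, 3)}


def tokens(text):
    """Signature of a profile: its set of lowercase whitespace tokens."""
    return set(text.lower().split())


def token_blocking(profiles):
    """Blocking: one block per token; candidates are pairs sharing a block."""
    blocks = defaultdict(set)
    for pid, text in profiles.items():
        for tok in tokens(text):
            blocks[tok].add(pid)
    candidates = set()
    for members in blocks.values():
        candidates.update(combinations(sorted(members), 2))
    return candidates


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_filtering(profiles, k=1):
    """Nearest-neighbor filtering: pair each profile with its k most similar ones."""
    sigs = {pid: tokens(text) for pid, text in profiles.items()}
    candidates = set()
    for pid, sig in sigs.items():
        others = [(jaccard(sig, sigs[o]), o) for o in sigs if o != pid]
        # Highest similarity first; ties broken by smaller id for determinism.
        others.sort(key=lambda t: (-t[0], t[1]))
        for sim, other in others[:k]:
            if sim > 0:  # skip neighbors with no overlap at all
                candidates.add((min(pid, other), max(pid, other)))
    return candidates


def recall_precision(candidates, true_matches):
    """Recall and precision of a candidate set against the ground truth."""
    hits = len(candidates & true_matches)
    recall = hits / len(true_matches)
    precision = hits / len(candidates) if candidates else 0.0
    return recall, precision


if __name__ == "__main__":
    for name, cands in [
        ("token blocking", token_blocking(PROFILES)),
        ("kNN (k=1)", knn_filtering(PROFILES, k=1)),
    ]:
        print(name, sorted(cands), recall_precision(cands, TRUE_MATCHES))
```

On this toy data, blocking admits one extra non-matching pair because profiles 0 and 2 share the token "black", which lowers its precision, while the nearest-neighbor variant depends on the choice of `k`. This is the kind of configuration parameter the abstract describes as hard to fine-tune.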

Keywords

    deduplication, nearest neighbors, record linkage

Cite this

Benchmarking Filtering Techniques for Entity Resolution. / Papadakis, George; Fisichella, Marco; Schoger, Franziska et al.
2023 IEEE 39th International Conference on Data Engineering: ICDE 2023. IEEE Computer Society, 2023. p. 653-666 (Proceedings - International Conference on Data Engineering; Vol. 2023-April).

Papadakis, G, Fisichella, M, Schoger, F, Mandilaras, G, Augsten, N & Nejdl, W 2023, Benchmarking Filtering Techniques for Entity Resolution. in 2023 IEEE 39th International Conference on Data Engineering: ICDE 2023. Proceedings - International Conference on Data Engineering, vol. 2023-April, IEEE Computer Society, pp. 653-666, 39th IEEE International Conference on Data Engineering, ICDE 2023, Anaheim, United States, 3 Apr 2023. https://doi.org/10.48550/arXiv.2202.12521, https://doi.org/10.1109/ICDE55515.2023.00389
Papadakis, G., Fisichella, M., Schoger, F., Mandilaras, G., Augsten, N., & Nejdl, W. (2023). Benchmarking Filtering Techniques for Entity Resolution. In 2023 IEEE 39th International Conference on Data Engineering: ICDE 2023 (pp. 653-666). (Proceedings - International Conference on Data Engineering; Vol. 2023-April). IEEE Computer Society. https://doi.org/10.48550/arXiv.2202.12521, https://doi.org/10.1109/ICDE55515.2023.00389
Papadakis G, Fisichella M, Schoger F, Mandilaras G, Augsten N, Nejdl W. Benchmarking Filtering Techniques for Entity Resolution. In 2023 IEEE 39th International Conference on Data Engineering: ICDE 2023. IEEE Computer Society. 2023. p. 653-666. (Proceedings - International Conference on Data Engineering). doi: 10.48550/arXiv.2202.12521, 10.1109/ICDE55515.2023.00389
Papadakis, George ; Fisichella, Marco ; Schoger, Franziska et al. / Benchmarking Filtering Techniques for Entity Resolution. 2023 IEEE 39th International Conference on Data Engineering: ICDE 2023. IEEE Computer Society, 2023. pp. 653-666 (Proceedings - International Conference on Data Engineering).
@inproceedings{fc69fde9c2094a8caf22b377f7805801,
title = "Benchmarking Filtering Techniques for Entity Resolution",
abstract = "Entity Resolution is the task of identifying pairs of entity profiles that represent the same real-world object. To avoid checking a quadratic number of entity pairs, various filtering techniques have been proposed that fall into two main categories: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor methods convert all entity profiles into vectors and identify the closest ones to every query entity. Unfortunately, the main techniques from these two categories have rarely been compared in the literature and, thus, their relative performance is unknown. We perform the first systematic experimental study that investigates the relative performance of the main representatives per category over numerous established datasets. Comparing techniques from different categories turns out to be a non-trivial task due to the various configuration parameters that are hard to fine-tune, but have a significant impact on performance. We consider a plethora of parameter configurations, optimizing each technique with respect to recall and precision targets. Both schema-agnostic and schema-based settings are evaluated. The experimental results provide novel insights into the effectiveness, the time efficiency and the scalability of the considered techniques.",
keywords = "deduplication, nearest neighbors, record linkage",
author = "George Papadakis and Marco Fisichella and Franziska Schoger and George Mandilaras and Nikolaus Augsten and Wolfgang Nejdl",
note = "Funding Information: In the future, we will enrich the Continuous Benchmark of Filtering methods for ER with new datasets and will update the rankings per dataset with new filtering methods. Acknowledgements. This research was partially funded by the Austrian Science Fund (FWF) P 34962, EU Horizon Europe GA No 101070122 (STELAR), the Hellenic Foundation for Research and Innovation (Project Number: HFRI-FM17-2351 GeoQA) and the European Commission through the xAIM project, agreement No INEA/CEF/ICT/A2020/2276680. For the purpose of open access, the authors have applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission. ; 39th IEEE International Conference on Data Engineering, ICDE 2023 ; Conference date: 03-04-2023 Through 07-04-2023",
year = "2023",
doi = "10.48550/arXiv.2202.12521",
language = "English",
series = "Proceedings - International Conference on Data Engineering",
publisher = "IEEE Computer Society",
pages = "653--666",
booktitle = "2023 IEEE 39th International Conference on Data Engineering",
address = "United States",

}

TY  - GEN
T1  - Benchmarking Filtering Techniques for Entity Resolution
AU  - Papadakis, George
AU  - Fisichella, Marco
AU  - Schoger, Franziska
AU  - Mandilaras, George
AU  - Augsten, Nikolaus
AU  - Nejdl, Wolfgang
N1  - Funding Information: In the future, we will enrich the Continuous Benchmark of Filtering methods for ER with new datasets and will update the rankings per dataset with new filtering methods. Acknowledgements. This research was partially funded by the Austrian Science Fund (FWF) P 34962, EU Horizon Europe GA No 101070122 (STELAR), the Hellenic Foundation for Research and Innovation (Project Number: HFRI-FM17-2351 GeoQA) and the European Commission through the xAIM project, agreement No INEA/CEF/ICT/A2020/2276680. For the purpose of open access, the authors have applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.
PY  - 2023
Y1  - 2023
N2  - Entity Resolution is the task of identifying pairs of entity profiles that represent the same real-world object. To avoid checking a quadratic number of entity pairs, various filtering techniques have been proposed that fall into two main categories: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor methods convert all entity profiles into vectors and identify the closest ones to every query entity. Unfortunately, the main techniques from these two categories have rarely been compared in the literature and, thus, their relative performance is unknown. We perform the first systematic experimental study that investigates the relative performance of the main representatives per category over numerous established datasets. Comparing techniques from different categories turns out to be a non-trivial task due to the various configuration parameters that are hard to fine-tune, but have a significant impact on performance. We consider a plethora of parameter configurations, optimizing each technique with respect to recall and precision targets. Both schema-agnostic and schema-based settings are evaluated. The experimental results provide novel insights into the effectiveness, the time efficiency and the scalability of the considered techniques.
AB  - Entity Resolution is the task of identifying pairs of entity profiles that represent the same real-world object. To avoid checking a quadratic number of entity pairs, various filtering techniques have been proposed that fall into two main categories: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor methods convert all entity profiles into vectors and identify the closest ones to every query entity. Unfortunately, the main techniques from these two categories have rarely been compared in the literature and, thus, their relative performance is unknown. We perform the first systematic experimental study that investigates the relative performance of the main representatives per category over numerous established datasets. Comparing techniques from different categories turns out to be a non-trivial task due to the various configuration parameters that are hard to fine-tune, but have a significant impact on performance. We consider a plethora of parameter configurations, optimizing each technique with respect to recall and precision targets. Both schema-agnostic and schema-based settings are evaluated. The experimental results provide novel insights into the effectiveness, the time efficiency and the scalability of the considered techniques.
KW  - deduplication
KW  - nearest neighbors
KW  - record linkage
UR  - http://www.scopus.com/inward/record.url?scp=85160317336&partnerID=8YFLogxK
U2  - 10.48550/arXiv.2202.12521
DO  - 10.48550/arXiv.2202.12521
M3  - Conference contribution
AN  - SCOPUS:85160317336
T3  - Proceedings - International Conference on Data Engineering
SP  - 653
EP  - 666
BT  - 2023 IEEE 39th International Conference on Data Engineering
PB  - IEEE Computer Society
T2  - 39th IEEE International Conference on Data Engineering, ICDE 2023
Y2  - 3 April 2023 through 7 April 2023
ER  - 
