Benchmarking Filtering Techniques for Entity Resolution

Publication: Contribution to book/report/anthology/conference proceedings; conference paper; research; peer-reviewed

External organisations

  • University of Athens
  • Universität Salzburg

Details

Original language: English
Title of host publication: 2023 IEEE 39th International Conference on Data Engineering
Subtitle: ICDE 2023
Publisher: IEEE Computer Society
Pages: 653-666
Number of pages: 14
ISBN (electronic): 9798350322279
Publication status: Published - 2023
Event: 39th IEEE International Conference on Data Engineering, ICDE 2023 - Anaheim, United States
Duration: 3 Apr 2023 to 7 Apr 2023

Publication series

Name: Proceedings - International Conference on Data Engineering
Volume: 2023-April
ISSN (Print): 1084-4627

Abstract

Entity Resolution is the task of identifying pairs of entity profiles that represent the same real-world object. To avoid checking a quadratic number of entity pairs, various filtering techniques have been proposed that fall into two main categories: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor methods convert all entity profiles into vectors and identify the closest ones to every query entity. Unfortunately, the main techniques from these two categories have rarely been compared in the literature and, thus, their relative performance is unknown. We perform the first systematic experimental study that investigates the relative performance of the main representatives per category over numerous established datasets. Comparing techniques from different categories turns out to be a non-trivial task due to the various configuration parameters that are hard to fine-tune, but have a significant impact on performance. We consider a plethora of parameter configurations, optimizing each technique with respect to recall and precision targets. Both schema-agnostic and schema-based settings are evaluated. The experimental results provide novel insights into the effectiveness, the time efficiency and the scalability of the considered techniques.
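
The two filtering categories named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper's implementation or datasets: the toy product descriptions, the choice of token blocking as the blocking representative, and brute-force cosine similarity over character trigrams as a stand-in for a nearest-neighbor method.

```python
from collections import Counter, defaultdict
from itertools import combinations
import math

# Hypothetical toy profiles (not from the paper's datasets).
PROFILES = {
    0: "Apple iPhone 13 128GB blue",
    1: "iPhone 13 Apple 128 GB (blue)",
    2: "Samsung Galaxy S21 grey",
    3: "Galaxy S21 by Samsung, grey",
    4: "Nokia 3310 classic",
}


def tokens(text):
    """Lower-cased alphanumeric tokens, used here as blocking signatures."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return set(cleaned.split())


def token_blocking(profiles):
    """One block per token; candidates are all pairs that share a block."""
    blocks = defaultdict(set)
    for pid, text in profiles.items():
        for tok in tokens(text):
            blocks[tok].add(pid)
    candidates = set()
    for ids in blocks.values():
        candidates.update(combinations(sorted(ids), 2))
    return candidates


def trigram_vector(text):
    """Sparse bag of character trigrams over the whitespace-normalized text."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(w * v.get(g, 0) for g, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def knn_candidates(profiles, k=1):
    """Brute-force k-nearest-neighbor search: each profile queries all others."""
    vectors = {pid: trigram_vector(text) for pid, text in profiles.items()}
    candidates = set()
    for q, qv in vectors.items():
        others = [p for p in vectors if p != q]
        ranked = sorted(others, key=lambda p: cosine(qv, vectors[p]), reverse=True)
        for p in ranked[:k]:
            candidates.add((min(q, p), max(q, p)))
    return candidates


blocking_pairs = token_blocking(PROFILES)
knn_pairs = knn_candidates(PROFILES, k=1)
print("token blocking:", sorted(blocking_pairs))
print("kNN (k=1):     ", sorted(knn_pairs))
```

In this toy setting both functions avoid scoring all ten possible pairs as matches, but they behave differently for the unrelated profile 4: token blocking proposes no pair for it, while the k=1 search still returns one neighbor, because it always returns k results regardless of similarity. The signature function and k are examples of the kind of configuration parameters whose tuning the abstract describes as hard.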

Cite this

Benchmarking Filtering Techniques for Entity Resolution. / Papadakis, George; Fisichella, Marco; Schoger, Franziska et al.
2023 IEEE 39th International Conference on Data Engineering: ICDE 2023. IEEE Computer Society, 2023. pp. 653-666 (Proceedings - International Conference on Data Engineering; Vol. 2023-April).

Papadakis, G, Fisichella, M, Schoger, F, Mandilaras, G, Augsten, N & Nejdl, W 2023, Benchmarking Filtering Techniques for Entity Resolution. in 2023 IEEE 39th International Conference on Data Engineering: ICDE 2023. Proceedings - International Conference on Data Engineering, vol. 2023-April, IEEE Computer Society, pp. 653-666, 39th IEEE International Conference on Data Engineering, ICDE 2023, Anaheim, United States, 3 Apr. 2023. https://doi.org/10.48550/arXiv.2202.12521, https://doi.org/10.1109/ICDE55515.2023.00389
Papadakis, G., Fisichella, M., Schoger, F., Mandilaras, G., Augsten, N., & Nejdl, W. (2023). Benchmarking Filtering Techniques for Entity Resolution. In 2023 IEEE 39th International Conference on Data Engineering: ICDE 2023 (pp. 653-666). (Proceedings - International Conference on Data Engineering; Vol. 2023-April). IEEE Computer Society. https://doi.org/10.48550/arXiv.2202.12521, https://doi.org/10.1109/ICDE55515.2023.00389
Papadakis G, Fisichella M, Schoger F, Mandilaras G, Augsten N, Nejdl W. Benchmarking Filtering Techniques for Entity Resolution. in 2023 IEEE 39th International Conference on Data Engineering: ICDE 2023. IEEE Computer Society. 2023. p. 653-666. (Proceedings - International Conference on Data Engineering). doi: 10.48550/arXiv.2202.12521, 10.1109/ICDE55515.2023.00389
Papadakis, George ; Fisichella, Marco ; Schoger, Franziska et al. / Benchmarking Filtering Techniques for Entity Resolution. 2023 IEEE 39th International Conference on Data Engineering: ICDE 2023. IEEE Computer Society, 2023. pp. 653-666 (Proceedings - International Conference on Data Engineering).
@inproceedings{fc69fde9c2094a8caf22b377f7805801,
title = "Benchmarking Filtering Techniques for Entity Resolution",
abstract = "Entity Resolution is the task of identifying pairs of entity profiles that represent the same real-world object. To avoid checking a quadratic number of entity pairs, various filtering techniques have been proposed that fall into two main categories: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor methods convert all entity profiles into vectors and identify the closest ones to every query entity. Unfortunately, the main techniques from these two categories have rarely been compared in the literature and, thus, their relative performance is unknown. We perform the first systematic experimental study that investigates the relative performance of the main representatives per category over numerous established datasets. Comparing techniques from different categories turns out to be a non-trivial task due to the various configuration parameters that are hard to fine-tune, but have a significant impact on performance. We consider a plethora of parameter configurations, optimizing each technique with respect to recall and precision targets. Both schema-agnostic and schema-based settings are evaluated. The experimental results provide novel insights into the effectiveness, the time efficiency and the scalability of the considered techniques.",
keywords = "deduplication, nearest neighbors, record linkage",
author = "George Papadakis and Marco Fisichella and Franziska Schoger and George Mandilaras and Nikolaus Augsten and Wolfgang Nejdl",
note = "Funding Information: In the future, we will enrich the Continuous Benchmark of Filtering methods for ER with new datasets and will update the rankings per dataset with new filtering methods. Acknowledgements. This research was partially funded by the Austrian Science Fund (FWF) P 34962, EU Horizon Europe GA No 101070122 (STELAR), the Hellenic Foundation for Research and Innovation (Project Number: HFRI-FM17-2351 GeoQA) and the European Commission through the xAIM project, agreement No INEA/CEF/ICT/A2020/ 2276680. For the purpose of open access, the authors have applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission. ; 39th IEEE International Conference on Data Engineering, ICDE 2023 ; Conference date: 03-04-2023 Through 07-04-2023",
year = "2023",
doi = "10.48550/arXiv.2202.12521",
language = "English",
series = "Proceedings - International Conference on Data Engineering",
publisher = "IEEE Computer Society",
pages = "653--666",
booktitle = "2023 IEEE 39th International Conference on Data Engineering",
address = "United States",

}

TY - GEN
T1 - Benchmarking Filtering Techniques for Entity Resolution
AU - Papadakis, George
AU - Fisichella, Marco
AU - Schoger, Franziska
AU - Mandilaras, George
AU - Augsten, Nikolaus
AU - Nejdl, Wolfgang

N1 - Funding Information: In the future, we will enrich the Continuous Benchmark of Filtering methods for ER with new datasets and will update the rankings per dataset with new filtering methods. Acknowledgements. This research was partially funded by the Austrian Science Fund (FWF) P 34962, EU Horizon Europe GA No 101070122 (STELAR), the Hellenic Foundation for Research and Innovation (Project Number: HFRI-FM17-2351 GeoQA) and the European Commission through the xAIM project, agreement No INEA/CEF/ICT/A2020/ 2276680. For the purpose of open access, the authors have applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.

PY - 2023

Y1 - 2023

AB - Entity Resolution is the task of identifying pairs of entity profiles that represent the same real-world object. To avoid checking a quadratic number of entity pairs, various filtering techniques have been proposed that fall into two main categories: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor methods convert all entity profiles into vectors and identify the closest ones to every query entity. Unfortunately, the main techniques from these two categories have rarely been compared in the literature and, thus, their relative performance is unknown. We perform the first systematic experimental study that investigates the relative performance of the main representatives per category over numerous established datasets. Comparing techniques from different categories turns out to be a non-trivial task due to the various configuration parameters that are hard to fine-tune, but have a significant impact on performance. We consider a plethora of parameter configurations, optimizing each technique with respect to recall and precision targets. Both schema-agnostic and schema-based settings are evaluated. The experimental results provide novel insights into the effectiveness, the time efficiency and the scalability of the considered techniques.

KW - deduplication
KW - nearest neighbors
KW - record linkage
UR - http://www.scopus.com/inward/record.url?scp=85160317336&partnerID=8YFLogxK
U2 - 10.48550/arXiv.2202.12521
DO - 10.48550/arXiv.2202.12521
M3 - Conference contribution
AN - SCOPUS:85160317336
T3 - Proceedings - International Conference on Data Engineering
SP - 653
EP - 666
BT - 2023 IEEE 39th International Conference on Data Engineering
PB - IEEE Computer Society
T2 - 39th IEEE International Conference on Data Engineering, ICDE 2023
Y2 - 3 April 2023 through 7 April 2023
ER -
