Details
Original language | English |
---|---|
Title of host publication | 2023 IEEE 39th International Conference on Data Engineering |
Subtitle of host publication | ICDE 2023 |
Publisher | IEEE Computer Society |
Pages | 653-666 |
Number of pages | 14 |
ISBN (electronic) | 9798350322279 |
Publication status | Published - 2023 |
Event | 39th IEEE International Conference on Data Engineering, ICDE 2023 - Anaheim, United States; Duration: 3 Apr 2023 → 7 Apr 2023 |
Publication series
Name | Proceedings - International Conference on Data Engineering |
---|---|
Volume | 2023-April |
ISSN (Print) | 1084-4627 |
Abstract
Entity Resolution is the task of identifying pairs of entity profiles that represent the same real-world object. To avoid checking a quadratic number of entity pairs, various filtering techniques have been proposed that fall into two main categories: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor methods convert all entity profiles into vectors and identify the closest ones to every query entity. Unfortunately, the main techniques from these two categories have rarely been compared in the literature and, thus, their relative performance is unknown. We perform the first systematic experimental study that investigates the relative performance of the main representatives per category over numerous established datasets. Comparing techniques from different categories turns out to be a non-trivial task due to the various configuration parameters that are hard to fine-tune, but have a significant impact on performance. We consider a plethora of parameter configurations, optimizing each technique with respect to recall and precision targets. Both schema-agnostic and schema-based settings are evaluated. The experimental results provide novel insights into the effectiveness, the time efficiency and the scalability of the considered techniques.
Keywords
- deduplication, nearest neighbors, record linkage
ASJC Scopus subject areas
- Computer Science(all)
- Software
- Signal Processing
- Information Systems
Cite this
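Benchmarking Filtering Techniques for Entity Resolution. / Papadakis, George; Fisichella, Marco; Schoger, Franziska; Mandilaras, George; Augsten, Nikolaus; Nejdl, Wolfgang. 2023 IEEE 39th International Conference on Data Engineering: ICDE 2023. IEEE Computer Society, 2023. p. 653-666 (Proceedings - International Conference on Data Engineering; Vol. 2023-April).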
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Benchmarking Filtering Techniques for Entity Resolution
AU - Papadakis, George
AU - Fisichella, Marco
AU - Schoger, Franziska
AU - Mandilaras, George
AU - Augsten, Nikolaus
AU - Nejdl, Wolfgang
N1 - Funding Information: In the future, we will enrich the Continuous Benchmark of Filtering methods for ER with new datasets and will update the rankings per dataset with new filtering methods. Acknowledgements. This research was partially funded by the Austrian Science Fund (FWF) P 34962, EU Horizon Europe GA No 101070122 (STELAR), the Hellenic Foundation for Research and Innovation (Project Number: HFRI-FM17-2351 GeoQA) and the European Commission through the xAIM project, agreement No INEA/CEF/ICT/A2020/2276680. For the purpose of open access, the authors have applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.
PY - 2023
Y1 - 2023
N2 - Entity Resolution is the task of identifying pairs of entity profiles that represent the same real-world object. To avoid checking a quadratic number of entity pairs, various filtering techniques have been proposed that fall into two main categories: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor methods convert all entity profiles into vectors and identify the closest ones to every query entity. Unfortunately, the main techniques from these two categories have rarely been compared in the literature and, thus, their relative performance is unknown. We perform the first systematic experimental study that investigates the relative performance of the main representatives per category over numerous established datasets. Comparing techniques from different categories turns out to be a non-trivial task due to the various configuration parameters that are hard to fine-tune, but have a significant impact on performance. We consider a plethora of parameter configurations, optimizing each technique with respect to recall and precision targets. Both schema-agnostic and schema-based settings are evaluated. The experimental results provide novel insights into the effectiveness, the time efficiency and the scalability of the considered techniques.
KW - deduplication
KW - nearest neighbors
KW - record linkage
UR - http://www.scopus.com/inward/record.url?scp=85160317336&partnerID=8YFLogxK
U2 - 10.48550/arXiv.2202.12521
DO - 10.48550/arXiv.2202.12521
M3 - Conference contribution
AN - SCOPUS:85160317336
T3 - Proceedings - International Conference on Data Engineering
SP - 653
EP - 666
BT - 2023 IEEE 39th International Conference on Data Engineering
PB - IEEE Computer Society
T2 - 39th IEEE International Conference on Data Engineering, ICDE 2023
Y2 - 3 April 2023 through 7 April 2023
ER -