Details
Original language | English |
---|---|
Pages (from-to) | 1671-1696 |
Number of pages | 26 |
Journal | VLDB Journal |
Volume | 33 |
Issue number | 5 |
Early online date | 9 Jul 2024 |
Publication status | Published - Sept 2024 |
Abstract
Entity Resolution identifies entity profiles that represent the same real-world object. A brute-force approach that considers all pairs of entities suffers from quadratic time complexity. To ameliorate this issue, filtering techniques reduce the search space to highly similar and, thus, highly likely matches. Such techniques come in two forms: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor workflows convert all entity profiles into vectors and detect the ones closest to every query entity. The main techniques of these two types have never been juxtaposed in a systematic way and, thus, their relative performance is unknown. To cover this gap, we perform an extensive experimental study that investigates the relative performance of the main representatives per type over numerous established datasets. Comparing techniques of different types in a fair way is a non-trivial task, because the configuration parameters of each approach have a significant impact on its performance, but are hard to fine-tune. We consider a plethora of parameter configurations per method, optimizing each workflow with respect to recall and precision in both schema-agnostic and schema-aware settings. The experimental results provide novel insights into the effectiveness, the time efficiency, the memory footprint, and the scalability of the considered techniques.
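The two filtering families named in the abstract can be illustrated with a minimal Python sketch. It is not the benchmark's code: the toy profiles, the choice of whitespace tokens as blocking signatures, the character-trigram vectors, and all function names (`token_blocking`, `knn_filtering`, `recall_precision`) are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt

# Toy entity profiles and ground truth (hypothetical, not from the benchmark).
profiles = {
    0: "Apple iPhone 13 128GB",
    1: "iphone 13 by apple 128 gb",
    2: "Samsung Galaxy S21",
    3: "galaxy s21 samsung phone",
    4: "Nokia 3310",
}
true_matches = {(0, 1), (2, 3)}


def token_blocking(profiles):
    """Blocking: each lowercase token is a signature; profiles that share
    a signature land in the same block and become candidate pairs."""
    blocks = defaultdict(set)
    for pid, text in profiles.items():
        for tok in set(text.lower().split()):
            blocks[tok].add(pid)
    candidates = set()
    for members in blocks.values():
        candidates.update(combinations(sorted(members), 2))
    return candidates


def trigram_vector(text):
    """Sparse vector of character-trigram counts (an assumed embedding)."""
    s = text.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a, b):
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def knn_filtering(profiles, k=1):
    """Nearest-neighbor filtering: vectorize every profile, then keep the
    k most similar profiles for each query profile as candidate pairs."""
    vectors = {pid: trigram_vector(text) for pid, text in profiles.items()}
    candidates = set()
    for q, qv in vectors.items():
        ranked = sorted((p for p in vectors if p != q),
                        key=lambda p: cosine(qv, vectors[p]), reverse=True)
        for p in ranked[:k]:
            candidates.add((min(q, p), max(q, p)))
    return candidates


def recall_precision(candidates, true_matches):
    """Recall: share of true matches kept; precision: share of candidates that match."""
    found = len(candidates & true_matches)
    recall = found / len(true_matches)
    precision = found / len(candidates) if candidates else 0.0
    return recall, precision


brute_force_pairs = len(list(combinations(profiles, 2)))  # n * (n - 1) / 2
for name, cands in [("blocking", token_blocking(profiles)),
                    ("kNN", knn_filtering(profiles, k=1))]:
    print(name, len(cands), "of", brute_force_pairs, recall_precision(cands, true_matches))
```

In this sketch, the token set and `k` play the role of the configuration parameters that the abstract describes as influential but hard to fine-tune. Note that the nearest-neighbor variant returns `k` candidates for every query profile, even for one with no similar counterpart.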
Keywords
- Blocking, Entity resolution, Filtering, Nearest neighbors
ASJC Scopus subject areas
- Computer Science(all)
- Information Systems
- Hardware and Architecture
Cite this
In: VLDB Journal, Vol. 33, No. 5, 09.2024, p. 1671-1696.
Research output: Contribution to journal › Article › Research › peer review
TY - JOUR
T1 - Open benchmark for filtering techniques in entity resolution
AU - Neuhof, Franziska
AU - Fisichella, Marco
AU - Papadakis, George
AU - Nikoletos, Konstantinos
AU - Augsten, Nikolaus
AU - Nejdl, Wolfgang
AU - Koubarakis, Manolis
N1 - Publisher Copyright: © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.
PY - 2024/9
Y1 - 2024/9
AB - Entity Resolution identifies entity profiles that represent the same real-world object. A brute-force approach that considers all pairs of entities suffers from quadratic time complexity. To ameliorate this issue, filtering techniques reduce the search space to highly similar and, thus, highly likely matches. Such techniques come in two forms: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor workflows convert all entity profiles into vectors and detect the ones closest to every query entity. The main techniques of these two types have never been juxtaposed in a systematic way and, thus, their relative performance is unknown. To cover this gap, we perform an extensive experimental study that investigates the relative performance of the main representatives per type over numerous established datasets. Comparing techniques of different types in a fair way is a non-trivial task, because the configuration parameters of each approach have a significant impact on its performance, but are hard to fine-tune. We consider a plethora of parameter configurations per method, optimizing each workflow with respect to recall and precision in both schema-agnostic and schema-aware settings. The experimental results provide novel insights into the effectiveness, the time efficiency, the memory footprint, and the scalability of the considered techniques.
KW - Blocking
KW - Entity resolution
KW - Filtering
KW - Nearest neighbors
UR - http://www.scopus.com/inward/record.url?scp=85198121701&partnerID=8YFLogxK
U2 - 10.1007/s00778-024-00868-7
DO - 10.1007/s00778-024-00868-7
M3 - Article
AN - SCOPUS:85198121701
VL - 33
SP - 1671
EP - 1696
JO - VLDB Journal
JF - VLDB Journal
SN - 1066-8888
IS - 5
ER -