Open benchmark for filtering techniques in entity resolution

Research output: Contribution to journal › Article › Research › peer review

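Authors

  • Franziska Neuhof
  • Marco Fisichella
  • George Papadakis
  • Konstantinos Nikoletos
  • Nikolaus Augsten
  • Wolfgang Nejdl
  • Manolis Koubarakis
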
External Research Organisations

  • University of Athens
  • University of Salzburg

Details

Original language: English
Pages (from-to): 1671-1696
Number of pages: 26
Journal: VLDB Journal
Volume: 33
Issue number: 5
Early online date: 9 Jul 2024
Publication status: Published - Sept 2024

Abstract

Entity Resolution identifies entity profiles that represent the same real-world object. A brute-force approach that considers all pairs of entities suffers from quadratic time complexity. To ameliorate this issue, filtering techniques reduce the search space to highly similar and, thus, highly likely matches. Such techniques come in two forms: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor workflows convert all entity profiles into vectors and detect the ones closest to every query entity. The main techniques of these two types have never been juxtaposed in a systematic way and, thus, their relative performance is unknown. To cover this gap, we perform an extensive experimental study that investigates the relative performance of the main representatives per type over numerous established datasets. Comparing techniques of different types in a fair way is a non-trivial task, because the configuration parameters of each approach have a significant impact on its performance, but are hard to fine-tune. We consider a plethora of parameter configurations per method, optimizing each workflow with respect to recall and precision in both schema-agnostic and schema-aware settings. The experimental results provide novel insights into the effectiveness, the time efficiency, the memory footprint, and the scalability of the considered techniques.
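
The two filtering families named in the abstract can be illustrated with a minimal sketch. It is not the benchmark's code: the toy profiles, the ground truth, and the helper names (tokens, token_blocking_pairs, knn_pairs, recall_precision) are all assumptions made for illustration. Token blocking stands in for the blocking workflows, and an exact cosine top-k search over token-count vectors stands in for the nearest-neighbor workflows; real nearest-neighbor workflows would typically use an index rather than the exhaustive search shown here.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy profiles and ground truth (not taken from the benchmark).
profiles = {
    0: "Apple iPhone 13 128GB",
    1: "iPhone 13 by Apple, 128 GB",
    2: "Samsung Galaxy S21",
    3: "Galaxy S21 Samsung smartphone",
    4: "Apple AirPods Pro",
}
true_matches = {(0, 1), (2, 3)}


def tokens(text):
    """Schema-agnostic signature: lowercase tokens of the whole profile."""
    return {t.strip(",.").lower() for t in text.split()} - {""}


def brute_force_pairs(ids):
    """All n*(n-1)/2 pairs: the quadratic baseline that filtering avoids."""
    return set(combinations(sorted(ids), 2))


def token_blocking_pairs(profiles):
    """Blocking: profiles sharing a token land in the same block."""
    blocks = defaultdict(set)
    for pid, text in profiles.items():
        for tok in tokens(text):
            blocks[tok].add(pid)
    candidates = set()
    for members in blocks.values():
        candidates.update(combinations(sorted(members), 2))
    return candidates


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_pairs(profiles, k=1):
    """Nearest neighbors: pair each query profile with its k closest vectors.

    The search here is exhaustive, which keeps the sketch short but is
    itself quadratic; it only shows what the candidate set looks like.
    """
    vecs = {pid: Counter(tokens(text)) for pid, text in profiles.items()}
    pairs = set()
    for q, qv in vecs.items():
        ranked = sorted(
            (p for p in vecs if p != q),
            key=lambda p: (-cosine(qv, vecs[p]), p),
        )
        for p in ranked[:k]:
            pairs.add((min(q, p), max(q, p)))
    return pairs


def recall_precision(candidates, matches):
    """Recall: share of true matches kept. Precision: share of candidates that match."""
    found = len(candidates & matches)
    recall = found / len(matches) if matches else 0.0
    precision = found / len(candidates) if candidates else 0.0
    return recall, precision


all_pairs = brute_force_pairs(profiles)
blocking = token_blocking_pairs(profiles)
knn = knn_pairs(profiles, k=1)

for name, cands in [("brute force", all_pairs), ("blocking", blocking), ("kNN", knn)]:
    r, p = recall_precision(cands, true_matches)
    print(f"{name}: {len(cands)} candidates, recall={r:.2f}, precision={p:.2f}")
```

In this sketch the parameter k and the choice of signature (tokens of the full profile) play the role of the configuration parameters that the abstract describes as hard to fine-tune: changing either one shifts the balance between recall and precision.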

Keywords

    Blocking, Entity resolution, Filtering, Nearest neighbors

Cite this

Open benchmark for filtering techniques in entity resolution. / Neuhof, Franziska; Fisichella, Marco; Papadakis, George et al.
In: VLDB Journal, Vol. 33, No. 5, 09.2024, p. 1671-1696.

Neuhof, F, Fisichella, M, Papadakis, G, Nikoletos, K, Augsten, N, Nejdl, W & Koubarakis, M 2024, 'Open benchmark for filtering techniques in entity resolution', VLDB Journal, vol. 33, no. 5, pp. 1671-1696. https://doi.org/10.1007/s00778-024-00868-7
Neuhof, F., Fisichella, M., Papadakis, G., Nikoletos, K., Augsten, N., Nejdl, W., & Koubarakis, M. (2024). Open benchmark for filtering techniques in entity resolution. VLDB Journal, 33(5), 1671-1696. https://doi.org/10.1007/s00778-024-00868-7
Neuhof F, Fisichella M, Papadakis G, Nikoletos K, Augsten N, Nejdl W et al. Open benchmark for filtering techniques in entity resolution. VLDB Journal. 2024 Sept;33(5):1671-1696. Epub 2024 Jul 9. doi: 10.1007/s00778-024-00868-7
Neuhof, Franziska ; Fisichella, Marco ; Papadakis, George et al. / Open benchmark for filtering techniques in entity resolution. In: VLDB Journal. 2024 ; Vol. 33, No. 5. pp. 1671-1696.
@article{985086de737545ed965fbeeb8d96bd51,
title = "Open benchmark for filtering techniques in entity resolution",
abstract = "Entity Resolution identifies entity profiles that represent the same real-world object. A brute-force approach that considers all pairs of entities suffers from quadratic time complexity. To ameliorate this issue, filtering techniques reduce the search space to highly similar and, thus, highly likely matches. Such techniques come in two forms: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor workflows convert all entity profiles into vectors and detect the ones closest to every query entity. The main techniques of these two types have never been juxtaposed in a systematic way and, thus, their relative performance is unknown. To cover this gap, we perform an extensive experimental study that investigates the relative performance of the main representatives per type over numerous established datasets. Comparing techniques of different types in a fair way is a non-trivial task, because the configuration parameters of each approach have a significant impact on its performance, but are hard to fine-tune. We consider a plethora of parameter configurations per method, optimizing each workflow with respect to recall and precision in both schema-agnostic and schema-aware settings. The experimental results provide novel insights into the effectiveness, the time efficiency, the memory footprint, and the scalability of the considered techniques.",
keywords = "Blocking, Entity resolution, Filtering, Nearest neighbors",
author = "Franziska Neuhof and Marco Fisichella and George Papadakis and Konstantinos Nikoletos and Nikolaus Augsten and Wolfgang Nejdl and Manolis Koubarakis",
note = "Publisher Copyright: {\textcopyright} The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.",
year = "2024",
month = sep,
doi = "10.1007/s00778-024-00868-7",
language = "English",
volume = "33",
pages = "1671--1696",
journal = "VLDB Journal",
issn = "1066-8888",
publisher = "Springer New York",
number = "5",

}

TY - JOUR

T1 - Open benchmark for filtering techniques in entity resolution

AU - Neuhof, Franziska

AU - Fisichella, Marco

AU - Papadakis, George

AU - Nikoletos, Konstantinos

AU - Augsten, Nikolaus

AU - Nejdl, Wolfgang

AU - Koubarakis, Manolis

N1 - Publisher Copyright: © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.

PY - 2024/9

Y1 - 2024/9

AB - Entity Resolution identifies entity profiles that represent the same real-world object. A brute-force approach that considers all pairs of entities suffers from quadratic time complexity. To ameliorate this issue, filtering techniques reduce the search space to highly similar and, thus, highly likely matches. Such techniques come in two forms: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor workflows convert all entity profiles into vectors and detect the ones closest to every query entity. The main techniques of these two types have never been juxtaposed in a systematic way and, thus, their relative performance is unknown. To cover this gap, we perform an extensive experimental study that investigates the relative performance of the main representatives per type over numerous established datasets. Comparing techniques of different types in a fair way is a non-trivial task, because the configuration parameters of each approach have a significant impact on its performance, but are hard to fine-tune. We consider a plethora of parameter configurations per method, optimizing each workflow with respect to recall and precision in both schema-agnostic and schema-aware settings. The experimental results provide novel insights into the effectiveness, the time efficiency, the memory footprint, and the scalability of the considered techniques.

KW - Blocking

KW - Entity resolution

KW - Filtering

KW - Nearest neighbors

UR - http://www.scopus.com/inward/record.url?scp=85198121701&partnerID=8YFLogxK

U2 - 10.1007/s00778-024-00868-7

DO - 10.1007/s00778-024-00868-7

M3 - Article

AN - SCOPUS:85198121701

VL - 33

SP - 1671

EP - 1696

JO - VLDB Journal

JF - VLDB Journal

SN - 1066-8888

IS - 5

ER -
