Open benchmark for filtering techniques in entity resolution

Franziska Neuhof; Marco Fisichella; George Papadakis; Konstantinos Nikoletos; Nikolaus Augsten; Wolfgang Nejdl; Manolis Koubarakis

doi:10.1007/s00778-024-00868-7

Details

Original language	English
Pages (from-to)	1671-1696
Number of pages	26
Journal	VLDB Journal
Volume	33
Issue number	5
Early online date	9 Jul 2024
Publication status	Published - Sept 2024

Abstract

Entity Resolution identifies entity profiles that represent the same real-world object. A brute-force approach that considers all pairs of entities suffers from quadratic time complexity. To ameliorate this issue, filtering techniques reduce the search space to highly similar and, thus, highly likely matches. Such techniques come in two forms: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor workflows convert all entity profiles into vectors and detect the ones closest to every query entity. The main techniques of these two types have never been juxtaposed in a systematic way and, thus, their relative performance is unknown. To cover this gap, we perform an extensive experimental study that investigates the relative performance of the main representatives per type over numerous established datasets. Comparing techniques of different types in a fair way is a non-trivial task, because the configuration parameters of each approach have a significant impact on its performance, but are hard to fine-tune. We consider a plethora of parameter configurations per method, optimizing each workflow with respect to recall and precision in both schema-agnostic and schema-aware settings. The experimental results provide novel insights into the effectiveness, the time efficiency, the memory footprint, and the scalability of the considered techniques.
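
The two forms of filtering named in the abstract can be illustrated with a minimal Python sketch. It is not the benchmark's code: the toy profiles, the ground truth, and every function name are assumptions made for illustration. Token blocking stands in for the blocking family, and a top-k cosine search over binary token vectors stands in for the nearest-neighbor family. Both are scored by recall and precision against the brute-force set of all pairs.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical toy profiles and ground truth (not taken from the paper).
profiles = {
    0: "apple iphone 12 black",
    1: "iphone 12 apple smartphone",
    2: "samsung galaxy s21",
    3: "galaxy s21 samsung phone",
    4: "nokia 3310 classic phone",
}
true_matches = {(0, 1), (2, 3)}


def tokens(text):
    """Schema-agnostic signature: the set of lowercased whitespace tokens."""
    return set(text.lower().split())


def brute_force_pairs(profiles):
    """All n*(n-1)/2 pairs: the quadratic baseline that filtering avoids."""
    return set(combinations(sorted(profiles), 2))


def token_blocking(profiles):
    """Blocking: profiles sharing at least one token become candidate pairs."""
    blocks = defaultdict(set)
    for pid, text in profiles.items():
        for tok in tokens(text):
            blocks[tok].add(pid)
    candidates = set()
    for ids in blocks.values():
        candidates.update(combinations(sorted(ids), 2))
    return candidates


def cosine(a, b):
    """Cosine similarity of two binary token vectors given as sets."""
    return len(a & b) / math.sqrt(len(a) * len(b))


def nearest_neighbor(profiles, k=1):
    """Nearest neighbors: pair every query profile with its k closest profiles."""
    vecs = {pid: tokens(text) for pid, text in profiles.items()}
    candidates = set()
    for q, qv in vecs.items():
        ranked = sorted(
            (p for p in vecs if p != q),
            key=lambda p: (-cosine(qv, vecs[p]), p),  # ties broken by id
        )
        for p in ranked[:k]:
            candidates.add((min(q, p), max(q, p)))
    return candidates


def recall_precision(candidates, truth):
    """Recall: share of true matches retained. Precision: share of candidates that match."""
    hits = len(candidates & truth)
    return hits / len(truth), hits / len(candidates)


for name, cands in [
    ("brute force", brute_force_pairs(profiles)),
    ("token blocking", token_blocking(profiles)),
    ("nearest neighbor (k=1)", nearest_neighbor(profiles, k=1)),
]:
    r, p = recall_precision(cands, true_matches)
    print(f"{name}: {len(cands)} candidates, recall={r:.2f}, precision={p:.2f}")
```

On this toy input both filters keep every true match while discarding most of the ten brute-force pairs. The parameter `k` here is a simple example of the configuration knobs that, according to the abstract, strongly affect each workflow's performance.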

Keywords

Blocking, Entity resolution, Filtering, Nearest neighbors

ASJC Scopus subject areas

Computer Science (all) › Information Systems
Computer Science (all) › Hardware and Architecture

Cite this

Open benchmark for filtering techniques in entity resolution. / Neuhof, Franziska; Fisichella, Marco; Papadakis, George et al.
In: VLDB Journal, Vol. 33, No. 5, 09.2024, p. 1671-1696.

Research output: Contribution to journal › Article › Research › peer review

Neuhof, F, Fisichella, M, Papadakis, G, Nikoletos, K, Augsten, N, Nejdl, W & Koubarakis, M 2024, 'Open benchmark for filtering techniques in entity resolution', VLDB Journal, vol. 33, no. 5, pp. 1671-1696. https://doi.org/10.1007/s00778-024-00868-7

Neuhof, F., Fisichella, M., Papadakis, G., Nikoletos, K., Augsten, N., Nejdl, W., & Koubarakis, M. (2024). Open benchmark for filtering techniques in entity resolution. VLDB Journal, 33(5), 1671-1696. https://doi.org/10.1007/s00778-024-00868-7

Neuhof F, Fisichella M, Papadakis G, Nikoletos K, Augsten N, Nejdl W et al. Open benchmark for filtering techniques in entity resolution. VLDB Journal. 2024 Sept;33(5):1671-1696. Epub 2024 Jul 9. doi: 10.1007/s00778-024-00868-7

Neuhof, Franziska ; Fisichella, Marco ; Papadakis, George et al. / Open benchmark for filtering techniques in entity resolution. In: VLDB Journal. 2024 ; Vol. 33, No. 5. pp. 1671-1696.


@article{985086de737545ed965fbeeb8d96bd51,
  title = "Open benchmark for filtering techniques in entity resolution",
  abstract = "Entity Resolution identifies entity profiles that represent the same real-world object. A brute-force approach that considers all pairs of entities suffers from quadratic time complexity. To ameliorate this issue, filtering techniques reduce the search space to highly similar and, thus, highly likely matches. Such techniques come in two forms: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor workflows convert all entity profiles into vectors and detect the ones closest to every query entity. The main techniques of these two types have never been juxtaposed in a systematic way and, thus, their relative performance is unknown. To cover this gap, we perform an extensive experimental study that investigates the relative performance of the main representatives per type over numerous established datasets. Comparing techniques of different types in a fair way is a non-trivial task, because the configuration parameters of each approach have a significant impact on its performance, but are hard to fine-tune. We consider a plethora of parameter configurations per method, optimizing each workflow with respect to recall and precision in both schema-agnostic and schema-aware settings. The experimental results provide novel insights into the effectiveness, the time efficiency, the memory footprint, and the scalability of the considered techniques.",
  keywords = "Blocking, Entity resolution, Filtering, Nearest neighbors",
  author = "Franziska Neuhof and Marco Fisichella and George Papadakis and Konstantinos Nikoletos and Nikolaus Augsten and Wolfgang Nejdl and Manolis Koubarakis",
  note = "Publisher Copyright: {\textcopyright} The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.",
  year = "2024",
  month = sep,
  doi = "10.1007/s00778-024-00868-7",
  language = "English",
  volume = "33",
  pages = "1671--1696",
  journal = "VLDB Journal",
  issn = "1066-8888",
  publisher = "Springer New York",
  number = "5",
}


TY  - JOUR
T1  - Open benchmark for filtering techniques in entity resolution
AU  - Neuhof, Franziska
AU  - Fisichella, Marco
AU  - Papadakis, George
AU  - Nikoletos, Konstantinos
AU  - Augsten, Nikolaus
AU  - Nejdl, Wolfgang
AU  - Koubarakis, Manolis
N1  - Publisher Copyright: © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.
PY  - 2024/9
Y1  - 2024/9
N2  - Entity Resolution identifies entity profiles that represent the same real-world object. A brute-force approach that considers all pairs of entities suffers from quadratic time complexity. To ameliorate this issue, filtering techniques reduce the search space to highly similar and, thus, highly likely matches. Such techniques come in two forms: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor workflows convert all entity profiles into vectors and detect the ones closest to every query entity. The main techniques of these two types have never been juxtaposed in a systematic way and, thus, their relative performance is unknown. To cover this gap, we perform an extensive experimental study that investigates the relative performance of the main representatives per type over numerous established datasets. Comparing techniques of different types in a fair way is a non-trivial task, because the configuration parameters of each approach have a significant impact on its performance, but are hard to fine-tune. We consider a plethora of parameter configurations per method, optimizing each workflow with respect to recall and precision in both schema-agnostic and schema-aware settings. The experimental results provide novel insights into the effectiveness, the time efficiency, the memory footprint, and the scalability of the considered techniques.
AB  - Entity Resolution identifies entity profiles that represent the same real-world object. A brute-force approach that considers all pairs of entities suffers from quadratic time complexity. To ameliorate this issue, filtering techniques reduce the search space to highly similar and, thus, highly likely matches. Such techniques come in two forms: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor workflows convert all entity profiles into vectors and detect the ones closest to every query entity. The main techniques of these two types have never been juxtaposed in a systematic way and, thus, their relative performance is unknown. To cover this gap, we perform an extensive experimental study that investigates the relative performance of the main representatives per type over numerous established datasets. Comparing techniques of different types in a fair way is a non-trivial task, because the configuration parameters of each approach have a significant impact on its performance, but are hard to fine-tune. We consider a plethora of parameter configurations per method, optimizing each workflow with respect to recall and precision in both schema-agnostic and schema-aware settings. The experimental results provide novel insights into the effectiveness, the time efficiency, the memory footprint, and the scalability of the considered techniques.
KW  - Blocking
KW  - Entity resolution
KW  - Filtering
KW  - Nearest neighbors
UR  - http://www.scopus.com/inward/record.url?scp=85198121701&partnerID=8YFLogxK
U2  - 10.1007/s00778-024-00868-7
DO  - 10.1007/s00778-024-00868-7
M3  - Article
AN  - SCOPUS:85198121701
VL  - 33
SP  - 1671
EP  - 1696
JO  - VLDB Journal
JF  - VLDB Journal
SN  - 1066-8888
IS  - 5
ER  - 

Research@Leibniz University


By the same author(s)

Harnessing Empathy and Ethics for Relevance Detection and Information Categorization in Climate and COVID-19 Tweets

A Trustworthy Approach to Classify and Analyze Epidemic-Related Information From Microblogs

LaMMOn: language model combined graph neural network for multi-target multi-camera tracking in online scenarios

Adaptive Dispatching of Mobile Charging Stations using Multi-Agent Graph Convolutional Cooperative-Competitive Reinforcement Learning

Robust Fusion of Time Series and Image Data for Improved Multimodal Clinical Prediction
