Eliminating the Redundancy in Blocking-based Entity Resolution Methods

George Papadakis; Ekaterini Ioannou; Claudia Niederée; Themis Palpanas; Wolfgang Nejdl

doi:10.1145/1998076.1998093

Details

Original language	English
Title of host publication	JCDL'11 - Proceedings of the 2011 ACM/IEEE Joint Conference on Digital Libraries
Pages	85-94
Number of pages	10
Publication status	Published - 13 June 2011
Event	11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL'11 - Ottawa, ON, Canada. Duration: 13 June 2011 → 17 June 2011

Publication series

Name	Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
ISSN (Print)	1552-5996

Abstract

Entity resolution is the task of identifying entities that refer to the same real-world object. It has important applications in the context of digital libraries, such as citation matching and author disambiguation. Blocking is an established methodology for efficiently addressing this problem; it clusters similar entities together, and compares solely entities inside each cluster. In order to effectively deal with the current large, noisy and heterogeneous data collections, novel blocking methods that rely on redundancy have been introduced: they associate each entity with multiple blocks in order to increase recall, thus increasing the computational cost, as well. In this paper, we introduce novel techniques that remove the superfluous comparisons from any redundancy-based blocking method. They improve the time-efficiency of the latter without any impact on the end result. We present the optimal solution to this problem that discards all redundant comparisons at the cost of quadratic space complexity. For applications with space limitations, we also present an alternative, lightweight solution that operates at the abstract level of blocks in order to discard a significant part of the redundant comparisons. We evaluate our techniques on two large, real-world data sets and verify the significant improvements they convey when integrated into existing blocking methods.
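To make the notion of a redundant comparison concrete, the following is a minimal sketch, not the algorithm from the paper. It assumes blocks are plain lists of entity identifiers (the names `distinct_comparisons` and `count_all_comparisons` and the sample identifiers are hypothetical) and remembers every pair already compared in a set. In the worst case that set grows quadratically with the number of entities, which is the kind of space cost the abstract mentions.

```python
from itertools import combinations


def count_all_comparisons(blocks):
    """Count pairwise comparisons when every block is processed independently."""
    total = 0
    for block in blocks:
        n = len(set(block))
        total += n * (n - 1) // 2
    return total


def distinct_comparisons(blocks):
    """Yield each unordered entity pair once, even if it co-occurs in several blocks."""
    seen = set()  # pairs already emitted; can grow quadratically in the number of entities
    for block in blocks:
        # Sorting gives every pair a canonical (smaller, larger) form.
        for pair in combinations(sorted(set(block)), 2):
            if pair not in seen:
                seen.add(pair)
                yield pair


# Hypothetical overlapping blocks: e2/e3 and e1/e3 co-occur in more than one block.
blocks = [["e1", "e2", "e3"], ["e2", "e3", "e4"], ["e1", "e3"]]
pairs = list(distinct_comparisons(blocks))
print(count_all_comparisons(blocks), len(pairs))
```

Because only repeated pairs are skipped, the set of entity pairs that gets compared is unchanged, which matches the abstract's claim that the end result is not affected.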

ASJC Scopus subject areas

Engineering (all)
General Mechanical Engineering

Cite this

Eliminating the Redundancy in Blocking-based Entity Resolution Methods. / Papadakis, George; Ioannou, Ekaterini; Niederée, Claudia et al.
JCDL'11 - Proceedings of the 2011 ACM/IEEE Joint Conference on Digital Libraries. 2011. pp. 85-94 (Proceedings of the ACM/IEEE Joint Conference on Digital Libraries).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › Peer review

Papadakis, G, Ioannou, E, Niederée, C, Palpanas, T & Nejdl, W 2011, Eliminating the Redundancy in Blocking-based Entity Resolution Methods. in JCDL'11 - Proceedings of the 2011 ACM/IEEE Joint Conference on Digital Libraries. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, pp. 85-94, 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL'11, Ottawa, ON, Canada, 13 June 2011. https://doi.org/10.1145/1998076.1998093

Papadakis, G., Ioannou, E., Niederée, C., Palpanas, T., & Nejdl, W. (2011). Eliminating the Redundancy in Blocking-based Entity Resolution Methods. In JCDL'11 - Proceedings of the 2011 ACM/IEEE Joint Conference on Digital Libraries (pp. 85-94). (Proceedings of the ACM/IEEE Joint Conference on Digital Libraries). https://doi.org/10.1145/1998076.1998093

Papadakis G, Ioannou E, Niederée C, Palpanas T, Nejdl W. Eliminating the Redundancy in Blocking-based Entity Resolution Methods. In JCDL'11 - Proceedings of the 2011 ACM/IEEE Joint Conference on Digital Libraries. 2011. pp. 85-94. (Proceedings of the ACM/IEEE Joint Conference on Digital Libraries). doi: 10.1145/1998076.1998093

Papadakis, George; Ioannou, Ekaterini; Niederée, Claudia et al. / Eliminating the Redundancy in Blocking-based Entity Resolution Methods. JCDL'11 - Proceedings of the 2011 ACM/IEEE Joint Conference on Digital Libraries. 2011. pp. 85-94 (Proceedings of the ACM/IEEE Joint Conference on Digital Libraries).


@inproceedings{a57b5477e56c4549880d87408f0160fc,

title = "Eliminating the Redundancy in Blocking-based Entity Resolution Methods",

abstract = "Entity resolution is the task of identifying entities that refer to the same real-world object. It has important applications in the context of digital libraries, such as citation matching and author disambiguation. Blocking is an established methodology for efficiently addressing this problem; it clusters similar entities together, and compares solely entities inside each cluster. In order to effectively deal with the current large, noisy and heterogeneous data collections, novel blocking methods that rely on redundancy have been introduced: they associate each entity with multiple blocks in order to increase recall, thus increasing the computational cost, as well. In this paper, we introduce novel techniques that remove the superfluous comparisons from any redundancy-based blocking method. They improve the time-efficiency of the latter without any impact on the end result. We present the optimal solution to this problem that discards all redundant comparisons at the cost of quadratic space complexity. For applications with space limitations, we also present an alternative, lightweight solution that operates at the abstract level of blocks in order to discard a significant part of the redundant comparisons. We evaluate our techniques on two large, real-world data sets and verify the significant improvements they convey when integrated into existing blocking methods.",

keywords = "data cleaning, entity resolution, redundancy-based blocking",

author = "George Papadakis and Ekaterini Ioannou and Claudia Nieder{\'e}e and Themis Palpanas and Wolfgang Nejdl",

year = "2011",

month = jun,

day = "13",

doi = "10.1145/1998076.1998093",

language = "English",

isbn = "9781450307444",

series = "Proceedings of the ACM/IEEE Joint Conference on Digital Libraries",

pages = "85--94",

booktitle = "JCDL'11 - Proceedings of the 2011 ACM/IEEE Joint Conference on Digital Libraries",

note = "11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL'11 ; Conference date: 13-06-2011 Through 17-06-2011",

}


TY - GEN

T1 - Eliminating the Redundancy in Blocking-based Entity Resolution Methods

AU - Papadakis, George

AU - Ioannou, Ekaterini

AU - Niederée, Claudia

AU - Palpanas, Themis

AU - Nejdl, Wolfgang

PY - 2011/6/13

Y1 - 2011/6/13

N2 - Entity resolution is the task of identifying entities that refer to the same real-world object. It has important applications in the context of digital libraries, such as citation matching and author disambiguation. Blocking is an established methodology for efficiently addressing this problem; it clusters similar entities together, and compares solely entities inside each cluster. In order to effectively deal with the current large, noisy and heterogeneous data collections, novel blocking methods that rely on redundancy have been introduced: they associate each entity with multiple blocks in order to increase recall, thus increasing the computational cost, as well. In this paper, we introduce novel techniques that remove the superfluous comparisons from any redundancy-based blocking method. They improve the time-efficiency of the latter without any impact on the end result. We present the optimal solution to this problem that discards all redundant comparisons at the cost of quadratic space complexity. For applications with space limitations, we also present an alternative, lightweight solution that operates at the abstract level of blocks in order to discard a significant part of the redundant comparisons. We evaluate our techniques on two large, real-world data sets and verify the significant improvements they convey when integrated into existing blocking methods.

AB - Entity resolution is the task of identifying entities that refer to the same real-world object. It has important applications in the context of digital libraries, such as citation matching and author disambiguation. Blocking is an established methodology for efficiently addressing this problem; it clusters similar entities together, and compares solely entities inside each cluster. In order to effectively deal with the current large, noisy and heterogeneous data collections, novel blocking methods that rely on redundancy have been introduced: they associate each entity with multiple blocks in order to increase recall, thus increasing the computational cost, as well. In this paper, we introduce novel techniques that remove the superfluous comparisons from any redundancy-based blocking method. They improve the time-efficiency of the latter without any impact on the end result. We present the optimal solution to this problem that discards all redundant comparisons at the cost of quadratic space complexity. For applications with space limitations, we also present an alternative, lightweight solution that operates at the abstract level of blocks in order to discard a significant part of the redundant comparisons. We evaluate our techniques on two large, real-world data sets and verify the significant improvements they convey when integrated into existing blocking methods.

KW - data cleaning

KW - entity resolution

KW - redundancy-based blocking

UR - http://www.scopus.com/inward/record.url?scp=79960519872&partnerID=8YFLogxK

U2 - 10.1145/1998076.1998093

DO - 10.1145/1998076.1998093

M3 - Conference contribution

AN - SCOPUS:79960519872

SN - 9781450307444

T3 - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries

SP - 85

EP - 94

BT - JCDL'11 - Proceedings of the 2011 ACM/IEEE Joint Conference on Digital Libraries

T2 - 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL'11

Y2 - 13 June 2011 through 17 June 2011

ER -
