To Compare or Not to Compare: Making Entity Resolution more Efficient

George Papadakis; Ekaterini Ioannou; Claudia Niederée; Themis Palpanas; Wolfgang Nejdl

doi:10.1145/1999299.1999302

Details

Original language	English
Title of host publication	Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011
Publication status	Published - 12 June 2011
Event	3rd International Workshop on Semantic Web Information Management, SWIM 2011 - Athens, Greece. Duration: 12 June 2011 → 16 June 2011

Publication series

Name	Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011

Abstract

Blocking methods are crucial for making the inherently quadratic task of Entity Resolution more efficient. The blocking methods proposed in the literature rely on the homogeneity of data and the availability of binding schema information; thus, they are inapplicable to the voluminous, noisy, and highly heterogeneous user-generated content of Web 2.0. To deal with such data, attribute-agnostic blocking has recently been introduced, following a two-fold strategy: the first layer places entities into overlapping blocks in order to achieve high effectiveness, while the second layer reduces the number of unnecessary comparisons in order to enhance efficiency. In this paper, we present a set of techniques that can be plugged into the second strategy layer of attribute-agnostic blocking to further improve its efficiency. We introduce a technique that eliminates redundant comparisons and, building on it, we incorporate an approximate method for pruning comparisons that are highly likely to involve non-matching entities. We also introduce a novel measure for quantifying the redundancy a blocking method entails and explain how it can be used to tune comparison pruning a priori. We apply our blocking techniques to two large, real-world data sets and report results that demonstrate a substantial increase in efficiency at a negligible (if any) cost in effectiveness.
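
To make the notion of a redundant comparison concrete, the following is a minimal sketch and not the authors' algorithm. It assumes the simplest form of attribute-agnostic blocking (one block per token, ignoring attribute names) and removes repeated pairs with a plain "seen" set. The function names `token_blocks`, `all_comparisons`, and `distinct_comparisons`, as well as the sample entities, are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from typing import Dict, Iterator, List, Set, Tuple

Pair = Tuple[str, str]


def token_blocks(entities: Dict[str, str]) -> Dict[str, List[str]]:
    """Assumed blocking scheme: one block per lower-cased token."""
    blocks: Dict[str, Set[str]] = {}
    for entity_id, text in entities.items():
        for token in set(text.lower().split()):
            blocks.setdefault(token, set()).add(entity_id)
    # A block with a single entity yields no comparisons, so drop it.
    # Sorting the ids makes every pair appear in one canonical order.
    return {t: sorted(ids) for t, ids in blocks.items() if len(ids) > 1}


def all_comparisons(blocks: Dict[str, List[str]]) -> Iterator[Pair]:
    """Every pairwise comparison, block by block (repeats included)."""
    for ids in blocks.values():
        yield from combinations(ids, 2)


def distinct_comparisons(blocks: Dict[str, List[str]]) -> Iterator[Pair]:
    """Yield each pair once, skipping pairs already seen in an earlier block."""
    seen: Set[Pair] = set()
    for pair in all_comparisons(blocks):
        if pair not in seen:
            seen.add(pair)
            yield pair


# Hypothetical sample data: e1 and e2 share three tokens, e1 and e3 share one.
entities = {
    "e1": "Apple iPhone 4 black",
    "e2": "iphone 4 Apple 16GB",
    "e3": "Samsung Galaxy black",
}
blocks = token_blocks(entities)
total = len(list(all_comparisons(blocks)))          # 4 comparisons with repeats
distinct = sorted(distinct_comparisons(blocks))     # 2 comparisons without
```

Here the pair (e1, e2) co-occurs in three blocks, so two of the four comparisons are redundant. Keeping every executed pair in memory is the simplest way to detect repeats, but it does not scale to large collections; it is used here only to illustrate the effect.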

ASJC Scopus subject areas

Computer Science (all)
Computer Networks and Communications
Decision Sciences (all)
Information Systems and Management

Cite this

To Compare or Not to Compare: Making Entity Resolution more Efficient. / Papadakis, George; Ioannou, Ekaterini; Niederée, Claudia et al.
Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011. 2011. (Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011).

Publication: Chapter in book/report/conference proceedings › Conference contribution › Research › Peer-reviewed

Papadakis, G, Ioannou, E, Niederée, C, Palpanas, T & Nejdl, W 2011, To Compare or Not to Compare: Making Entity Resolution more Efficient. in Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011. Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011, 3rd International Workshop on Semantic Web Information Management, SWIM 2011, Athens, Greece, 12 June 2011. https://doi.org/10.1145/1999299.1999302

Papadakis, G., Ioannou, E., Niederée, C., Palpanas, T., & Nejdl, W. (2011). To Compare or Not to Compare: Making Entity Resolution more Efficient. In Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011 (Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011). https://doi.org/10.1145/1999299.1999302

Papadakis G, Ioannou E, Niederée C, Palpanas T, Nejdl W. To Compare or Not to Compare: Making Entity Resolution more Efficient. in Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011. 2011. (Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011). doi: 10.1145/1999299.1999302

Papadakis, George ; Ioannou, Ekaterini ; Niederée, Claudia et al. / To Compare or Not to Compare : Making Entity Resolution more Efficient. Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011. 2011. (Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011).

@inproceedings{ec3583dbd7c14f499a42cf47253637e6,

title = "To Compare or Not to Compare: Making Entity Resolution more Efficient",

abstract = "Blocking methods are crucial for making the inherently quadratic task of Entity Resolution more efficient. The blocking methods proposed in the literature rely on the homogeneity of data and the availability of binding schema information; thus, they are inapplicable to the voluminous, noisy, and highly heterogeneous data of the Web 2.0 user-generated content. To deal with such data, attribute-agnostic blocking has been recently introduced, following a two-fold strategy: the first layer places entities into overlapping blocks in order to achieve high effectiveness, while the second layer reduces the number of unnecessary comparisons in order to enhance efficiency. In this paper, we present a set of techniques that can be plugged into the second strategy layer of attribute-agnostic blocking to further improve its efficiency. We introduce a technique that eliminates redundant comparisons, and, based on this, we incorporate an approximate method for pruning comparisons that are highly likely to involve non-matching entities. We also introduce a novel measure for quantifying the redundancy a blocking method entails and explain how it can be used to a-priori tune the process of comparisons pruning. We apply our blocking techniques on two large, real-world data sets and report results that demonstrate a substantial increase in efficiency at a negligible (if any) cost in effectiveness.",

keywords = "attribute-agnostic blocking, data cleaning, entity resolution",

author = "George Papadakis and Ekaterini Ioannou and Claudia Nieder{\'e}e and Themis Palpanas and Wolfgang Nejdl",

year = "2011",

month = jun,

day = "12",

doi = "10.1145/1999299.1999302",

language = "English",

isbn = "9781450306515",

series = "Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011",

booktitle = "Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011",

note = "3rd International Workshop on Semantic Web Information Management, SWIM 2011 ; Conference date: 12-06-2011 Through 16-06-2011",

}

TY - GEN

T1 - To Compare or Not to Compare

T2 - 3rd International Workshop on Semantic Web Information Management, SWIM 2011

AU - Papadakis, George

AU - Ioannou, Ekaterini

AU - Niederée, Claudia

AU - Palpanas, Themis

AU - Nejdl, Wolfgang

PY - 2011/6/12

Y1 - 2011/6/12

AB - Blocking methods are crucial for making the inherently quadratic task of Entity Resolution more efficient. The blocking methods proposed in the literature rely on the homogeneity of data and the availability of binding schema information; thus, they are inapplicable to the voluminous, noisy, and highly heterogeneous data of the Web 2.0 user-generated content. To deal with such data, attribute-agnostic blocking has been recently introduced, following a two-fold strategy: the first layer places entities into overlapping blocks in order to achieve high effectiveness, while the second layer reduces the number of unnecessary comparisons in order to enhance efficiency. In this paper, we present a set of techniques that can be plugged into the second strategy layer of attribute-agnostic blocking to further improve its efficiency. We introduce a technique that eliminates redundant comparisons, and, based on this, we incorporate an approximate method for pruning comparisons that are highly likely to involve non-matching entities. We also introduce a novel measure for quantifying the redundancy a blocking method entails and explain how it can be used to a-priori tune the process of comparisons pruning. We apply our blocking techniques on two large, real-world data sets and report results that demonstrate a substantial increase in efficiency at a negligible (if any) cost in effectiveness.

KW - attribute-agnostic blocking

KW - data cleaning

KW - entity resolution

UR - http://www.scopus.com/inward/record.url?scp=79960675264&partnerID=8YFLogxK

U2 - 10.1145/1999299.1999302

DO - 10.1145/1999299.1999302

M3 - Conference contribution

SN - 9781450306515

T3 - Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011

BT - Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011

Y2 - 12 June 2011 through 16 June 2011

ER -
