Meta-Blocking: Taking Entity Resolution to the Next Level

George Papadakis; Georgia Koutrika; Themis Palpanas; Wolfgang Nejdl

doi:10.1109/tkde.2013.54

Details

Original language	English
Article number	6487505
Pages (from - to)	1946-1960
Number of pages	15
Journal	IEEE Transactions on Knowledge and Data Engineering
Volume	26
Issue number	8
Publication status	Published - Aug 2014

Abstract

Entity Resolution is an inherently quadratic task that typically scales to large data collections through blocking. In the context of highly heterogeneous information spaces, blocking methods rely on redundancy in order to ensure high effectiveness at the cost of lower efficiency (i.e., more comparisons). This effect is partially ameliorated by coarse-grained block processing techniques that discard entire blocks either a-priori or during the resolution process. In this paper, we introduce meta-blocking as a generic procedure that intervenes between the creation and the processing of blocks, transforming an initial set of blocks into a new one with substantially fewer comparisons and equally high effectiveness. In essence, meta-blocking aims at extracting the most similar pairs of entities by leveraging the information that is encapsulated in the block-to-entity relationships. To this end, it first builds an abstract graph representation of the original set of blocks, with the nodes corresponding to entity profiles and the edges connecting the co-occurring ones. During the creation of this structure all redundant comparisons are discarded, while the superfluous ones can be removed by pruning of the edges with the lowest weight. We analytically examine both procedures, proposing a multitude of edge weighting schemes, graph pruning algorithms as well as pruning criteria. Our approaches are schema-agnostic, thus accommodating any type of blocks. We evaluate their performance through a thorough experimental study over three large-scale, real-world data sets, with the outcomes verifying significant efficiency enhancements at a negligible cost in effectiveness.
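The graph construction and pruning steps described in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions that are not taken from this record: the edge weight is simply the number of blocks two entities share, and pruning keeps the edges whose weight is at least the mean edge weight. The function names `build_blocking_graph` and `prune_edges`, the block identifiers, and the entity identifiers are hypothetical, and the sketch is not meant to reproduce the paper's specific weighting schemes or pruning algorithms.

```python
from collections import defaultdict
from itertools import combinations


def build_blocking_graph(blocks):
    """Return {(a, b): weight}, where weight counts the blocks a and b share.

    Assumption: a plain co-occurrence count stands in for an edge weight.
    """
    weights = defaultdict(int)
    for members in blocks.values():
        # Each unordered pair is a single dictionary key, so a pair that
        # co-occurs in several blocks becomes one edge, not several comparisons.
        for a, b in combinations(sorted(set(members)), 2):
            weights[(a, b)] += 1
    return dict(weights)


def prune_edges(weights):
    """Keep only the edges whose weight is at least the mean edge weight.

    Assumption: the mean is used as an illustrative pruning threshold.
    """
    if not weights:
        return {}
    threshold = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= threshold}


# Hypothetical toy input: four overlapping blocks over four entity profiles.
blocks = {
    "b1": ["e1", "e2", "e3"],
    "b2": ["e1", "e2"],
    "b3": ["e2", "e3", "e4"],
    "b4": ["e1", "e2", "e4"],
}

graph = build_blocking_graph(blocks)
kept = prune_edges(graph)

for pair, weight in sorted(kept.items()):
    print(pair, weight)
```

On this toy input the four blocks imply ten pairwise comparisons, the graph holds six distinct edges, and pruning at the mean weight leaves three of them.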

ASJC Scopus subject areas

Computer Science (all)
Information Systems
Computer Science (all)
Computer Science Applications
Computer Science (all)
Computational Theory and Mathematics

Cite

Meta-Blocking: Taking Entity Resolution to the Next Level. / Papadakis, George; Koutrika, Georgia; Palpanas, Themis et al.
In: IEEE Transactions on Knowledge and Data Engineering, Vol. 26, No. 8, 6487505, 08.2014, pp. 1946-1960.

Research output: Contribution to journal › Article › Research › Peer-reviewed

Papadakis, G, Koutrika, G, Palpanas, T & Nejdl, W 2014, 'Meta-Blocking: Taking Entity Resolution to the Next Level', IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 8, 6487505, pp. 1946-1960. https://doi.org/10.1109/tkde.2013.54

Papadakis, G., Koutrika, G., Palpanas, T., & Nejdl, W. (2014). Meta-Blocking: Taking Entity Resolution to the Next Level. IEEE Transactions on Knowledge and Data Engineering, 26(8), 1946-1960. Article 6487505. https://doi.org/10.1109/tkde.2013.54

Papadakis G, Koutrika G, Palpanas T, Nejdl W. Meta-Blocking: Taking Entity Resolution to the Next Level. IEEE Transactions on Knowledge and Data Engineering. 2014 Aug;26(8):1946-1960. 6487505. doi: 10.1109/tkde.2013.54

Papadakis, George ; Koutrika, Georgia ; Palpanas, Themis et al. / Meta-Blocking: Taking Entity Resolution to the Next Level. In: IEEE Transactions on Knowledge and Data Engineering. 2014 ; Vol. 26, No. 8. pp. 1946-1960.

@article{a2408ba68dd248cda402541feba0b634,

title = "Meta-Blocking: Taking Entity Resolution to the Next Level",

abstract = "Entity Resolution is an inherently quadratic task that typically scales to large data collections through blocking. In the context of highly heterogeneous information spaces, blocking methods rely on redundancy in order to ensure high effectiveness at the cost of lower efficiency (i.e., more comparisons). This effect is partially ameliorated by coarse-grained block processing techniques that discard entire blocks either a-priori or during the resolution process. In this paper, we introduce meta-blocking as a generic procedure that intervenes between the creation and the processing of blocks, transforming an initial set of blocks into a new one with substantially fewer comparisons and equally high effectiveness. In essence, meta-blocking aims at extracting the most similar pairs of entities by leveraging the information that is encapsulated in the block-to-entity relationships. To this end, it first builds an abstract graph representation of the original set of blocks, with the nodes corresponding to entity profiles and the edges connecting the co-occurring ones. During the creation of this structure all redundant comparisons are discarded, while the superfluous ones can be removed by pruning of the edges with the lowest weight. We analytically examine both procedures, proposing a multitude of edge weighting schemes, graph pruning algorithms as well as pruning criteria. Our approaches are schema-agnostic, thus accommodating any type of blocks. We evaluate their performance through a thorough experimental study over three large-scale, real-world data sets, with the outcomes verifying significant efficiency enhancements at a negligible cost in effectiveness.",

keywords = "Entity resolution, meta-blocking, redundancy-positive blocking",

author = "George Papadakis and Georgia Koutrika and Themis Palpanas and Wolfgang Nejdl",

year = "2014",

month = aug,

doi = "10.1109/tkde.2013.54",

language = "English",

volume = "26",

pages = "1946--1960",

journal = "IEEE Transactions on Knowledge and Data Engineering",

issn = "1041-4347",

publisher = "IEEE Computer Society",

number = "8",

}

TY - JOUR

T1 - Meta-Blocking: Taking Entity Resolution to the Next Level

AU - Papadakis, George

AU - Koutrika, Georgia

AU - Palpanas, Themis

AU - Nejdl, Wolfgang

PY - 2014/8

Y1 - 2014/8

N2 - Entity Resolution is an inherently quadratic task that typically scales to large data collections through blocking. In the context of highly heterogeneous information spaces, blocking methods rely on redundancy in order to ensure high effectiveness at the cost of lower efficiency (i.e., more comparisons). This effect is partially ameliorated by coarse-grained block processing techniques that discard entire blocks either a-priori or during the resolution process. In this paper, we introduce meta-blocking as a generic procedure that intervenes between the creation and the processing of blocks, transforming an initial set of blocks into a new one with substantially fewer comparisons and equally high effectiveness. In essence, meta-blocking aims at extracting the most similar pairs of entities by leveraging the information that is encapsulated in the block-to-entity relationships. To this end, it first builds an abstract graph representation of the original set of blocks, with the nodes corresponding to entity profiles and the edges connecting the co-occurring ones. During the creation of this structure all redundant comparisons are discarded, while the superfluous ones can be removed by pruning of the edges with the lowest weight. We analytically examine both procedures, proposing a multitude of edge weighting schemes, graph pruning algorithms as well as pruning criteria. Our approaches are schema-agnostic, thus accommodating any type of blocks. We evaluate their performance through a thorough experimental study over three large-scale, real-world data sets, with the outcomes verifying significant efficiency enhancements at a negligible cost in effectiveness.

KW - Entity resolution

KW - meta-blocking

KW - redundancy-positive blocking

UR - http://www.scopus.com/inward/record.url?scp=84904650785&partnerID=8YFLogxK

U2 - 10.1109/tkde.2013.54

DO - 10.1109/tkde.2013.54

M3 - Article

AN - SCOPUS:84904650785

VL - 26

SP - 1946

EP - 1960

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

SN - 1041-4347

IS - 8

M1 - 6487505

ER -

Research@Leibniz University

By the same authors

Harnessing Empathy and Ethics for Relevance Detection and Information Categorization in Climate and COVID-19 Tweets

Open benchmark for filtering techniques in entity resolution

Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Adaptive Dispatching of Mobile Charging Stations using Multi-Agent Graph Convolutional Cooperative-Competitive Reinforcement Learning

Robust Fusion of Time Series and Image Data for Improved Multimodal Clinical Prediction
