A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces

Research output: Contribution to journal › Article › Research › Peer-reviewed

Authors

  • George Papadakis
  • Ekaterini Ioannou
  • Themis Palpanas
  • Claudia Niederee
  • Wolfgang Nejdl

Organisational units

External organisations

  • Technical University of Crete
  • Università degli Studi di Trento

Details

Original language: English
Article number: 6255742
Pages (from - to): 2665-2682
Number of pages: 18
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 25
Issue number: 12
Publication status: Published - 31 July 2012

Abstract

In the context of entity resolution (ER) in highly heterogeneous, noisy, user-generated entity collections, practically all block building methods employ redundancy to achieve high effectiveness. This practice, however, results in a high number of pairwise comparisons, with a negative impact on efficiency. Existing block processing strategies aim at discarding unnecessary comparisons at no cost in effectiveness. In this paper, we systemize blocking methods for clean-clean ER (an inherently quadratic task) over highly heterogeneous information spaces (HHIS) through a novel framework that consists of two orthogonal layers: the effectiveness layer encompasses methods for building overlapping blocks with small likelihood of missed matches; the efficiency layer comprises a rich variety of techniques that significantly restrict the required number of pairwise comparisons, having a controllable impact on the number of detected duplicates. We map to our framework all relevant existing methods for creating and processing blocks in the context of HHIS, and additionally propose two novel techniques: attribute clustering blocking and comparison scheduling. We evaluate the performance of each layer and method on two large-scale, real-world data sets and validate the excellent balance between efficiency and effectiveness that they achieve.
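The trade-off described in the abstract can be made concrete with a small sketch. The code below is not the paper's implementation. It is a generic, schema-agnostic token blocking routine for clean-clean ER on toy data, with hypothetical helper names (`tokens`, `build_blocks`, `comparisons`) and invented entity profiles. It shows how redundant, overlapping blocks keep matching profiles together despite differing attribute names, and how the same pair then recurs across blocks, which is the overhead that block processing tries to remove.

```python
from collections import defaultdict
from itertools import product


def tokens(profile):
    """Lower-cased tokens of all attribute values; attribute names are ignored."""
    return {tok for value in profile.values() for tok in str(value).lower().split()}


def build_blocks(left, right):
    """One block per token; keep only blocks with entities from both collections."""
    index = defaultdict(lambda: (set(), set()))
    for side, collection in enumerate((left, right)):
        for entity_id, profile in collection.items():
            for tok in tokens(profile):
                index[tok][side].add(entity_id)
    return {tok: sides for tok, sides in index.items() if sides[0] and sides[1]}


def comparisons(blocks):
    """Total comparisons over all blocks, and the distinct pairs among them."""
    total = sum(len(l) * len(r) for l, r in blocks.values())
    distinct = {pair for l, r in blocks.values() for pair in product(l, r)}
    return total, distinct


# Toy collections (assumed data): each is duplicate-free, schemas differ.
left = {
    "a1": {"name": "Jane Smith", "city": "Hannover"},
    "a2": {"fullName": "Carlos Ruiz", "location": "Trento"},
}
right = {
    "b1": {"label": "smith jane", "based_in": "hannover"},
    "b2": {"title": "ruiz carlos", "town": "Chania"},
}

blocks = build_blocks(left, right)
total, distinct = comparisons(blocks)
brute_force = len(left) * len(right)
print(f"blocks={len(blocks)} total={total} distinct={len(distinct)} brute_force={brute_force}")
```

On this toy input the overlapping blocks contain more comparisons in total than the brute-force cross product, yet only two distinct pairs. Keeping each distinct pair once is the simplest form of discarding comparisons at no cost in effectiveness.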


Cite this

A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces. / Papadakis, George; Ioannou, Ekaterini; Palpanas, Themis et al.
In: IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 12, 6255742, 31.07.2012, p. 2665-2682.


Papadakis G, Ioannou E, Palpanas T, Niederee C, Nejdl W. A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces. IEEE Transactions on Knowledge and Data Engineering. 2012 Jul 31;25(12):2665-2682. 6255742. doi: 10.1109/TKDE.2012.150
Papadakis, George ; Ioannou, Ekaterini ; Palpanas, Themis et al. / A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces. In: IEEE Transactions on Knowledge and Data Engineering. 2012 ; Vol. 25, No. 12. p. 2665-2682.
@article{31fa00735cbe4676afd0849b4207ae21,
title = "A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces",
abstract = "In the context of entity resolution (ER) in highly heterogeneous, noisy, user-generated entity collections, practically all block building methods employ redundancy to achieve high effectiveness. This practice, however, results in a high number of pairwise comparisons, with a negative impact on efficiency. Existing block processing strategies aim at discarding unnecessary comparisons at no cost in effectiveness. In this paper, we systemize blocking methods for clean-clean ER (an inherently quadratic task) over highly heterogeneous information spaces (HHIS) through a novel framework that consists of two orthogonal layers: the effectiveness layer encompasses methods for building overlapping blocks with small likelihood of missed matches; the efficiency layer comprises a rich variety of techniques that significantly restrict the required number of pairwise comparisons, having a controllable impact on the number of detected duplicates. We map to our framework all relevant existing methods for creating and processing blocks in the context of HHIS, and additionally propose two novel techniques: attribute clustering blocking and comparison scheduling. We evaluate the performance of each layer and method on two large-scale, real-world data sets and validate the excellent balance between efficiency and effectiveness that they achieve.",
keywords = "blocking methods, entity resolution, Information integration",
author = "George Papadakis and Ekaterini Ioannou and Themis Palpanas and Claudia Niederee and Wolfgang Nejdl",
year = "2012",
month = jul,
day = "31",
doi = "10.1109/TKDE.2012.150",
language = "English",
volume = "25",
pages = "2665--2682",
journal = "IEEE Transactions on Knowledge and Data Engineering",
issn = "1041-4347",
publisher = "IEEE Computer Society",
number = "12",

}


TY - JOUR
T1 - A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces
AU - Papadakis, George
AU - Ioannou, Ekaterini
AU - Palpanas, Themis
AU - Niederee, Claudia
AU - Nejdl, Wolfgang
PY - 2012/7/31
Y1 - 2012/7/31
AB - In the context of entity resolution (ER) in highly heterogeneous, noisy, user-generated entity collections, practically all block building methods employ redundancy to achieve high effectiveness. This practice, however, results in a high number of pairwise comparisons, with a negative impact on efficiency. Existing block processing strategies aim at discarding unnecessary comparisons at no cost in effectiveness. In this paper, we systemize blocking methods for clean-clean ER (an inherently quadratic task) over highly heterogeneous information spaces (HHIS) through a novel framework that consists of two orthogonal layers: the effectiveness layer encompasses methods for building overlapping blocks with small likelihood of missed matches; the efficiency layer comprises a rich variety of techniques that significantly restrict the required number of pairwise comparisons, having a controllable impact on the number of detected duplicates. We map to our framework all relevant existing methods for creating and processing blocks in the context of HHIS, and additionally propose two novel techniques: attribute clustering blocking and comparison scheduling. We evaluate the performance of each layer and method on two large-scale, real-world data sets and validate the excellent balance between efficiency and effectiveness that they achieve.

KW - blocking methods
KW - entity resolution
KW - Information integration
UR - http://www.scopus.com/inward/record.url?scp=84887673907&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2012.150
DO - 10.1109/TKDE.2012.150
M3 - Article
AN - SCOPUS:84887673907
VL - 25
SP - 2665
EP - 2682
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
SN - 1041-4347
IS - 12
M1 - 6255742
ER -
