Details
Originalsprache | Englisch |
---|---|
Aufsatznummer | 6255742 |
Seiten (von - bis) | 2665-2682 |
Seitenumfang | 18 |
Fachzeitschrift | IEEE Transactions on Knowledge and Data Engineering |
Jahrgang | 25 |
Ausgabenummer | 12 |
Publikationsstatus | Veröffentlicht - 31 Juli 2012 |
Abstract
In the context of entity resolution (ER) in highly heterogeneous, noisy, user-generated entity collections, practically all block building methods employ redundancy to achieve high effectiveness. This practice, however, results in a high number of pairwise comparisons, with a negative impact on efficiency. Existing block processing strategies aim at discarding unnecessary comparisons at no cost in effectiveness. In this paper, we systemize blocking methods for clean-clean ER (an inherently quadratic task) over highly heterogeneous information spaces (HHIS) through a novel framework that consists of two orthogonal layers: the effectiveness layer encompasses methods for building overlapping blocks with small likelihood of missed matches; the efficiency layer comprises a rich variety of techniques that significantly restrict the required number of pairwise comparisons, having a controllable impact on the number of detected duplicates. We map to our framework all relevant existing methods for creating and processing blocks in the context of HHIS, and additionally propose two novel techniques: attribute clustering blocking and comparison scheduling. We evaluate the performance of each layer and method on two large-scale, real-world data sets and validate the excellent balance between efficiency and effectiveness that they achieve.
ASJC Scopus Sachgebiete
- Informatik (insg.)
- Information systems
- Informatik (insg.)
- Angewandte Informatik
- Informatik (insg.)
- Theoretische Informatik und Mathematik
Zitieren
- Standard
- Harvard
- Apa
- Vancouver
- BibTex
- RIS
in: IEEE Transactions on Knowledge and Data Engineering, Jahrgang 25, Nr. 12, 6255742, 31.07.2012, S. 2665-2682.
Publikation: Beitrag in Fachzeitschrift › Artikel › Forschung › Peer-Review
}
TY - JOUR
T1 - A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces
AU - Papadakis, George
AU - Ioannou, Ekaterini
AU - Palpanas, Themis
AU - Niederee, Claudia
AU - Nejdl, Wolfgang
PY - 2012/7/31
Y1 - 2012/7/31
N2 - In the context of entity resolution (ER) in highly heterogeneous, noisy, user-generated entity collections, practically all block building methods employ redundancy to achieve high effectiveness. This practice, however, results in a high number of pairwise comparisons, with a negative impact on efficiency. Existing block processing strategies aim at discarding unnecessary comparisons at no cost in effectiveness. In this paper, we systemize blocking methods for clean-clean ER (an inherently quadratic task) over highly heterogeneous information spaces (HHIS) through a novel framework that consists of two orthogonal layers: the effectiveness layer encompasses methods for building overlapping blocks with small likelihood of missed matches; the efficiency layer comprises a rich variety of techniques that significantly restrict the required number of pairwise comparisons, having a controllable impact on the number of detected duplicates. We map to our framework all relevant existing methods for creating and processing blocks in the context of HHIS, and additionally propose two novel techniques: attribute clustering blocking and comparison scheduling. We evaluate the performance of each layer and method on two large-scale, real-world data sets and validate the excellent balance between efficiency and effectiveness that they achieve.
AB - In the context of entity resolution (ER) in highly heterogeneous, noisy, user-generated entity collections, practically all block building methods employ redundancy to achieve high effectiveness. This practice, however, results in a high number of pairwise comparisons, with a negative impact on efficiency. Existing block processing strategies aim at discarding unnecessary comparisons at no cost in effectiveness. In this paper, we systemize blocking methods for clean-clean ER (an inherently quadratic task) over highly heterogeneous information spaces (HHIS) through a novel framework that consists of two orthogonal layers: the effectiveness layer encompasses methods for building overlapping blocks with small likelihood of missed matches; the efficiency layer comprises a rich variety of techniques that significantly restrict the required number of pairwise comparisons, having a controllable impact on the number of detected duplicates. We map to our framework all relevant existing methods for creating and processing blocks in the context of HHIS, and additionally propose two novel techniques: attribute clustering blocking and comparison scheduling. We evaluate the performance of each layer and method on two large-scale, real-world data sets and validate the excellent balance between efficiency and effectiveness that they achieve.
KW - blocking methods
KW - entity resolution
KW - Information integration
UR - http://www.scopus.com/inward/record.url?scp=84887673907&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2012.150
DO - 10.1109/TKDE.2012.150
M3 - Article
AN - SCOPUS:84887673907
VL - 25
SP - 2665
EP - 2682
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
SN - 1041-4347
IS - 12
M1 - 6255742
ER -