Details
Original language | English |
---|---|
Title of host publication | Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011 |
Publication status | Published - 12 Jun 2011 |
Event | 3rd International Workshop on Semantic Web Information Management, SWIM 2011, Athens, Greece. Duration: 12 Jun 2011 → 16 Jun 2011 |
Publication series
Name | Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011 |
---|
Abstract
Blocking methods are crucial for making the inherently quadratic task of Entity Resolution more efficient. The blocking methods proposed in the literature rely on the homogeneity of data and the availability of binding schema information; thus, they are inapplicable to the voluminous, noisy, and highly heterogeneous user-generated content of Web 2.0. To deal with such data, attribute-agnostic blocking has recently been introduced, following a two-layer strategy: the first layer places entities into overlapping blocks in order to achieve high effectiveness, while the second layer reduces the number of unnecessary comparisons in order to enhance efficiency. In this paper, we present a set of techniques that can be plugged into the second layer of attribute-agnostic blocking to further improve its efficiency. We introduce a technique that eliminates redundant comparisons and, building on it, we incorporate an approximate method for pruning comparisons that are highly likely to involve non-matching entities. We also introduce a novel measure for quantifying the redundancy a blocking method entails and explain how it can be used to tune the comparison-pruning process a priori. We apply our blocking techniques to two large, real-world data sets and report results that demonstrate a substantial increase in efficiency at a negligible (if any) cost in effectiveness.
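The two layers described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the token-based block construction, the seen-pairs set used to skip repeated comparisons, the Jaccard-style pruning rule, and all function names, sample entities, and the 0.6 threshold are assumptions chosen only to make the idea concrete.

```python
from itertools import combinations


def token_blocks(entities):
    """First layer (assumed): one block per token, ignoring attribute names."""
    blocks = {}
    for entity_id, values in entities.items():
        for value in values:
            for token in value.lower().split():
                blocks.setdefault(token, set()).add(entity_id)
    # A block holding a single entity produces no comparisons, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def all_comparisons(blocks):
    """Every pair in every block; overlapping blocks repeat some pairs."""
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


def distinct_comparisons(blocks):
    """Second layer, step 1: skip pairs already produced by another block."""
    seen = set()
    for pair in all_comparisons(blocks):
        if pair not in seen:
            seen.add(pair)
            yield pair


def prune(pairs, blocks, min_jaccard):
    """Second layer, step 2 (assumed rule): keep a pair only if the two
    entities share a large enough fraction of their blocks."""
    memberships = {}
    for token, ids in blocks.items():
        for entity_id in ids:
            memberships.setdefault(entity_id, set()).add(token)
    for a, b in pairs:
        shared = memberships[a] & memberships[b]
        union = memberships[a] | memberships[b]
        if len(shared) / len(union) >= min_jaccard:
            yield (a, b)


# Hypothetical toy entities, each given as a list of attribute values.
entities = {
    "e1": ["alice smith", "berlin"],
    "e2": ["a. smith", "berlin germany"],
    "e3": ["carol jones", "berlin"],
}

blocks = token_blocks(entities)
total = list(all_comparisons(blocks))
distinct = list(distinct_comparisons(blocks))
kept = list(prune(distinct, blocks, min_jaccard=0.6))

print(len(total), len(distinct), kept)
```

In this toy input, the pair (e1, e2) appears in two blocks, so skipping repeats saves one comparison without losing any candidate match. The pruning step, by contrast, is approximate: it may discard a true match whose entities happen to share few blocks, which is why its threshold needs tuning.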
Keywords
- attribute-agnostic blocking, data cleaning, entity resolution
ASJC Scopus subject areas
- Computer Science (all)
- Computer Networks and Communications
- Decision Sciences (all)
- Information Systems and Management
Cite this
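To Compare or Not to Compare. / Papadakis, George; Ioannou, Ekaterini; Niederée, Claudia; Palpanas, Themis; Nejdl, Wolfgang. Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011. 2011. (Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011).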
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - To Compare or Not to Compare
T2 - 3rd International Workshop on Semantic Web Information Management, SWIM 2011
AU - Papadakis, George
AU - Ioannou, Ekaterini
AU - Niederée, Claudia
AU - Palpanas, Themis
AU - Nejdl, Wolfgang
PY - 2011/6/12
Y1 - 2011/6/12
KW - attribute-agnostic blocking
KW - data cleaning
KW - entity resolution
UR - http://www.scopus.com/inward/record.url?scp=79960675264&partnerID=8YFLogxK
U2 - 10.1145/1999299.1999302
DO - 10.1145/1999299.1999302
M3 - Conference contribution
SN - 9781450306515
T3 - Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011
BT - Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011
Y2 - 12 June 2011 through 16 June 2011
ER -