Details
| Original language | English |
| --- | --- |
| Title of host publication | WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining |
| Pages | 53-62 |
| Number of pages | 10 |
| Publication status | Published - 8 Feb 2012 |
| Event | 5th ACM International Conference on Web Search and Data Mining, WSDM 2012 - Seattle, WA, United States. Duration: 8 Feb 2012 → 12 Feb 2012 |
Publication series
| Name | WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining |
| --- | --- |
Abstract
A prerequisite for leveraging the vast amount of data available on the Web is Entity Resolution, i.e., the process of identifying and linking data that describe the same real-world objects. To make this inherently quadratic process applicable to large data sets, blocking is typically employed: entities (records) are grouped into clusters - the blocks - of matching candidates, and only entities of the same block are compared. However, novel blocking techniques are required for dealing with the noisy, heterogeneous, semi-structured, user-generated data on the Web, as traditional blocking techniques are inapplicable due to their reliance on schema information. The introduction of redundancy improves the robustness of blocking methods but comes at the price of additional computational cost. In this paper, we present methods for enhancing the efficiency of redundancy-bearing blocking methods, such as our attribute-agnostic blocking approach. We introduce novel blocking schemes that build blocks based on a variety of evidence, including entity identifiers and relationships between entities; they significantly reduce the required number of comparisons, while maintaining blocking effectiveness at very high levels. We also introduce two theoretical measures that provide a reliable estimation of the performance of a blocking method, without requiring the analytical processing of its blocks. Based on these measures, we develop two techniques for improving the performance of blocking: combining individual, complementary blocking schemes, and purging blocks until given criteria are satisfied. We test our methods through an extensive experimental evaluation, using a voluminous data set with 182 million heterogeneous entities. The outcomes of our study show the applicability and the high performance of our approach.
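To make the blocking idea in the abstract concrete, the following Python sketch illustrates one simple form of attribute-agnostic blocking: every token of every attribute value defines a block, oversized blocks are purged, and only entities sharing a block are compared. The function names, tokenizer, and purging threshold are assumptions made for illustration and do not reproduce the paper's actual blocking schemes or measures.

```python
# Minimal, illustrative sketch of schema-agnostic token blocking with block
# purging. All names (build_blocks, purge_blocks, candidate_pairs) and the
# size threshold are hypothetical, not the paper's implementation.
import re
from collections import defaultdict
from itertools import combinations

def build_blocks(entities):
    """Key each block by a token appearing in any attribute value,
    ignoring attribute names entirely (attribute-agnostic)."""
    blocks = defaultdict(set)
    for eid, attributes in entities.items():
        for value in attributes.values():
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(eid)
    return blocks

def purge_blocks(blocks, max_size=100):
    """Discard singleton blocks (no comparisons) and oversized blocks,
    whose quadratic comparison cost rarely pays off."""
    return {t: ids for t, ids in blocks.items() if 2 <= len(ids) <= max_size}

def candidate_pairs(blocks):
    """Yield each distinct pair of entities that co-occurs in at least
    one block; only these pairs are ever compared."""
    seen = set()
    for ids in blocks.values():
        for pair in combinations(sorted(ids), 2):
            if pair not in seen:
                seen.add(pair)
                yield pair

entities = {
    "e1": {"name": "John Smith", "city": "Seattle"},
    "e2": {"fullName": "Smith, John", "location": "Seattle WA"},
    "e3": {"title": "Data Mining Overview"},
}
print(list(candidate_pairs(purge_blocks(build_blocks(entities)))))
# -> [('e1', 'e2')]
```

Because each entity lands in one block per token, the same pair can co-occur in several blocks; the `seen` set avoids comparing it twice. This repeated co-occurrence is the redundancy-induced overhead that the methods in the paper aim to reduce.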
Keywords
- Attribute-agnostic blocking
- Data cleaning
- Entity resolution
ASJC Scopus subject areas
- Computer Science (all)
- Computer Networks and Communications
Cite this
Papadakis, G., Ioannou, E., Niederée, C., Palpanas, T., & Nejdl, W. (2012). Beyond 100 Million Entities. In WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining (pp. 53-62). https://doi.org/10.1145/2124295.2124305
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Beyond 100 Million Entities
T2 - 5th ACM International Conference on Web Search and Data Mining, WSDM 2012
AU - Papadakis, George
AU - Ioannou, Ekaterini
AU - Niederée, Claudia
AU - Palpanas, Themis
AU - Nejdl, Wolfgang
PY - 2012/2/8
Y1 - 2012/2/8
N2 - A prerequisite for leveraging the vast amount of data available on the Web is Entity Resolution, i.e., the process of identifying and linking data that describe the same real-world objects. To make this inherently quadratic process applicable to large data sets, blocking is typically employed: entities (records) are grouped into clusters - the blocks - of matching candidates, and only entities of the same block are compared. However, novel blocking techniques are required for dealing with the noisy, heterogeneous, semi-structured, user-generated data on the Web, as traditional blocking techniques are inapplicable due to their reliance on schema information. The introduction of redundancy improves the robustness of blocking methods but comes at the price of additional computational cost. In this paper, we present methods for enhancing the efficiency of redundancy-bearing blocking methods, such as our attribute-agnostic blocking approach. We introduce novel blocking schemes that build blocks based on a variety of evidence, including entity identifiers and relationships between entities; they significantly reduce the required number of comparisons, while maintaining blocking effectiveness at very high levels. We also introduce two theoretical measures that provide a reliable estimation of the performance of a blocking method, without requiring the analytical processing of its blocks. Based on these measures, we develop two techniques for improving the performance of blocking: combining individual, complementary blocking schemes, and purging blocks until given criteria are satisfied. We test our methods through an extensive experimental evaluation, using a voluminous data set with 182 million heterogeneous entities. The outcomes of our study show the applicability and the high performance of our approach.
AB - A prerequisite for leveraging the vast amount of data available on the Web is Entity Resolution, i.e., the process of identifying and linking data that describe the same real-world objects. To make this inherently quadratic process applicable to large data sets, blocking is typically employed: entities (records) are grouped into clusters - the blocks - of matching candidates, and only entities of the same block are compared. However, novel blocking techniques are required for dealing with the noisy, heterogeneous, semi-structured, user-generated data on the Web, as traditional blocking techniques are inapplicable due to their reliance on schema information. The introduction of redundancy improves the robustness of blocking methods but comes at the price of additional computational cost. In this paper, we present methods for enhancing the efficiency of redundancy-bearing blocking methods, such as our attribute-agnostic blocking approach. We introduce novel blocking schemes that build blocks based on a variety of evidence, including entity identifiers and relationships between entities; they significantly reduce the required number of comparisons, while maintaining blocking effectiveness at very high levels. We also introduce two theoretical measures that provide a reliable estimation of the performance of a blocking method, without requiring the analytical processing of its blocks. Based on these measures, we develop two techniques for improving the performance of blocking: combining individual, complementary blocking schemes, and purging blocks until given criteria are satisfied. We test our methods through an extensive experimental evaluation, using a voluminous data set with 182 million heterogeneous entities. The outcomes of our study show the applicability and the high performance of our approach.
KW - Attribute-agnostic blocking
KW - Data cleaning
KW - Entity resolution
UR - http://www.scopus.com/inward/record.url?scp=84858041897&partnerID=8YFLogxK
U2 - 10.1145/2124295.2124305
DO - 10.1145/2124295.2124305
M3 - Conference contribution
AN - SCOPUS:84858041897
SN - 9781450307475
T3 - WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining
SP - 53
EP - 62
BT - WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining
Y2 - 8 February 2012 through 12 February 2012
ER -