Beyond 100 Million Entities: Large-scale Blocking-based Resolution for Heterogeneous Data

George Papadakis; Ekaterini Ioannou; Claudia Niederée; Themis Palpanas; Wolfgang Nejdl

doi:10.1145/2124295.2124305

Details

Original language	English
Title of host publication	WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining
Pages	53-62
Number of pages	10
Publication status	Published - 8 Feb 2012
Event	5th ACM International Conference on Web Search and Data Mining, WSDM 2012 - Seattle, WA, United States (Duration: 8 Feb 2012 → 12 Feb 2012)

Publication series

Name	WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining

Abstract

A prerequisite for leveraging the vast amount of data available on the Web is Entity Resolution, i.e., the process of identifying and linking data that describe the same real-world objects. To make this inherently quadratic process applicable to large data sets, blocking is typically employed: entities (records) are grouped into clusters - the blocks - of matching candidates and only entities of the same block are compared. However, novel blocking techniques are required for dealing with the noisy, heterogeneous, semi-structured, user-generated data in the Web, as traditional blocking techniques are inapplicable due to their reliance on schema information. The introduction of redundancy improves the robustness of blocking methods but comes at the price of additional computational cost. In this paper, we present methods for enhancing the efficiency of redundancy-bearing blocking methods, such as our attribute-agnostic blocking approach. We introduce novel blocking schemes that build blocks based on a variety of evidences, including entity identifiers and relationships between entities; they significantly reduce the required number of comparisons, while maintaining blocking effectiveness at very high levels. We also introduce two theoretical measures that provide a reliable estimation of the performance of a blocking method, without requiring the analytical processing of its blocks. Based on these measures, we develop two techniques for improving the performance of blocking: combining individual, complementary blocking schemes, and purging blocks until given criteria are satisfied. We test our methods through an extensive experimental evaluation, using a voluminous data set with 182 million heterogeneous entities. The outcomes of our study show the applicability and the high performance of our approach.
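
The blocking idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes a simple token-based, schema-agnostic scheme (every token in any attribute value becomes a block key) and a hypothetical size threshold `max_block_size` for purging oversized blocks. The function names and the toy records are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations
import re


def token_blocking(entities):
    """Group entity ids by every token found in any attribute value.

    Attribute names are ignored, so no schema knowledge is needed.
    (Assumed scheme for illustration; the paper's schemes may differ.)
    """
    blocks = defaultdict(set)
    for entity_id, attributes in entities.items():
        for value in attributes.values():
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(entity_id)
    # A block with a single entity yields no comparisons, so drop it.
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}


def purge_blocks(blocks, max_block_size):
    """Discard blocks larger than a hypothetical size threshold."""
    return {key: ids for key, ids in blocks.items() if len(ids) <= max_block_size}


def candidate_pairs(blocks):
    """Return the distinct pairs of entities that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy records with heterogeneous attribute names (invented data).
entities = {
    "e1": {"name": "Alice Smith", "city": "Berlin"},
    "e2": {"fullName": "A. Smith", "location": "Berlin"},
    "e3": {"label": "Bob Jones", "based_in": "Paris"},
    "e4": {"name": "B. Jones", "city": "Berlin"},
}

blocks = token_blocking(entities)
purged = purge_blocks(blocks, max_block_size=2)
pairs_before = candidate_pairs(blocks)
pairs_after = candidate_pairs(purged)
```

On these four toy records, exhaustive comparison would need 6 pairs; blocking leaves 4 candidate pairs, and purging the oversized `berlin` block leaves 2. The redundancy (an entity appearing in several blocks) is what makes the scheme robust to noise, and purging is one way to limit the extra comparisons it causes.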

Keywords

Attribute-agnostic blocking, Data cleaning, Entity resolution

ASJC Scopus subject areas

Computer Science (all)
Computer Networks and Communications

Cite this

Beyond 100 Million Entities: Large-scale Blocking-based Resolution for Heterogeneous Data. / Papadakis, George; Ioannou, Ekaterini; Niederée, Claudia et al.
WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining. 2012. p. 53-62 (WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Papadakis, G, Ioannou, E, Niederée, C, Palpanas, T & Nejdl, W 2012, Beyond 100 Million Entities: Large-scale Blocking-based Resolution for Heterogeneous Data. in WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining. WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining, pp. 53-62, 5th ACM International Conference on Web Search and Data Mining, WSDM 2012, Seattle, WA, United States, 8 Feb 2012. https://doi.org/10.1145/2124295.2124305

Papadakis, G., Ioannou, E., Niederée, C., Palpanas, T., & Nejdl, W. (2012). Beyond 100 Million Entities: Large-scale Blocking-based Resolution for Heterogeneous Data. In WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining (pp. 53-62). (WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining). https://doi.org/10.1145/2124295.2124305

Papadakis G, Ioannou E, Niederée C, Palpanas T, Nejdl W. Beyond 100 Million Entities: Large-scale Blocking-based Resolution for Heterogeneous Data. In WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining. 2012. p. 53-62. (WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining). doi: 10.1145/2124295.2124305

Papadakis, George ; Ioannou, Ekaterini ; Niederée, Claudia et al. / Beyond 100 Million Entities : Large-scale Blocking-based Resolution for Heterogeneous Data. WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining. 2012. pp. 53-62 (WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining).

@inproceedings{e8ea5aa82ee943b58f21546219d6ba40,

title = "Beyond 100 Million Entities: Large-scale Blocking-based Resolution for Heterogeneous Data",

abstract = "A prerequisite for leveraging the vast amount of data available on the Web is Entity Resolution, i.e., the process of identifying and linking data that describe the same real-world objects. To make this inherently quadratic process applicable to large data sets, blocking is typically employed: entities (records) are grouped into clusters - the blocks - of matching candidates and only entities of the same block are compared. However, novel blocking techniques are required for dealing with the noisy, heterogeneous, semi-structured, user-generated data in the Web, as traditional blocking techniques are inapplicable due to their reliance on schema information. The introduction of redundancy improves the robustness of blocking methods but comes at the price of additional computational cost. In this paper, we present methods for enhancing the efficiency of redundancy-bearing blocking methods, such as our attribute-agnostic blocking approach. We introduce novel blocking schemes that build blocks based on a variety of evidences, including entity identifiers and relationships between entities; they significantly reduce the required number of comparisons, while maintaining blocking effectiveness at very high levels. We also introduce two theoretical measures that provide a reliable estimation of the performance of a blocking method, without requiring the analytical processing of its blocks. Based on these measures, we develop two techniques for improving the performance of blocking: combining individual, complementary blocking schemes, and purging blocks until given criteria are satisfied. We test our methods through an extensive experimental evaluation, using a voluminous data set with 182 million heterogeneous entities. The outcomes of our study show the applicability and the high performance of our approach.",

keywords = "Attribute-agnostic blocking, Data cleaning, Entity resolution",

author = "George Papadakis and Ekaterini Ioannou and Claudia Nieder{\'e}e and Themis Palpanas and Wolfgang Nejdl",

year = "2012",

month = feb,

day = "8",

doi = "10.1145/2124295.2124305",

language = "English",

isbn = "9781450307475",

series = "WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining",

pages = "53--62",

booktitle = "WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining",

note = "5th ACM International Conference on Web Search and Data Mining, WSDM 2012 ; Conference date: 08-02-2012 Through 12-02-2012",

}

TY - GEN

T1 - Beyond 100 Million Entities

T2 - 5th ACM International Conference on Web Search and Data Mining, WSDM 2012

AU - Papadakis, George

AU - Ioannou, Ekaterini

AU - Niederée, Claudia

AU - Palpanas, Themis

AU - Nejdl, Wolfgang

PY - 2012/2/8

Y1 - 2012/2/8

AB - A prerequisite for leveraging the vast amount of data available on the Web is Entity Resolution, i.e., the process of identifying and linking data that describe the same real-world objects. To make this inherently quadratic process applicable to large data sets, blocking is typically employed: entities (records) are grouped into clusters - the blocks - of matching candidates and only entities of the same block are compared. However, novel blocking techniques are required for dealing with the noisy, heterogeneous, semi-structured, user-generated data in the Web, as traditional blocking techniques are inapplicable due to their reliance on schema information. The introduction of redundancy improves the robustness of blocking methods but comes at the price of additional computational cost. In this paper, we present methods for enhancing the efficiency of redundancy-bearing blocking methods, such as our attribute-agnostic blocking approach. We introduce novel blocking schemes that build blocks based on a variety of evidences, including entity identifiers and relationships between entities; they significantly reduce the required number of comparisons, while maintaining blocking effectiveness at very high levels. We also introduce two theoretical measures that provide a reliable estimation of the performance of a blocking method, without requiring the analytical processing of its blocks. Based on these measures, we develop two techniques for improving the performance of blocking: combining individual, complementary blocking schemes, and purging blocks until given criteria are satisfied. We test our methods through an extensive experimental evaluation, using a voluminous data set with 182 million heterogeneous entities. The outcomes of our study show the applicability and the high performance of our approach.

KW - Attribute-agnostic blocking

KW - Data cleaning

KW - Entity resolution

UR - http://www.scopus.com/inward/record.url?scp=84858041897&partnerID=8YFLogxK

U2 - 10.1145/2124295.2124305

DO - 10.1145/2124295.2124305

M3 - Conference contribution

AN - SCOPUS:84858041897

SN - 9781450307475

T3 - WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining

SP - 53

EP - 62

BT - WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining

Y2 - 8 February 2012 through 12 February 2012

ER -

By the same author(s)

Harnessing Empathy and Ethics for Relevance Detection and Information Categorization in Climate and COVID-19 Tweets

Open benchmark for filtering techniques in entity resolution

Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Adaptive Dispatching of Mobile Charging Stations using Multi-Agent Graph Convolutional Cooperative-Competitive Reinforcement Learning

Robust Fusion of Time Series and Image Data for Improved Multimodal Clinical Prediction
