Efficient entity resolution for large heterogeneous information spaces

Research output: Chapter in book/report/conference proceeding > Conference contribution > Research > peer review

Authors

  • George Papadakis
  • Ekaterini Ioannou
  • Claudia Niederée
  • Peter Fankhauser

Research Organisations

External Research Organisations

  • Fraunhofer Institute for Integrated Circuits (IIS)

Details

Original language: English
Title of host publication: Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011
Pages: 535-544
Number of pages: 10
Publication status: Published - Feb 2011
Event: 4th ACM International Conference on Web Search and Data Mining, WSDM 2011 - Hong Kong, China
Duration: 9 Feb 2011 to 12 Feb 2011

Publication series

Name: Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011

Abstract

We have recently witnessed an enormous growth in the volume of structured and semi-structured data sets available on the Web. An important prerequisite for using and combining such data sets is the detection and merging of information that describes the same real-world entities, a task known as Entity Resolution. To make this quadratic task efficient, blocking techniques are typically employed. However, the high dynamics, loose schema binding, and heterogeneity of (semi-)structured data pose new challenges for entity resolution. Existing blocking approaches become inapplicable because they rely on the homogeneity of the considered data and a priori known schemata. In this paper, we introduce a novel approach for entity resolution, scaling it up for large, noisy, and heterogeneous information spaces. It combines an attribute-agnostic mechanism for building blocks with intelligent block processing techniques that boost blocks with high expected utility, propagate knowledge about identified matches, and preempt the resolution process when it gets too expensive. Our extensive evaluation on real-world, large, heterogeneous data sets verifies that the suggested approach is both effective and efficient.
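
To make the idea of attribute-agnostic blocking concrete, the following is a minimal illustrative sketch and not the paper's implementation. It assumes one simple way of ignoring schema information: every token that appears in any attribute value defines a block, and only profiles sharing a block are compared. The function names and the sample profiles are hypothetical, and the block processing steps named in the abstract (utility-based ordering, match propagation, preemption) are not shown.

```python
from collections import defaultdict
from itertools import combinations
import re


def tokenize(value):
    """Lowercase a value and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", value.lower()))


def build_blocks(profiles):
    """Map each token to the ids of profiles whose values contain it.

    Attribute names are ignored entirely, so no schema knowledge is needed.
    """
    blocks = defaultdict(set)
    for pid, profile in profiles.items():
        for value in profile.values():
            for token in tokenize(str(value)):
                blocks[token].add(pid)
    # A block holding a single profile yields no comparisons, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Collect the distinct profile pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical profiles with deliberately different attribute names.
profiles = {
    "p1": {"name": "Anna Schmidt", "city": "Berlin"},
    "p2": {"fullName": "Schmidt, Anna", "employer": "Acme"},
    "p3": {"label": "Acme Corp", "type": "company"},
    "p4": {"title": "Berlin Marathon"},
}

blocks = build_blocks(profiles)
pairs = candidate_pairs(blocks)
all_pairs = len(profiles) * (len(profiles) - 1) // 2  # quadratic baseline

print(f"{len(pairs)} candidate comparisons instead of {all_pairs}")
```

In this toy input, blocking halves the number of comparisons, yet the pair p1 and p2 is still proposed even though the two profiles use different attribute names. Some proposed pairs (for example p1 and p4, which share only the token "berlin") are not matches; the sketch does nothing to prune or prioritize them.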

Keywords

    Attribute-Agnostic blocking, Data cleaning, Entity resolution

Cite this

Efficient entity resolution for large heterogeneous information spaces. / Papadakis, George; Ioannou, Ekaterini; Niederée, Claudia et al.
Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011. 2011. p. 535-544 (Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011).

Papadakis, G, Ioannou, E, Niederée, C & Fankhauser, P 2011, Efficient entity resolution for large heterogeneous information spaces. in Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011. Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011, pp. 535-544, 4th ACM International Conference on Web Search and Data Mining, WSDM 2011, Hong Kong, China, 9 Feb 2011. https://doi.org/10.1145/1935826.1935903
Papadakis, G., Ioannou, E., Niederée, C., & Fankhauser, P. (2011). Efficient entity resolution for large heterogeneous information spaces. In Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011 (pp. 535-544). (Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011). https://doi.org/10.1145/1935826.1935903
Papadakis G, Ioannou E, Niederée C, Fankhauser P. Efficient entity resolution for large heterogeneous information spaces. In Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011. 2011. p. 535-544. (Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011). doi: 10.1145/1935826.1935903
Papadakis, George ; Ioannou, Ekaterini ; Niederée, Claudia et al. / Efficient entity resolution for large heterogeneous information spaces. Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011. 2011. pp. 535-544 (Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011).
@inproceedings{3d168e321c3443cebd717da6197699f9,
title = "Efficient entity resolution for large heterogeneous information spaces",
abstract = "We have recently witnessed an enormous growth in the volume of structured and semi-structured data sets available on the Web. An important prerequisite for using and combining such data sets is the detection and merging of information that describes the same real-world entities, a task known as Entity Resolution. To make this quadratic task efficient, blocking techniques are typically employed. However, the high dynamics, loose schema binding, and heterogeneity of (semi-)structured data pose new challenges for entity resolution. Existing blocking approaches become inapplicable because they rely on the homogeneity of the considered data and a priori known schemata. In this paper, we introduce a novel approach for entity resolution, scaling it up for large, noisy, and heterogeneous information spaces. It combines an attribute-agnostic mechanism for building blocks with intelligent block processing techniques that boost blocks with high expected utility, propagate knowledge about identified matches, and preempt the resolution process when it gets too expensive. Our extensive evaluation on real-world, large, heterogeneous data sets verifies that the suggested approach is both effective and efficient.",
keywords = "Attribute-Agnostic blocking, Data cleaning, Entity resolution",
author = "George Papadakis and Ekaterini Ioannou and Claudia Nieder{\'e}e and Peter Fankhauser",
year = "2011",
month = feb,
doi = "10.1145/1935826.1935903",
language = "English",
isbn = "9781450304931",
series = "Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011",
pages = "535--544",
booktitle = "Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011",
note = "4th ACM International Conference on Web Search and Data Mining, WSDM 2011 ; Conference date: 09-02-2011 Through 12-02-2011",

}

TY - GEN

T1 - Efficient entity resolution for large heterogeneous information spaces

AU - Papadakis, George

AU - Ioannou, Ekaterini

AU - Niederée, Claudia

AU - Fankhauser, Peter

PY - 2011/2

Y1 - 2011/2

AB - We have recently witnessed an enormous growth in the volume of structured and semi-structured data sets available on the Web. An important prerequisite for using and combining such data sets is the detection and merging of information that describes the same real-world entities, a task known as Entity Resolution. To make this quadratic task efficient, blocking techniques are typically employed. However, the high dynamics, loose schema binding, and heterogeneity of (semi-)structured data pose new challenges for entity resolution. Existing blocking approaches become inapplicable because they rely on the homogeneity of the considered data and a priori known schemata. In this paper, we introduce a novel approach for entity resolution, scaling it up for large, noisy, and heterogeneous information spaces. It combines an attribute-agnostic mechanism for building blocks with intelligent block processing techniques that boost blocks with high expected utility, propagate knowledge about identified matches, and preempt the resolution process when it gets too expensive. Our extensive evaluation on real-world, large, heterogeneous data sets verifies that the suggested approach is both effective and efficient.

KW - Attribute-Agnostic blocking

KW - Data cleaning

KW - Entity resolution

UR - http://www.scopus.com/inward/record.url?scp=79952386495&partnerID=8YFLogxK

U2 - 10.1145/1935826.1935903

DO - 10.1145/1935826.1935903

M3 - Conference contribution

AN - SCOPUS:79952386495

SN - 9781450304931

T3 - Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011

SP - 535

EP - 544

BT - Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011

T2 - 4th ACM International Conference on Web Search and Data Mining, WSDM 2011

Y2 - 9 February 2011 through 12 February 2011

ER -