Efficient entity resolution for large heterogeneous information spaces

George Papadakis; Ekaterini Ioannou; Claudia Niederée; Peter Fankhauser

doi:10.1145/1935826.1935903

Details

Original language	English
Title of host publication	Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011
Pages	535-544
Number of pages	10
Publication status	Published - Feb 2011
Event	4th ACM International Conference on Web Search and Data Mining, WSDM 2011, Hong Kong, China, 9 Feb 2011 → 12 Feb 2011

Publication series

Name	Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011

Abstract

We have recently witnessed an enormous growth in the volume of structured and semi-structured data sets available on the Web. An important prerequisite for using and combining such data sets is the detection and merging of information that describes the same real-world entities, a task known as Entity Resolution. To make this quadratic task efficient, blocking techniques are typically employed. However, the high dynamics, loose schema binding, and heterogeneity of (semi-)structured data pose new challenges for entity resolution. Existing blocking approaches become inapplicable because they rely on the homogeneity of the considered data and a priori known schemata. In this paper, we introduce a novel approach for entity resolution, scaling it up for large, noisy, and heterogeneous information spaces. It combines an attribute-agnostic mechanism for building blocks with intelligent block processing techniques that boost blocks with high expected utility, propagate knowledge about identified matches, and preempt the resolution process when it gets too expensive. Our extensive evaluation on real-world, large, heterogeneous data sets verifies that the suggested approach is both effective and efficient.
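
As an illustration only, attribute-agnostic blocking can be sketched in Python. The abstract does not specify the mechanism, so this sketch assumes a simple token-based scheme: every entity is indexed under each token of its attribute values, and attribute names are ignored. The function names `build_blocks` and `candidate_pairs` and the sample records are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations
import re


def build_blocks(entities):
    """Index every entity under each token of its values (assumed token-based scheme)."""
    blocks = defaultdict(set)
    for entity_id, attributes in entities.items():
        # Attribute names are deliberately ignored: only the values matter.
        for value in attributes.values():
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(entity_id)
    # A block holding a single entity yields no comparisons, so it is dropped.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Collect the distinct entity pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical records with different schemata describing overlapping entities.
entities = {
    "e1": {"name": "Alice Smith", "city": "Berlin"},
    "e2": {"fullName": "Smith, Alice", "location": "Paris"},
    "e3": {"label": "Bob Jones"},
}

blocks = build_blocks(entities)
pairs = candidate_pairs(blocks)
print(sorted(blocks))
print(sorted(pairs))
```

Comparing only the pairs that share a block avoids the quadratic all-pairs comparison; the block processing techniques named in the abstract (utility-based ordering, match propagation, preemption) are not modelled here.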

Keywords

Attribute-Agnostic blocking, Data cleaning, Entity resolution

ASJC Scopus subject areas

Computer Science (all)
Computer Networks and Communications
Computer Science Applications
Software

Cite this

Efficient entity resolution for large heterogeneous information spaces. / Papadakis, George; Ioannou, Ekaterini; Niederée, Claudia et al.
Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011. 2011. p. 535-544 (Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Papadakis, G, Ioannou, E, Niederée, C & Fankhauser, P 2011, Efficient entity resolution for large heterogeneous information spaces. in Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011. Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011, pp. 535-544, 4th ACM International Conference on Web Search and Data Mining, WSDM 2011, Hong Kong, China, 9 Feb 2011. https://doi.org/10.1145/1935826.1935903

Papadakis, G., Ioannou, E., Niederée, C., & Fankhauser, P. (2011). Efficient entity resolution for large heterogeneous information spaces. In Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011 (pp. 535-544). (Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011). https://doi.org/10.1145/1935826.1935903

Papadakis G, Ioannou E, Niederée C, Fankhauser P. Efficient entity resolution for large heterogeneous information spaces. In Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011. 2011. p. 535-544. (Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011). doi: 10.1145/1935826.1935903

Papadakis, George ; Ioannou, Ekaterini ; Niederée, Claudia et al. / Efficient entity resolution for large heterogeneous information spaces. Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011. 2011. pp. 535-544 (Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011).

@inproceedings{3d168e321c3443cebd717da6197699f9,
title = "Efficient entity resolution for large heterogeneous information spaces",
abstract = "We have recently witnessed an enormous growth in the volume of structured and semi-structured data sets available on the Web. An important prerequisite for using and combining such data sets is the detection and merging of information that describes the same real-world entities, a task known as Entity Resolution. To make this quadratic task efficient, blocking techniques are typically employed. However, the high dynamics, loose schema binding, and heterogeneity of (semi-)structured data pose new challenges for entity resolution. Existing blocking approaches become inapplicable because they rely on the homogeneity of the considered data and a priori known schemata. In this paper, we introduce a novel approach for entity resolution, scaling it up for large, noisy, and heterogeneous information spaces. It combines an attribute-agnostic mechanism for building blocks with intelligent block processing techniques that boost blocks with high expected utility, propagate knowledge about identified matches, and preempt the resolution process when it gets too expensive. Our extensive evaluation on real-world, large, heterogeneous data sets verifies that the suggested approach is both effective and efficient.",
keywords = "Attribute-Agnostic blocking, Data cleaning, Entity resolution",
author = "George Papadakis and Ekaterini Ioannou and Claudia Nieder{\'e}e and Peter Fankhauser",
year = "2011",
month = feb,
doi = "10.1145/1935826.1935903",
language = "English",
isbn = "9781450304931",
series = "Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011",
pages = "535--544",
booktitle = "Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011",
note = "4th ACM International Conference on Web Search and Data Mining, WSDM 2011 ; Conference date: 09-02-2011 Through 12-02-2011",
}

TY - GEN
T1 - Efficient entity resolution for large heterogeneous information spaces
AU - Papadakis, George
AU - Ioannou, Ekaterini
AU - Niederée, Claudia
AU - Fankhauser, Peter
PY - 2011/2
Y1 - 2011/2
AB - We have recently witnessed an enormous growth in the volume of structured and semi-structured data sets available on the Web. An important prerequisite for using and combining such data sets is the detection and merging of information that describes the same real-world entities, a task known as Entity Resolution. To make this quadratic task efficient, blocking techniques are typically employed. However, the high dynamics, loose schema binding, and heterogeneity of (semi-)structured data pose new challenges for entity resolution. Existing blocking approaches become inapplicable because they rely on the homogeneity of the considered data and a priori known schemata. In this paper, we introduce a novel approach for entity resolution, scaling it up for large, noisy, and heterogeneous information spaces. It combines an attribute-agnostic mechanism for building blocks with intelligent block processing techniques that boost blocks with high expected utility, propagate knowledge about identified matches, and preempt the resolution process when it gets too expensive. Our extensive evaluation on real-world, large, heterogeneous data sets verifies that the suggested approach is both effective and efficient.
KW - Attribute-Agnostic blocking
KW - Data cleaning
KW - Entity resolution
UR - http://www.scopus.com/inward/record.url?scp=79952386495&partnerID=8YFLogxK
U2 - 10.1145/1935826.1935903
DO - 10.1145/1935826.1935903
M3 - Conference contribution
AN - SCOPUS:79952386495
SN - 9781450304931
T3 - Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011
SP - 535
EP - 544
BT - Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011
T2 - 4th ACM International Conference on Web Search and Data Mining, WSDM 2011
Y2 - 9 February 2011 through 12 February 2011
ER -
