Details
Original language | English |
---|---|
Title of host publication | Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011 |
Pages | 535-544 |
Number of pages | 10 |
Publication status | Published - Feb 2011 |
Event | 4th ACM International Conference on Web Search and Data Mining, WSDM 2011 - Hong Kong, China Duration: 9 Feb 2011 → 12 Feb 2011 |
Publication series
Name | Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011 |
---|---|
Abstract
We have recently witnessed an enormous growth in the volume of structured and semi-structured data sets available on the Web. An important prerequisite for using and combining such data sets is the detection and merging of information that describes the same real-world entities, a task known as Entity Resolution. To make this quadratic task efficient, blocking techniques are typically employed. However, the high dynamics, loose schema binding, and heterogeneity of (semi-)structured data pose new challenges for entity resolution. Existing blocking approaches become inapplicable because they rely on the homogeneity of the considered data and on a priori known schemata. In this paper, we introduce a novel approach for entity resolution that scales to large, noisy, and heterogeneous information spaces. It combines an attribute-agnostic mechanism for building blocks with intelligent block processing techniques that boost blocks with high expected utility, propagate knowledge about identified matches, and preempt the resolution process when it gets too expensive. Our extensive evaluation on real-world, large, heterogeneous data sets verifies that the suggested approach is both effective and efficient.
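As a rough illustration of what attribute-agnostic blocking can look like, the sketch below places two entities in the same block whenever any of their values share a token, ignoring attribute names entirely. This is an assumption made for illustration, not necessarily the exact mechanism of the paper. The function names, the small-blocks-first ordering (a stand-in for "expected utility"), the `max_comparisons` budget (a stand-in for preempting the process), and the sample records are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
import re


def build_token_blocks(entities):
    """Map each token to the set of entity ids whose values contain it.

    Attribute names are ignored, so no schema knowledge is required.
    """
    blocks = defaultdict(set)
    for entity_id, attributes in entities.items():
        for value in attributes.values():
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(entity_id)
    # A block holding a single entity produces no comparisons.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks, max_comparisons=None):
    """Yield each candidate pair once, visiting small blocks first.

    Assumption: small blocks are treated as more useful. The optional
    max_comparisons budget stops processing once it is exhausted.
    """
    seen = set()
    for token in sorted(blocks, key=lambda t: (len(blocks[t]), t)):
        for pair in combinations(sorted(blocks[token]), 2):
            if pair in seen:
                continue  # this pair was already produced by an earlier block
            if max_comparisons is not None and len(seen) >= max_comparisons:
                return
            seen.add(pair)
            yield pair


# Hypothetical records with different schemata.
entities = {
    "e1": {"name": "Alice Smith", "city": "Berlin"},
    "e2": {"fullName": "A. Smith", "location": "Berlin, Germany"},
    "e3": {"title": "Web Search", "venue": "Hong Kong"},
}
blocks = build_token_blocks(entities)
pairs = list(candidate_pairs(blocks))
```

With these three records, only `e1` and `e2` share tokens, so a single candidate pair is produced instead of the three pairs an exhaustive comparison would need. A real matcher would then decide whether each candidate pair actually refers to the same entity.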
Keywords
- Attribute-Agnostic blocking, Data cleaning, Entity resolution
ASJC Scopus subject areas
- Computer Science(all)
- Computer Networks and Communications
- Computer Science Applications
- Software
Cite this
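Efficient entity resolution for large heterogeneous information spaces. / Papadakis, George; Ioannou, Ekaterini; Niederée, Claudia; Fankhauser, Peter. Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011. 2011. p. 535-544 (Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011).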
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Efficient entity resolution for large heterogeneous information spaces
AU - Papadakis, George
AU - Ioannou, Ekaterini
AU - Niederée, Claudia
AU - Fankhauser, Peter
PY - 2011/2
Y1 - 2011/2
AB - We have recently witnessed an enormous growth in the volume of structured and semi-structured data sets available on the Web. An important prerequisite for using and combining such data sets is the detection and merging of information that describes the same real-world entities, a task known as Entity Resolution. To make this quadratic task efficient, blocking techniques are typically employed. However, the high dynamics, loose schema binding, and heterogeneity of (semi-)structured data pose new challenges for entity resolution. Existing blocking approaches become inapplicable because they rely on the homogeneity of the considered data and on a priori known schemata. In this paper, we introduce a novel approach for entity resolution that scales to large, noisy, and heterogeneous information spaces. It combines an attribute-agnostic mechanism for building blocks with intelligent block processing techniques that boost blocks with high expected utility, propagate knowledge about identified matches, and preempt the resolution process when it gets too expensive. Our extensive evaluation on real-world, large, heterogeneous data sets verifies that the suggested approach is both effective and efficient.
KW - Attribute-Agnostic blocking
KW - Data cleaning
KW - Entity resolution
UR - http://www.scopus.com/inward/record.url?scp=79952386495&partnerID=8YFLogxK
U2 - 10.1145/1935826.1935903
DO - 10.1145/1935826.1935903
M3 - Conference contribution
AN - SCOPUS:79952386495
SN - 9781450304931
T3 - Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011
SP - 535
EP - 544
BT - Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011
T2 - 4th ACM International Conference on Web Search and Data Mining, WSDM 2011
Y2 - 9 February 2011 through 12 February 2011
ER -