Details
Original language | English |
---|---|
Title of host publication | JCDL'11 - Proceedings of the 2011 ACM/IEEE Joint Conference on Digital Libraries |
Pages | 85-94 |
Number of pages | 10 |
Publication status | Published - 13 June 2011 |
Event | 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL'11 - Ottawa, ON, Canada. Duration: 13 June 2011 → 17 June 2011 |
Publication series
Name | Proceedings of the ACM/IEEE Joint Conference on Digital Libraries |
---|---|
ISSN (Print) | 1552-5996 |
Abstract
Entity resolution is the task of identifying entities that refer to the same real-world object. It has important applications in the context of digital libraries, such as citation matching and author disambiguation. Blocking is an established methodology for efficiently addressing this problem; it clusters similar entities together, and compares solely entities inside each cluster. In order to effectively deal with the current large, noisy and heterogeneous data collections, novel blocking methods that rely on redundancy have been introduced: they associate each entity with multiple blocks in order to increase recall, thus increasing the computational cost, as well. In this paper, we introduce novel techniques that remove the superfluous comparisons from any redundancy-based blocking method. They improve the time-efficiency of the latter without any impact on the end result. We present the optimal solution to this problem that discards all redundant comparisons at the cost of quadratic space complexity. For applications with space limitations, we also present an alternative, lightweight solution that operates at the abstract level of blocks in order to discard a significant part of the redundant comparisons. We evaluate our techniques on two large, real-world data sets and verify the significant improvements they convey when integrated into existing blocking methods.
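The redundancy the abstract describes can be illustrated with a small sketch. This is not the authors' algorithm; it only shows, under simple assumptions, why overlapping blocks repeat comparisons and how remembering every pair already seen removes the repeats at the price of memory that may grow quadratically with the number of entities. The entity identifiers, the `blocks` data, and both function names are hypothetical.

```python
from itertools import combinations


def naive_comparisons(blocks):
    """Yield every pair inside every block, repeats included."""
    for block in blocks:
        # Sorting gives each pair a canonical (smaller, larger) order.
        for a, b in combinations(sorted(block), 2):
            yield (a, b)


def deduplicated_comparisons(blocks):
    """Yield each distinct pair once by remembering the pairs already seen."""
    seen = set()  # assumption: all executed pairs fit in memory
    for pair in naive_comparisons(blocks):
        if pair not in seen:
            seen.add(pair)
            yield pair


# Hypothetical overlapping blocks: e2 and e3 share two blocks, as do e1 and e3.
blocks = [{"e1", "e2", "e3"}, {"e2", "e3", "e4"}, {"e1", "e3"}]

naive = list(naive_comparisons(blocks))
unique = list(deduplicated_comparisons(blocks))
print(len(naive), len(unique))  # prints "7 5"
```

Both functions cover the same set of distinct pairs, so the result of entity resolution is unchanged; only the repeated work is skipped.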
ASJC Scopus subject areas
- Engineering (all)
- General Mechanical Engineering
Cite this
JCDL'11 - Proceedings of the 2011 ACM/IEEE Joint Conference on Digital Libraries. 2011. pp. 85-94 (Proceedings of the ACM/IEEE Joint Conference on Digital Libraries).
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › Peer-reviewed
TY - GEN
T1 - Eliminating the Redundancy in Blocking-based Entity Resolution Methods
AU - Papadakis, George
AU - Ioannou, Ekaterini
AU - Niederée, Claudia
AU - Palpanas, Themis
AU - Nejdl, Wolfgang
PY - 2011/6/13
Y1 - 2011/6/13
AB - Entity resolution is the task of identifying entities that refer to the same real-world object. It has important applications in the context of digital libraries, such as citation matching and author disambiguation. Blocking is an established methodology for efficiently addressing this problem; it clusters similar entities together, and compares solely entities inside each cluster. In order to effectively deal with the current large, noisy and heterogeneous data collections, novel blocking methods that rely on redundancy have been introduced: they associate each entity with multiple blocks in order to increase recall, thus increasing the computational cost, as well. In this paper, we introduce novel techniques that remove the superfluous comparisons from any redundancy-based blocking method. They improve the time-efficiency of the latter without any impact on the end result. We present the optimal solution to this problem that discards all redundant comparisons at the cost of quadratic space complexity. For applications with space limitations, we also present an alternative, lightweight solution that operates at the abstract level of blocks in order to discard a significant part of the redundant comparisons. We evaluate our techniques on two large, real-world data sets and verify the significant improvements they convey when integrated into existing blocking methods.
KW - data cleaning
KW - entity resolution
KW - redundancy-based blocking
UR - http://www.scopus.com/inward/record.url?scp=79960519872&partnerID=8YFLogxK
U2 - 10.1145/1998076.1998093
DO - 10.1145/1998076.1998093
M3 - Conference contribution
AN - SCOPUS:79960519872
SN - 9781450307444
T3 - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
SP - 85
EP - 94
BT - JCDL'11 - Proceedings of the 2011 ACM/IEEE Joint Conference on Digital Libraries
T2 - 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL'11
Y2 - 13 June 2011 through 17 June 2011
ER -