Details
Original language | English |
---|---|
Title of host publication | CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management |
Pages | 1970-1974 |
Number of pages | 5 |
Publication status | Published - 29 Oct 2012 |
Event | 21st ACM International Conference on Information and Knowledge Management, CIKM 2012 - Maui, HI, United States |
Duration | 29 Oct 2012 → 2 Nov 2012 |
Publication series
Name | ACM International Conference Proceeding Series |
---|---|
Abstract
Detecting duplicate entities, usually by examining metadata, has been the focus of much recent work. Several methods try to identify duplicate entities, focusing either on accuracy or on efficiency and speed, and no solution is yet perfect. We propose a combined, layered approach to duplicate detection whose main advantage is its use of crowdsourcing as a training and feedback mechanism. By applying active learning techniques to human-provided examples, we fine-tune our algorithm toward better duplicate detection accuracy. We keep the training cost low by gathering training data on demand for borderline cases or for inconclusive assessments. We apply our simple and powerful methods to an online publication search system. First, we perform a coarse duplicate detection in real time, relying on publication signatures. Then, a second automatic step compares duplicate candidates and increases accuracy, adjusting based on feedback from both our online users and crowdsourcing platforms. Our approach shows an improvement of 14% over the untrained setting and differs from the human assessors by only 4% in accuracy.
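The layered flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the signature function, the title-only similarity measure, the threshold and margin values, and the threshold update rule are all hypothetical stand-ins chosen for illustration. The sketch groups records by a coarse signature, compares candidates within each group, and sets aside borderline pairs for human assessment.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def signature(record):
    """Coarse blocking key (hypothetical): normalized title prefix plus year."""
    title = re.sub(r"[^a-z0-9]", "", record["title"].lower())
    return title[:12], record["year"]


def similarity(a, b):
    """Title similarity in [0, 1]; a stand-in for a richer metadata comparison."""
    return SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()


def detect_duplicates(records, threshold=0.85, margin=0.08):
    """Return (duplicates, borderline) as lists of record index pairs."""
    # Step 1: coarse detection, only records sharing a signature are compared.
    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        blocks[signature(rec)].append(i)

    # Step 2: compare candidates; pairs scoring near the threshold are
    # inconclusive and would be sent to human assessors.
    duplicates, borderline = [], []
    for ids in blocks.values():
        for i, j in combinations(ids, 2):
            score = similarity(records[i], records[j])
            if abs(score - threshold) < margin:
                borderline.append((i, j))
            elif score >= threshold:
                duplicates.append((i, j))
    return duplicates, borderline


def update_threshold(threshold, score, human_says_duplicate, step=0.02):
    """Nudge the threshold when a human label contradicts the automatic decision.

    A deliberately simple substitute for the active learning step.
    """
    if human_says_duplicate and score < threshold:
        return threshold - step
    if not human_says_duplicate and score >= threshold:
        return threshold + step
    return threshold
```

In this sketch, only the pairs returned in `borderline` incur labeling cost, which mirrors the abstract's point about gathering training data on demand.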
Keywords
- active learning, crowdsourcing, duplicate detection, machine learning, optimization
ASJC Scopus subject areas
- Computer Science(all)
- Software
- Human-Computer Interaction
- Computer Vision and Pattern Recognition
- Computer Networks and Communications
Cite this
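Map to Humans and Reduce Error. / Georgescu, Mihai; Pham, Dang Duc; Firan, Claudiu S.; Nejdl, Wolfgang; Gaugaz, Julien. CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management. 2012. p. 1970-1974 (ACM International Conference Proceeding Series).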
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Map to Humans and Reduce Error
T2 - 21st ACM International Conference on Information and Knowledge Management, CIKM 2012
AU - Georgescu, Mihai
AU - Pham, Dang Duc
AU - Firan, Claudiu S.
AU - Nejdl, Wolfgang
AU - Gaugaz, Julien
PY - 2012/10/29
Y1 - 2012/10/29
AB - Detecting duplicate entities, usually by examining metadata, has been the focus of much recent work. Several methods try to identify duplicate entities, focusing either on accuracy or on efficiency and speed, and no solution is yet perfect. We propose a combined, layered approach to duplicate detection whose main advantage is its use of crowdsourcing as a training and feedback mechanism. By applying active learning techniques to human-provided examples, we fine-tune our algorithm toward better duplicate detection accuracy. We keep the training cost low by gathering training data on demand for borderline cases or for inconclusive assessments. We apply our simple and powerful methods to an online publication search system. First, we perform a coarse duplicate detection in real time, relying on publication signatures. Then, a second automatic step compares duplicate candidates and increases accuracy, adjusting based on feedback from both our online users and crowdsourcing platforms. Our approach shows an improvement of 14% over the untrained setting and differs from the human assessors by only 4% in accuracy.
KW - active learning
KW - crowdsourcing
KW - duplicate detection
KW - machine learning
KW - optimization
UR - http://www.scopus.com/inward/record.url?scp=84871066615&partnerID=8YFLogxK
U2 - 10.1145/2396761.2398554
DO - 10.1145/2396761.2398554
M3 - Conference contribution
AN - SCOPUS:84871066615
SN - 9781450311564
T3 - ACM International Conference Proceeding Series
SP - 1970
EP - 1974
BT - CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management
Y2 - 29 October 2012 through 2 November 2012
ER -