Map to Humans and Reduce Error: Crowdsourcing for Deduplication Applied to Digital Libraries

Research output: Chapter in book/report/conference proceeding > Conference contribution > Research > peer review

Authors

  • Mihai Georgescu
  • Dang Duc Pham
  • Claudiu S. Firan
  • Wolfgang Nejdl
  • Julien Gaugaz

Details

Original language: English
Title of host publication: CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management
Pages: 1970-1974
Number of pages: 5
Publication status: Published - 29 Oct 2012
Event: 21st ACM International Conference on Information and Knowledge Management, CIKM 2012 - Maui, HI, United States
Duration: 29 Oct 2012 to 2 Nov 2012

Publication series

Name: ACM International Conference Proceeding Series

Abstract

Detecting duplicate entities, usually by examining metadata, has been the focus of much recent work. Several methods try to identify duplicate entities, focusing either on accuracy or on efficiency and speed, but none offers a perfect solution. We propose a combined, layered approach to duplicate detection whose main advantage is the use of crowdsourcing as a training and feedback mechanism. By applying active learning techniques to human-provided examples, we fine-tune our algorithm toward better duplicate detection accuracy. We keep the training cost low by gathering training data on demand for borderline cases or inconclusive assessments. We apply our simple and powerful methods to an online publication search system. First, we perform a coarse duplicate detection in real time, relying on publication signatures. Then, a second automatic step compares duplicate candidates and increases accuracy, adjusting based on feedback from both our online users and crowdsourcing platforms. Our approach shows an improvement of 14% over the untrained setting and comes within 4% of the human assessors' accuracy.
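
The layered pipeline described above can be illustrated with a minimal Python sketch. The abstract does not specify how signatures are built, how candidates are compared, or how training adjusts the algorithm, so everything below is an assumption made for illustration: the signature (year plus a few title tokens), the difflib-based similarity, the threshold and margin values, and all function and field names (signature, candidate_pairs, similarity, triage, refit_threshold, "id", "title", "authors", "year") are hypothetical and are not taken from the paper.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def signature(pub):
    # Hypothetical coarse signature: year plus the three alphabetically
    # first title tokens longer than three characters.
    tokens = re.findall(r"[a-z0-9]+", pub["title"].lower())
    long_tokens = sorted(t for t in tokens if len(t) > 3)
    return (pub["year"], tuple(long_tokens[:3]))


def candidate_pairs(pubs):
    # Stage 1: only publications sharing a signature are compared at all.
    buckets = defaultdict(list)
    for pub in pubs:
        buckets[signature(pub)].append(pub)
    for group in buckets.values():
        yield from combinations(group, 2)


def similarity(a, b):
    # Stage 2: mean string similarity of title and authors, in [0, 1].
    def ratio(x, y):
        return SequenceMatcher(None, x.lower(), y.lower()).ratio()
    return 0.5 * ratio(a["title"], b["title"]) + 0.5 * ratio(a["authors"], b["authors"])


def triage(pubs, threshold=0.8, margin=0.1):
    # Pairs scoring within `margin` of the threshold are treated as
    # borderline and set aside for human (crowd) assessment.
    duplicates, distinct, ask_crowd = [], [], []
    for a, b in candidate_pairs(pubs):
        score = similarity(a, b)
        pair = (a["id"], b["id"])
        if abs(score - threshold) < margin:
            ask_crowd.append(pair)
        elif score >= threshold:
            duplicates.append(pair)
        else:
            distinct.append(pair)
    return duplicates, distinct, ask_crowd


def refit_threshold(labeled_scores, grid=None):
    # labeled_scores: (score, is_duplicate) pairs built from crowd answers.
    # Returns the grid threshold that misclassifies the fewest labeled pairs.
    grid = grid or [i / 20 for i in range(1, 20)]

    def errors(t):
        return sum((score >= t) != label for score, label in labeled_scores)

    return min(grid, key=errors)
```

In this sketch, only the pairs returned in ask_crowd would be sent to human assessors, which mirrors the idea of gathering training data on demand; their answers could then be passed to refit_threshold before the next run. A single refitted threshold is a deliberately simple stand-in for the active learning step and should not be read as the authors' method.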

Keywords

    active learning, crowdsourcing, duplicate detection, machine learning, optimization

Cite this

Map to Humans and Reduce Error: Crowdsourcing for Deduplication Applied to Digital Libraries. / Georgescu, Mihai; Pham, Dang Duc; Firan, Claudiu S. et al.
CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management. 2012. p. 1970-1974 (ACM International Conference Proceeding Series).

Georgescu, M, Pham, DD, Firan, CS, Nejdl, W & Gaugaz, J 2012, Map to Humans and Reduce Error: Crowdsourcing for Deduplication Applied to Digital Libraries. in CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management. ACM International Conference Proceeding Series, pp. 1970-1974, 21st ACM International Conference on Information and Knowledge Management, CIKM 2012, Maui, HI, United States, 29 Oct 2012. https://doi.org/10.1145/2396761.2398554
Georgescu, M., Pham, D. D., Firan, C. S., Nejdl, W., & Gaugaz, J. (2012). Map to Humans and Reduce Error: Crowdsourcing for Deduplication Applied to Digital Libraries. In CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management (pp. 1970-1974). (ACM International Conference Proceeding Series). https://doi.org/10.1145/2396761.2398554
Georgescu M, Pham DD, Firan CS, Nejdl W, Gaugaz J. Map to Humans and Reduce Error: Crowdsourcing for Deduplication Applied to Digital Libraries. In CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management. 2012. p. 1970-1974. (ACM International Conference Proceeding Series). doi: 10.1145/2396761.2398554
Georgescu, Mihai; Pham, Dang Duc; Firan, Claudiu S. et al. / Map to Humans and Reduce Error: Crowdsourcing for Deduplication Applied to Digital Libraries. CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management. 2012. pp. 1970-1974 (ACM International Conference Proceeding Series).
@inproceedings{5076806322644a9c870d9f5486e40f41,
title = "Map to Humans and Reduce Error: Crowdsourcing for Deduplication Applied to Digital Libraries",
abstract = "Detecting duplicate entities, usually by examining metadata, has been the focus of much recent work. Several methods try to identify duplicate entities, while focusing either on accuracy or on efficiency and speed - with still no perfect solution. We propose a combined layered approach for duplicate detection with the main advantage of using Crowdsourcing as a training and feedback mechanism. By using Active Learning techniques on human provided examples, we fine tune our algorithm toward better duplicate detection accuracy. We keep the training cost low by gathering training data on demand for borderline cases or for inconclusive assessments. We apply our simple and powerful methods to an online publication search system: First, we perform a coarse duplicate detection relying on publication signatures in real time. Then, a second automatic step compares duplicate candidates and increases accuracy while adjusting based on both feedback from our online users and from Crowdsourcing platforms. Our approach shows an improvement of 14% over the untrained setting and is at only 4% difference to the human assessors in accuracy.",
keywords = "active learning, crowdsourcing, duplicate detection, machine learning, optimization",
author = "Mihai Georgescu and Pham, {Dang Duc} and Firan, {Claudiu S.} and Wolfgang Nejdl and Julien Gaugaz",
year = "2012",
month = oct,
day = "29",
doi = "10.1145/2396761.2398554",
language = "English",
isbn = "9781450311564",
series = "ACM International Conference Proceeding Series",
pages = "1970--1974",
booktitle = "CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management",
note = "21st ACM International Conference on Information and Knowledge Management, CIKM 2012 ; Conference date: 29-10-2012 Through 02-11-2012",

}

TY - GEN

T1 - Map to Humans and Reduce Error

T2 - 21st ACM International Conference on Information and Knowledge Management, CIKM 2012

AU - Georgescu, Mihai

AU - Pham, Dang Duc

AU - Firan, Claudiu S.

AU - Nejdl, Wolfgang

AU - Gaugaz, Julien

PY - 2012/10/29

Y1 - 2012/10/29

AB - Detecting duplicate entities, usually by examining metadata, has been the focus of much recent work. Several methods try to identify duplicate entities, while focusing either on accuracy or on efficiency and speed - with still no perfect solution. We propose a combined layered approach for duplicate detection with the main advantage of using Crowdsourcing as a training and feedback mechanism. By using Active Learning techniques on human provided examples, we fine tune our algorithm toward better duplicate detection accuracy. We keep the training cost low by gathering training data on demand for borderline cases or for inconclusive assessments. We apply our simple and powerful methods to an online publication search system: First, we perform a coarse duplicate detection relying on publication signatures in real time. Then, a second automatic step compares duplicate candidates and increases accuracy while adjusting based on both feedback from our online users and from Crowdsourcing platforms. Our approach shows an improvement of 14% over the untrained setting and is at only 4% difference to the human assessors in accuracy.

KW - active learning

KW - crowdsourcing

KW - duplicate detection

KW - machine learning

KW - optimization

UR - http://www.scopus.com/inward/record.url?scp=84871066615&partnerID=8YFLogxK

U2 - 10.1145/2396761.2398554

DO - 10.1145/2396761.2398554

M3 - Conference contribution

AN - SCOPUS:84871066615

SN - 9781450311564

T3 - ACM International Conference Proceeding Series

SP - 1970

EP - 1974

BT - CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management

Y2 - 29 October 2012 through 2 November 2012

ER -
