Map to Humans and Reduce Error: Crowdsourcing for Deduplication Applied to Digital Libraries

Mihai Georgescu; Dang Duc Pham; Claudiu S. Firan; Wolfgang Nejdl; Julien Gaugaz

doi:10.1145/2396761.2398554

Details

Original language	English
Title of host publication	CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management
Pages	1970-1974
Number of pages	5
Publication status	Published - 29 Oct 2012
Event	21st ACM International Conference on Information and Knowledge Management, CIKM 2012 - Maui, HI, United States
Duration	29 Oct 2012 → 2 Nov 2012

Publication series

Name	ACM International Conference Proceeding Series

Abstract

Detecting duplicate entities, usually by examining metadata, has been the focus of much recent work. Several methods try to identify duplicate entities, while focusing either on accuracy or on efficiency and speed - with still no perfect solution. We propose a combined layered approach for duplicate detection with the main advantage of using Crowdsourcing as a training and feedback mechanism. By using Active Learning techniques on human provided examples, we fine tune our algorithm toward better duplicate detection accuracy. We keep the training cost low by gathering training data on demand for borderline cases or for inconclusive assessments. We apply our simple and powerful methods to an online publication search system: First, we perform a coarse duplicate detection relying on publication signatures in real time. Then, a second automatic step compares duplicate candidates and increases accuracy while adjusting based on both feedback from our online users and from Crowdsourcing platforms. Our approach shows an improvement of 14% over the untrained setting and is at only 4% difference to the human assessors in accuracy.
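
The layered idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The signature function (sorted title tokens), the Jaccard comparison on author strings, the thresholds `low` and `high`, and the toy records are all assumptions made for illustration, and the Active Learning step that would adjust the second stage from crowd answers is left out.

```python
import re
from collections import defaultdict
from itertools import combinations


def tokens(text):
    """Lowercase alphanumeric tokens of a string."""
    return re.findall(r"[a-z0-9]+", text.lower())


def signature(title):
    """Coarse blocking key (assumed form): the title's tokens, sorted and joined."""
    return " ".join(sorted(tokens(title)))


def jaccard(a, b):
    """Jaccard similarity of the token sets of two strings."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def route(score, low=0.3, high=0.8):
    """Decide automatically when the score is clear; otherwise ask the crowd.

    The thresholds are hypothetical, not values from the paper.
    """
    if score >= high:
        return "duplicate"
    if score <= low:
        return "distinct"
    return "ask_crowd"


def deduplicate(records, low=0.3, high=0.8):
    """Step 1: bucket records by signature. Step 2: compare pairs within a bucket."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[signature(rec["title"])].append(rec)
    decisions = []
    for bucket in buckets.values():
        for a, b in combinations(bucket, 2):
            score = jaccard(a["authors"], b["authors"])
            decisions.append((a["id"], b["id"], route(score, low, high)))
    return decisions


# Toy records, invented purely for this example.
records = [
    {"id": 1, "title": "Deduplication in Digital Libraries", "authors": "A Smith B Jones"},
    {"id": 2, "title": "deduplication in digital libraries.", "authors": "A Smith B Jones"},
    {"id": 3, "title": "Digital Libraries in Deduplication", "authors": "A Smith C Brown"},
    {"id": 4, "title": "Crowdsourcing for Entity Matching", "authors": "D Lee"},
]
decisions = deduplicate(records)
print(decisions)
```

Only pairs that share a signature are compared, which keeps the first step cheap. Only pairs whose score falls between the two thresholds would generate a crowdsourcing task, which mirrors the abstract's point about gathering training data on demand for borderline cases.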

Keywords

active learning, crowdsourcing, duplicate detection, machine learning, optimization

ASJC Scopus subject areas

Computer Science(all): Software
Computer Science(all): Human-Computer Interaction
Computer Science(all): Computer Vision and Pattern Recognition
Computer Science(all): Computer Networks and Communications

Cite this

Map to Humans and Reduce Error: Crowdsourcing for Deduplication Applied to Digital Libraries. / Georgescu, Mihai; Pham, Dang Duc; Firan, Claudiu S. et al.
CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management. 2012. p. 1970-1974 (ACM International Conference Proceeding Series).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Georgescu, M, Pham, DD, Firan, CS, Nejdl, W & Gaugaz, J 2012, Map to Humans and Reduce Error: Crowdsourcing for Deduplication Applied to Digital Libraries. in CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management. ACM International Conference Proceeding Series, pp. 1970-1974, 21st ACM International Conference on Information and Knowledge Management, CIKM 2012, Maui, HI, United States, 29 Oct 2012. https://doi.org/10.1145/2396761.2398554

Georgescu, M., Pham, D. D., Firan, C. S., Nejdl, W., & Gaugaz, J. (2012). Map to Humans and Reduce Error: Crowdsourcing for Deduplication Applied to Digital Libraries. In CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management (pp. 1970-1974). (ACM International Conference Proceeding Series). https://doi.org/10.1145/2396761.2398554

Georgescu M, Pham DD, Firan CS, Nejdl W, Gaugaz J. Map to Humans and Reduce Error: Crowdsourcing for Deduplication Applied to Digital Libraries. In CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management. 2012. p. 1970-1974. (ACM International Conference Proceeding Series). doi: 10.1145/2396761.2398554

Georgescu, Mihai ; Pham, Dang Duc ; Firan, Claudiu S. et al. / Map to Humans and Reduce Error : Crowdsourcing for Deduplication Applied to Digital Libraries. CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management. 2012. pp. 1970-1974 (ACM International Conference Proceeding Series).

Download

@inproceedings{5076806322644a9c870d9f5486e40f41,

title = "Map to Humans and Reduce Error: Crowdsourcing for Deduplication Applied to Digital Libraries",

abstract = "Detecting duplicate entities, usually by examining metadata, has been the focus of much recent work. Several methods try to identify duplicate entities, while focusing either on accuracy or on efficiency and speed - with still no perfect solution. We propose a combined layered approach for duplicate detection with the main advantage of using Crowdsourcing as a training and feedback mechanism. By using Active Learning techniques on human provided examples, we fine tune our algorithm toward better duplicate detection accuracy. We keep the training cost low by gathering training data on demand for borderline cases or for inconclusive assessments. We apply our simple and powerful methods to an online publication search system: First, we perform a coarse duplicate detection relying on publication signatures in real time. Then, a second automatic step compares duplicate candidates and increases accuracy while adjusting based on both feedback from our online users and from Crowdsourcing platforms. Our approach shows an improvement of 14% over the untrained setting and is at only 4% difference to the human assessors in accuracy.",

keywords = "active learning, crowdsourcing, duplicate detection, machine learning, optimization",

author = "Georgescu, Mihai and Pham, {Dang Duc} and Firan, {Claudiu S.} and Nejdl, Wolfgang and Gaugaz, Julien",

year = "2012",

month = oct,

day = "29",

doi = "10.1145/2396761.2398554",

language = "English",

isbn = "9781450311564",

series = "ACM International Conference Proceeding Series",

pages = "1970--1974",

booktitle = "CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management",

note = "21st ACM International Conference on Information and Knowledge Management, CIKM 2012 ; Conference date: 29-10-2012 Through 02-11-2012",

}

Download

TY - GEN

T1 - Map to Humans and Reduce Error: Crowdsourcing for Deduplication Applied to Digital Libraries

T2 - 21st ACM International Conference on Information and Knowledge Management, CIKM 2012

AU - Georgescu, Mihai

AU - Pham, Dang Duc

AU - Firan, Claudiu S.

AU - Nejdl, Wolfgang

AU - Gaugaz, Julien

PY - 2012/10/29

Y1 - 2012/10/29

N2 - Detecting duplicate entities, usually by examining metadata, has been the focus of much recent work. Several methods try to identify duplicate entities, while focusing either on accuracy or on efficiency and speed - with still no perfect solution. We propose a combined layered approach for duplicate detection with the main advantage of using Crowdsourcing as a training and feedback mechanism. By using Active Learning techniques on human provided examples, we fine tune our algorithm toward better duplicate detection accuracy. We keep the training cost low by gathering training data on demand for borderline cases or for inconclusive assessments. We apply our simple and powerful methods to an online publication search system: First, we perform a coarse duplicate detection relying on publication signatures in real time. Then, a second automatic step compares duplicate candidates and increases accuracy while adjusting based on both feedback from our online users and from Crowdsourcing platforms. Our approach shows an improvement of 14% over the untrained setting and is at only 4% difference to the human assessors in accuracy.

AB - Detecting duplicate entities, usually by examining metadata, has been the focus of much recent work. Several methods try to identify duplicate entities, while focusing either on accuracy or on efficiency and speed - with still no perfect solution. We propose a combined layered approach for duplicate detection with the main advantage of using Crowdsourcing as a training and feedback mechanism. By using Active Learning techniques on human provided examples, we fine tune our algorithm toward better duplicate detection accuracy. We keep the training cost low by gathering training data on demand for borderline cases or for inconclusive assessments. We apply our simple and powerful methods to an online publication search system: First, we perform a coarse duplicate detection relying on publication signatures in real time. Then, a second automatic step compares duplicate candidates and increases accuracy while adjusting based on both feedback from our online users and from Crowdsourcing platforms. Our approach shows an improvement of 14% over the untrained setting and is at only 4% difference to the human assessors in accuracy.

KW - active learning

KW - crowdsourcing

KW - duplicate detection

KW - machine learning

KW - optimization

UR - http://www.scopus.com/inward/record.url?scp=84871066615&partnerID=8YFLogxK

U2 - 10.1145/2396761.2398554

DO - 10.1145/2396761.2398554

M3 - Conference contribution

AN - SCOPUS:84871066615

SN - 9781450311564

T3 - ACM International Conference Proceeding Series

SP - 1970

EP - 1974

BT - CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management

Y2 - 29 October 2012 through 2 November 2012

ER -


By the same author(s)

Harnessing Empathy and Ethics for Relevance Detection and Information Categorization in Climate and COVID-19 Tweets

Open benchmark for filtering techniques in entity resolution

Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Adaptive Dispatching of Mobile Charging Stations using Multi-Agent Graph Convolutional Cooperative-Competitive Reinforcement Learning

Robust Fusion of Time Series and Image Data for Improved Multimodal Clinical Prediction
