Citation needed: A taxonomy and algorithmic assessment of Wikipedia's verifiability

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Authors

  • Miriam Redi
  • Jonathan Morgan
  • Besnik Fetahu
  • Dario Taraborelli

External Research Organisations

  • Wikimedia Foundation

Details

Original language: English
Title of host publication: The Web Conference 2019
Subtitle of host publication: Proceedings of the World Wide Web Conference, WWW 2019
Editors: Ling Liu, Ryen White
Place of publication: New York
Pages: 1567-1578
Number of pages: 12
ISBN (electronic): 9781450366748
Publication status: Published - 13 May 2019
Event: 2019 World Wide Web Conference, WWW 2019 - San Francisco, United States
Duration: 13 May 2019 – 17 May 2019

Abstract

Wikipedia is playing an increasingly central role on the web, and the policies its contributors follow when sourcing and fact-checking content affect millions of readers. Among these core guiding principles, verifiability policies have a particularly important role. Verifiability requires that information included in a Wikipedia article be corroborated against reliable secondary sources. Because of the manual labor needed to curate Wikipedia at scale, however, its contents do not always evenly comply with these policies. Citations (i.e., references to external sources) may not conform to verifiability requirements or may be missing altogether, potentially weakening the reliability of specific topic areas of the free encyclopedia. In this paper, we aim to provide an empirical characterization of the reasons why and how Wikipedia cites external sources to comply with its own verifiability guidelines. First, we construct a taxonomy of reasons why inline citations are required, by collecting labeled data from editors of multiple Wikipedia language editions. We then crowdsource a large-scale dataset of Wikipedia sentences annotated with categories derived from this taxonomy. Finally, we design algorithmic models to determine if a statement requires a citation, and to predict the citation reason. We evaluate the accuracy of such models across different classes of Wikipedia articles of varying quality, and on external datasets of claims annotated for fact-checking purposes.
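The citation-need detection task described in the abstract can be illustrated with a minimal sentence classifier. The sketch below is not the paper's model (the authors train neural networks on crowdsourced labels); it is a toy rule-based heuristic, and the cue patterns are invented purely for demonstration.

```python
import re

# Toy heuristic for the "does this statement need a citation?" task.
# NOT the paper's approach: the cue patterns below are illustrative
# stand-ins for the kinds of signals a learned model might pick up.
CUE_PATTERNS = [
    r"\b\d+(\.\d+)?%",               # statistics, e.g. "45% of readers"
    r"\baccording to\b",             # attribution to an external source
    r"\b(study|survey|report)s?\b",  # references to research findings
    r"\b(first|largest|most)\b",     # superlative or record claims
]

def needs_citation(sentence: str) -> bool:
    """Return True if the sentence matches any claim-like cue pattern."""
    s = sentence.lower()
    return any(re.search(p, s) for p in CUE_PATTERNS)
```

A real system, as the abstract indicates, would be trained on labeled sentences and would additionally predict *why* a citation is needed, using the taxonomy's categories as classes.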

Keywords

    Citations, Crowdsourcing, Neural Networks, Wikipedia

Cite this

Citation needed: A taxonomy and algorithmic assessment of Wikipedia's verifiability. / Redi, Miriam; Morgan, Jonathan; Fetahu, Besnik et al.
The Web Conference 2019: Proceedings of the World Wide Web Conference, WWW 2019. ed. / Ling Liu; Ryen White. New York, 2019. p. 1567-1578.

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Redi, M, Morgan, J, Fetahu, B & Taraborelli, D 2019, Citation needed: A taxonomy and algorithmic assessment of Wikipedia's verifiability. in L Liu & R White (eds), The Web Conference 2019: Proceedings of the World Wide Web Conference, WWW 2019. New York, pp. 1567-1578, 2019 World Wide Web Conference, WWW 2019, San Francisco, United States, 13 May 2019. https://doi.org/10.1145/3308558.3313618
Redi, M., Morgan, J., Fetahu, B., & Taraborelli, D. (2019). Citation needed: A taxonomy and algorithmic assessment of Wikipedia's verifiability. In L. Liu, & R. White (Eds.), The Web Conference 2019: Proceedings of the World Wide Web Conference, WWW 2019 (pp. 1567-1578). https://doi.org/10.1145/3308558.3313618
Redi M, Morgan J, Fetahu B, Taraborelli D. Citation needed: A taxonomy and algorithmic assessment of Wikipedia's verifiability. In Liu L, White R, editors, The Web Conference 2019: Proceedings of the World Wide Web Conference, WWW 2019. New York. 2019. p. 1567-1578 doi: 10.1145/3308558.3313618
Redi, Miriam ; Morgan, Jonathan ; Fetahu, Besnik et al. / Citation needed : A taxonomy and algorithmic assessment of Wikipedia's verifiability. The Web Conference 2019: Proceedings of the World Wide Web Conference, WWW 2019. editor / Ling Liu ; Ryen White. New York, 2019. pp. 1567-1578
BibTeX
@inproceedings{e865de039f9f400d8377219b746dc472,
title = "Citation needed: A taxonomy and algorithmic assessment of Wikipedia's verifiability",
abstract = "Wikipedia is playing an increasingly central role on the web, and the policies its contributors follow when sourcing and fact-checking content affect millions of readers. Among these core guiding principles, verifiability policies have a particularly important role. Verifiability requires that information included in a Wikipedia article be corroborated against reliable secondary sources. Because of the manual labor needed to curate Wikipedia at scale, however, its contents do not always evenly comply with these policies. Citations (i.e., references to external sources) may not conform to verifiability requirements or may be missing altogether, potentially weakening the reliability of specific topic areas of the free encyclopedia. In this paper, we aim to provide an empirical characterization of the reasons why and how Wikipedia cites external sources to comply with its own verifiability guidelines. First, we construct a taxonomy of reasons why inline citations are required, by collecting labeled data from editors of multiple Wikipedia language editions. We then crowdsource a large-scale dataset of Wikipedia sentences annotated with categories derived from this taxonomy. Finally, we design algorithmic models to determine if a statement requires a citation, and to predict the citation reason. We evaluate the accuracy of such models across different classes of Wikipedia articles of varying quality, and on external datasets of claims annotated for fact-checking purposes.",
keywords = "Citations, Crowdsourcing, Neural Networks, Wikipedia",
author = "Miriam Redi and Jonathan Morgan and Besnik Fetahu and Dario Taraborelli",
note = "Funding information: We would like to thank the community members of the English, French and Italian Wikipedia for helping with data labeling and for their precious suggestions, and Bahodir Mansurov and Aaron Halfaker from the Wikimedia Foundation, for their help building the WikiLabels task. This work is partly funded by the ERC Advanced Grant ALEXANDRIA (grant no. 339233), and BMBF Simple-ML project (grant no. 01IS18054A).; 2019 World Wide Web Conference, WWW 2019 ; Conference date: 13-05-2019 Through 17-05-2019",
year = "2019",
month = may,
day = "13",
doi = "10.1145/3308558.3313618",
language = "English",
pages = "1567--1578",
editor = "Ling Liu and Ryen White",
booktitle = "The Web Conference 2019",
address = "New York",

}

RIS

TY - GEN

T1 - Citation needed

T2 - 2019 World Wide Web Conference, WWW 2019

AU - Redi, Miriam

AU - Morgan, Jonathan

AU - Fetahu, Besnik

AU - Taraborelli, Dario

N1 - Funding information: We would like to thank the community members of the English, French and Italian Wikipedia for helping with data labeling and for their precious suggestions, and Bahodir Mansurov and Aaron Halfaker from the Wikimedia Foundation, for their help building the WikiLabels task. This work is partly funded by the ERC Advanced Grant ALEXANDRIA (grant no. 339233), and BMBF Simple-ML project (grant no. 01IS18054A).

PY - 2019/5/13

Y1 - 2019/5/13

N2 - Wikipedia is playing an increasingly central role on the web, and the policies its contributors follow when sourcing and fact-checking content affect millions of readers. Among these core guiding principles, verifiability policies have a particularly important role. Verifiability requires that information included in a Wikipedia article be corroborated against reliable secondary sources. Because of the manual labor needed to curate Wikipedia at scale, however, its contents do not always evenly comply with these policies. Citations (i.e., references to external sources) may not conform to verifiability requirements or may be missing altogether, potentially weakening the reliability of specific topic areas of the free encyclopedia. In this paper, we aim to provide an empirical characterization of the reasons why and how Wikipedia cites external sources to comply with its own verifiability guidelines. First, we construct a taxonomy of reasons why inline citations are required, by collecting labeled data from editors of multiple Wikipedia language editions. We then crowdsource a large-scale dataset of Wikipedia sentences annotated with categories derived from this taxonomy. Finally, we design algorithmic models to determine if a statement requires a citation, and to predict the citation reason. We evaluate the accuracy of such models across different classes of Wikipedia articles of varying quality, and on external datasets of claims annotated for fact-checking purposes.

AB - Wikipedia is playing an increasingly central role on the web, and the policies its contributors follow when sourcing and fact-checking content affect millions of readers. Among these core guiding principles, verifiability policies have a particularly important role. Verifiability requires that information included in a Wikipedia article be corroborated against reliable secondary sources. Because of the manual labor needed to curate Wikipedia at scale, however, its contents do not always evenly comply with these policies. Citations (i.e., references to external sources) may not conform to verifiability requirements or may be missing altogether, potentially weakening the reliability of specific topic areas of the free encyclopedia. In this paper, we aim to provide an empirical characterization of the reasons why and how Wikipedia cites external sources to comply with its own verifiability guidelines. First, we construct a taxonomy of reasons why inline citations are required, by collecting labeled data from editors of multiple Wikipedia language editions. We then crowdsource a large-scale dataset of Wikipedia sentences annotated with categories derived from this taxonomy. Finally, we design algorithmic models to determine if a statement requires a citation, and to predict the citation reason. We evaluate the accuracy of such models across different classes of Wikipedia articles of varying quality, and on external datasets of claims annotated for fact-checking purposes.

KW - Citations

KW - Crowdsourcing

KW - Neural Networks

KW - Wikipedia

UR - http://www.scopus.com/inward/record.url?scp=85066892225&partnerID=8YFLogxK

U2 - 10.1145/3308558.3313618

DO - 10.1145/3308558.3313618

M3 - Conference contribution

AN - SCOPUS:85066892225

SP - 1567

EP - 1578

BT - The Web Conference 2019

A2 - Liu, Ling

A2 - White, Ryen

CY - New York

Y2 - 13 May 2019 through 17 May 2019

ER -