Same but Different: Distant Supervision for Predicting and Understanding Entity Linking Difficulty

Research output: Chapter in book/report/conference proceeding; Conference contribution; Research; Peer review

Authors

  • Renato Stoffalette João
  • Pavlos Fafalios
  • Stefan Dietze

Research Organisations

External Research Organisations

  • GESIS - Leibniz Institute for the Social Sciences

Details

Original language: English
Title of host publication: SAC '19
Subtitle of host publication: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing
Place of Publication: New York
Publisher: Association for Computing Machinery (ACM)
Pages: 1019-1026
Number of pages: 8
ISBN (print): 978-1-4503-5933-7
Publication status: Published - 8 Apr 2019
Event: 34th Annual ACM Symposium on Applied Computing, SAC 2019 - Limassol, Cyprus
Duration: 8 Apr 2019 - 12 Apr 2019

Abstract

Entity Linking (EL) is the task of automatically identifying entity mentions in a piece of text and resolving them to a corresponding entity in a reference knowledge base like Wikipedia. There is a large number of EL tools available for different types of documents and domains, yet EL remains a challenging task where the lack of precision on particularly ambiguous mentions often spoils the usefulness of automated disambiguation results in real applications. A priori approximations of the difficulty to link a particular entity mention can facilitate flagging of critical cases as part of semi-automated EL systems, while detecting latent factors that affect the EL performance, like corpus-specific features, can provide insights on how to improve a system based on the special characteristics of the underlying corpus. In this paper, we first introduce a consensus-based method to generate difficulty labels for entity mentions on arbitrary corpora. The difficulty labels are then exploited as training data for a supervised classification task able to predict the EL difficulty of entity mentions using a variety of features. Experiments over a corpus of news articles show that EL difficulty can be estimated with high accuracy, revealing also latent features that affect EL performance. Finally, evaluation results demonstrate the effectiveness of the proposed method to inform semi-automated EL pipelines.
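
The consensus idea in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the labelling rules, so everything below is an assumption made for illustration: that labels come from agreement among several EL systems, the three-way easy/medium/hard scheme, the function name consensus_difficulty, and the system names sys_a, sys_b and sys_c. It is not the authors' implementation.

```python
from collections import Counter
from typing import Dict, Optional

# Hypothetical type: maps each EL system name to the entity it chose for one
# mention (None when that system produced no link).
SystemLinks = Dict[str, Optional[str]]


def consensus_difficulty(links: SystemLinks) -> str:
    """Label one mention by how strongly the EL systems agree on its entity.

    Assumed scheme (illustrative, not taken from the paper):
      - "easy":   every system returned the same entity
      - "hard":   no entity was chosen by more than one system,
                  or no system produced a link at all
      - "medium": partial agreement (anything in between)
    """
    if len(links) < 2:
        raise ValueError("consensus needs the output of at least two systems")

    # Count how many systems voted for each candidate entity.
    votes = Counter(entity for entity in links.values() if entity is not None)
    if not votes:
        return "hard"

    top_count = votes.most_common(1)[0][1]
    if top_count == len(links):
        return "easy"
    if top_count == 1:
        return "hard"
    return "medium"


# Toy input: three mentions, each linked by three hypothetical systems.
mentions = {
    "Paris": {"sys_a": "Paris", "sys_b": "Paris", "sys_c": "Paris"},
    "Jordan": {"sys_a": "Michael_Jordan", "sys_b": "Jordan", "sys_c": "Jordan_River"},
    "Washington": {
        "sys_a": "Washington,_D.C.",
        "sys_b": "Washington,_D.C.",
        "sys_c": "George_Washington",
    },
}
labels = {mention: consensus_difficulty(links) for mention, links in mentions.items()}
print(labels)
```

Labels produced this way need no manual annotation, which is what would make them usable as distantly supervised training data for the difficulty classifier the abstract describes.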

Keywords

    Distant Supervision, Entity Linking, Named Entity Recognition and Disambiguation, Supervised Classification

Cite this

Same but Different: Distant Supervision for Predicting and Understanding Entity Linking Difficulty. / João, Renato Stoffalette; Fafalios, Pavlos; Dietze, Stefan.
SAC '19: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing. New York: Association for Computing Machinery (ACM), 2019. p. 1019-1026.

João, RS, Fafalios, P & Dietze, S 2019, Same but Different: Distant Supervision for Predicting and Understanding Entity Linking Difficulty. in SAC '19: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing. Association for Computing Machinery (ACM), New York, pp. 1019-1026, 34th Annual ACM Symposium on Applied Computing, SAC 2019, Limassol, Cyprus, 8 Apr 2019. https://doi.org/10.48550/arXiv.1812.10387, https://doi.org/10.1145/3297280.3297381
João, R. S., Fafalios, P., & Dietze, S. (2019). Same but Different: Distant Supervision for Predicting and Understanding Entity Linking Difficulty. In SAC '19: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing (pp. 1019-1026). Association for Computing Machinery (ACM). https://doi.org/10.48550/arXiv.1812.10387, https://doi.org/10.1145/3297280.3297381
João RS, Fafalios P, Dietze S. Same but Different: Distant Supervision for Predicting and Understanding Entity Linking Difficulty. In SAC '19: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing. New York: Association for Computing Machinery (ACM). 2019. p. 1019-1026 doi: 10.48550/arXiv.1812.10387, 10.1145/3297280.3297381
João, Renato Stoffalette ; Fafalios, Pavlos ; Dietze, Stefan. / Same but Different : Distant Supervision for Predicting and Understanding Entity Linking Difficulty. SAC '19: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing. New York : Association for Computing Machinery (ACM), 2019. pp. 1019-1026
@inproceedings{6cb2e71b6d9d4d289a3e23ba5314f52c,
title = "Same but Different: Distant Supervision for Predicting and Understanding Entity Linking Difficulty",
abstract = "Entity Linking (EL) is the task of automatically identifying entity mentions in a piece of text and resolving them to a corresponding entity in a reference knowledge base like Wikipedia. There is a large number of EL tools available for different types of documents and domains, yet EL remains a challenging task where the lack of precision on particularly ambiguous mentions often spoils the usefulness of automated disambiguation results in real applications. A priori approximations of the difficulty to link a particular entity mention can facilitate flagging of critical cases as part of semi-automated EL systems, while detecting latent factors that affect the EL performance, like corpus-specific features, can provide insights on how to improve a system based on the special characteristics of the underlying corpus. In this paper, we first introduce a consensus-based method to generate difficulty labels for entity mentions on arbitrary corpora. The difficulty labels are then exploited as training data for a supervised classification task able to predict the EL difficulty of entity mentions using a variety of features. Experiments over a corpus of news articles show that EL difficulty can be estimated with high accuracy, revealing also latent features that affect EL performance. Finally, evaluation results demonstrate the effectiveness of the proposed method to inform semi-automated EL pipelines.",
keywords = "Distant Supervision, Entity Linking, Named Entity Recognition and Disambiguation, Supervised Classification",
author = "Jo{\~a}o, {Renato Stoffalette} and Pavlos Fafalios and Stefan Dietze",
note = "Funding Information: This work was partially supported by CNPq (Brazilian National Council for Scientific and Technological Development) under grant GDE No. 203268/2014-8 and the European Commission for the ERC Advanced Grant ALEXANDRIA under grant No. 339233.; 34th Annual ACM Symposium on Applied Computing, SAC 2019 ; Conference date: 08-04-2019 Through 12-04-2019",
year = "2019",
month = apr,
day = "8",
doi = "10.48550/arXiv.1812.10387",
language = "English",
isbn = "978-1-4503-5933-7",
pages = "1019--1026",
booktitle = "SAC '19",
publisher = "Association for Computing Machinery (ACM)",
address = "New York",
}

TY - GEN
T1 - Same but Different
T2 - 34th Annual ACM Symposium on Applied Computing, SAC 2019
AU - João, Renato Stoffalette
AU - Fafalios, Pavlos
AU - Dietze, Stefan
N1 - Funding Information: This work was partially supported by CNPq (Brazilian National Council for Scientific and Technological Development) under grant GDE No. 203268/2014-8 and the European Commission for the ERC Advanced Grant ALEXANDRIA under grant No. 339233.
PY - 2019/4/8
Y1 - 2019/4/8
AB - Entity Linking (EL) is the task of automatically identifying entity mentions in a piece of text and resolving them to a corresponding entity in a reference knowledge base like Wikipedia. There is a large number of EL tools available for different types of documents and domains, yet EL remains a challenging task where the lack of precision on particularly ambiguous mentions often spoils the usefulness of automated disambiguation results in real applications. A priori approximations of the difficulty to link a particular entity mention can facilitate flagging of critical cases as part of semi-automated EL systems, while detecting latent factors that affect the EL performance, like corpus-specific features, can provide insights on how to improve a system based on the special characteristics of the underlying corpus. In this paper, we first introduce a consensus-based method to generate difficulty labels for entity mentions on arbitrary corpora. The difficulty labels are then exploited as training data for a supervised classification task able to predict the EL difficulty of entity mentions using a variety of features. Experiments over a corpus of news articles show that EL difficulty can be estimated with high accuracy, revealing also latent features that affect EL performance. Finally, evaluation results demonstrate the effectiveness of the proposed method to inform semi-automated EL pipelines.

KW - Distant Supervision
KW - Entity Linking
KW - Named Entity Recognition and Disambiguation
KW - Supervised Classification
UR - http://www.scopus.com/inward/record.url?scp=85065658346&partnerID=8YFLogxK
U2 - 10.48550/arXiv.1812.10387
DO - 10.48550/arXiv.1812.10387
M3 - Conference contribution
AN - SCOPUS:85065658346
SN - 978-1-4503-5933-7
SP - 1019
EP - 1026
BT - SAC '19
PB - Association for Computing Machinery (ACM)
CY - New York
Y2 - 8 April 2019 through 12 April 2019
ER -