Same but Different: Distant Supervision for Predicting and Understanding Entity Linking Difficulty

Renato Stoffalette João; Pavlos Fafalios; Stefan Dietze

doi:10.48550/arXiv.1812.10387

Details

Original language	English
Title of host publication	SAC '19
Subtitle of host publication	Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing
Place of Publication	New York
Publisher	Association for Computing Machinery (ACM)
Pages	1019-1026
Number of pages	8
ISBN (print)	978-1-4503-5933-7
Publication status	Published - 8 Apr 2019
Event	34th Annual ACM Symposium on Applied Computing, SAC 2019 - Limassol, Cyprus
Duration	8 Apr 2019 → 12 Apr 2019

Abstract

Entity Linking (EL) is the task of automatically identifying entity mentions in a piece of text and resolving them to a corresponding entity in a reference knowledge base like Wikipedia. There is a large number of EL tools available for different types of documents and domains, yet EL remains a challenging task where the lack of precision on particularly ambiguous mentions often spoils the usefulness of automated disambiguation results in real applications. A priori approximations of the difficulty to link a particular entity mention can facilitate flagging of critical cases as part of semi-automated EL systems, while detecting latent factors that affect the EL performance, like corpus-specific features, can provide insights on how to improve a system based on the special characteristics of the underlying corpus. In this paper, we first introduce a consensus-based method to generate difficulty labels for entity mentions on arbitrary corpora. The difficulty labels are then exploited as training data for a supervised classification task able to predict the EL difficulty of entity mentions using a variety of features. Experiments over a corpus of news articles show that EL difficulty can be estimated with high accuracy, revealing also latent features that affect EL performance. Finally, evaluation results demonstrate the effectiveness of the proposed method to inform semi-automated EL pipelines.
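
The consensus idea in the abstract can be illustrated with a minimal sketch. It assumes each mention has been processed by several EL tools and that a difficulty label is derived from how much their outputs agree. The three-level scheme (easy, medium, hard), the function name `consensus_label`, and the tool names are hypothetical choices for illustration, not the exact rules used in the paper.

```python
from typing import Dict, Optional


def consensus_label(links: Dict[str, Optional[str]]) -> str:
    """Derive a difficulty label for one mention from several EL tools.

    `links` maps a (hypothetical) tool name to the entity it linked the
    mention to, or None if the tool produced no link for this mention.
    The labelling rules below are an illustrative assumption.
    """
    entities = [e for e in links.values() if e is not None]

    # With fewer than two outputs there is nothing to compare.
    if len(entities) < 2:
        return "unknown"

    distinct = len(set(entities))

    # Every tool linked the mention, and all to the same entity.
    if distinct == 1 and len(entities) == len(links):
        return "easy"

    # Every tool that linked the mention chose a different entity.
    if distinct == len(entities):
        return "hard"

    # Partial agreement, or agreement with at least one tool abstaining.
    return "medium"


if __name__ == "__main__":
    # Hypothetical outputs of three tools for the mention "Paris".
    example = {"tool_a": "Paris", "tool_b": "Paris", "tool_c": "Paris_Hilton"}
    print(consensus_label(example))
```

Labels produced this way could then serve as training targets for a standard supervised classifier over mention-level features, which is the second step the abstract describes.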

Keywords

Distant Supervision, Entity Linking, Named Entity Recognition and Disambiguation, Supervised Classification

ASJC Scopus subject areas

Computer Science(all)
Software

Cite this

Same but Different: Distant Supervision for Predicting and Understanding Entity Linking Difficulty. / João, Renato Stoffalette; Fafalios, Pavlos; Dietze, Stefan.
SAC '19: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing. New York: Association for Computing Machinery (ACM), 2019. p. 1019-1026.

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

João, RS, Fafalios, P & Dietze, S 2019, Same but Different: Distant Supervision for Predicting and Understanding Entity Linking Difficulty. in SAC '19: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing. Association for Computing Machinery (ACM), New York, pp. 1019-1026, 34th Annual ACM Symposium on Applied Computing, SAC 2019, Limassol, Cyprus, 8 Apr 2019. https://doi.org/10.48550/arXiv.1812.10387, https://doi.org/10.1145/3297280.3297381

João, R. S., Fafalios, P., & Dietze, S. (2019). Same but Different: Distant Supervision for Predicting and Understanding Entity Linking Difficulty. In SAC '19: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing (pp. 1019-1026). Association for Computing Machinery (ACM). https://doi.org/10.48550/arXiv.1812.10387, https://doi.org/10.1145/3297280.3297381

João RS, Fafalios P, Dietze S. Same but Different: Distant Supervision for Predicting and Understanding Entity Linking Difficulty. In SAC '19: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing. New York: Association for Computing Machinery (ACM). 2019. p. 1019-1026 doi: 10.48550/arXiv.1812.10387, 10.1145/3297280.3297381

João, Renato Stoffalette ; Fafalios, Pavlos ; Dietze, Stefan. / Same but Different : Distant Supervision for Predicting and Understanding Entity Linking Difficulty. SAC '19: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing. New York : Association for Computing Machinery (ACM), 2019. pp. 1019-1026

@inproceedings{6cb2e71b6d9d4d289a3e23ba5314f52c,

title = "Same but Different: Distant Supervision for Predicting and Understanding Entity Linking Difficulty",

abstract = "Entity Linking (EL) is the task of automatically identifying entity mentions in a piece of text and resolving them to a corresponding entity in a reference knowledge base like Wikipedia. There is a large number of EL tools available for different types of documents and domains, yet EL remains a challenging task where the lack of precision on particularly ambiguous mentions often spoils the usefulness of automated disambiguation results in real applications. A priori approximations of the difficulty to link a particular entity mention can facilitate flagging of critical cases as part of semi-automated EL systems, while detecting latent factors that affect the EL performance, like corpus-specific features, can provide insights on how to improve a system based on the special characteristics of the underlying corpus. In this paper, we first introduce a consensus-based method to generate difficulty labels for entity mentions on arbitrary corpora. The difficulty labels are then exploited as training data for a supervised classification task able to predict the EL difficulty of entity mentions using a variety of features. Experiments over a corpus of news articles show that EL difficulty can be estimated with high accuracy, revealing also latent features that affect EL performance. Finally, evaluation results demonstrate the effectiveness of the proposed method to inform semi-automated EL pipelines.",

keywords = "Distant Supervision, Entity Linking, Named Entity Recognition and Disambiguation, Supervised Classification",

author = "Jo{\~a}o, {Renato Stoffalette} and Pavlos Fafalios and Stefan Dietze",

note = "Funding Information: This work was partially supported by CNPq (Brazilian National Council for Scientific and Technological Development) under grant GDE No. 203268/2014-8 and the European Commission for the ERC Advanced Grant ALEXANDRIA under grant No. 339233.; 34th Annual ACM Symposium on Applied Computing, SAC 2019 ; Conference date: 08-04-2019 Through 12-04-2019",

year = "2019",

month = apr,

day = "8",

doi = "10.48550/arXiv.1812.10387",

language = "English",

isbn = "978-1-4503-5933-7",

pages = "1019--1026",

booktitle = "SAC '19",

publisher = "Association for Computing Machinery (ACM)",

address = "New York",

}

TY - GEN

T1 - Same but Different: Distant Supervision for Predicting and Understanding Entity Linking Difficulty

T2 - 34th Annual ACM Symposium on Applied Computing, SAC 2019

AU - João, Renato Stoffalette

AU - Fafalios, Pavlos

AU - Dietze, Stefan

N1 - Funding Information: This work was partially supported by CNPq (Brazilian National Council for Scientific and Technological Development) under grant GDE No. 203268/2014-8 and the European Commission for the ERC Advanced Grant ALEXANDRIA under grant No. 339233.

PY - 2019/4/8

Y1 - 2019/4/8

AB - Entity Linking (EL) is the task of automatically identifying entity mentions in a piece of text and resolving them to a corresponding entity in a reference knowledge base like Wikipedia. There is a large number of EL tools available for different types of documents and domains, yet EL remains a challenging task where the lack of precision on particularly ambiguous mentions often spoils the usefulness of automated disambiguation results in real applications. A priori approximations of the difficulty to link a particular entity mention can facilitate flagging of critical cases as part of semi-automated EL systems, while detecting latent factors that affect the EL performance, like corpus-specific features, can provide insights on how to improve a system based on the special characteristics of the underlying corpus. In this paper, we first introduce a consensus-based method to generate difficulty labels for entity mentions on arbitrary corpora. The difficulty labels are then exploited as training data for a supervised classification task able to predict the EL difficulty of entity mentions using a variety of features. Experiments over a corpus of news articles show that EL difficulty can be estimated with high accuracy, revealing also latent features that affect EL performance. Finally, evaluation results demonstrate the effectiveness of the proposed method to inform semi-automated EL pipelines.

KW - Distant Supervision

KW - Entity Linking

KW - Named Entity Recognition and Disambiguation

KW - Supervised Classification

UR - http://www.scopus.com/inward/record.url?scp=85065658346&partnerID=8YFLogxK

U2 - 10.48550/arXiv.1812.10387

DO - 10.48550/arXiv.1812.10387

M3 - Conference contribution

AN - SCOPUS:85065658346

SN - 978-1-4503-5933-7

SP - 1019

EP - 1026

BT - SAC '19

PB - Association for Computing Machinery (ACM)

CY - New York

Y2 - 8 April 2019 through 12 April 2019

ER -
