Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia

Diego Alves; Gaurish Thakkar; Gabriel Amaral; Tin Kuculo; Marko Tadic

Details

Original language	English
Title of host publication	Conference on Digital Curation Technologies
Subtitle	Proceedings of the Conference on Digital Curation Technologies (Qurator 2021)
Number of pages	11
Publication status	Published - 2021
Event	2nd International Conference on Digital Curation Technologies, Qurator 2021 - Berlin, Germany. Duration: 8 Feb 2021 → 12 Feb 2021

Publication series

Name	CEUR Workshop Proceedings
Publisher	CEUR Workshop Proceedings
Volume	2836
ISSN (Print)	1613-0073

Abstract

With the ever-growing popularity of the field of NLP, the demand for datasets in low-resource languages follows suit. Following a previously established framework, in this paper we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.
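
The three-step procedure named in the abstract (extract, link, map) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration only: the wiki-link regular expression, the `DBPEDIA_CLASSES` lookup table, the `LABEL_MAP` ordering, and the placeholder labels (`PERSON`, `CITY`, `LOCATION`) are hypothetical toy data, not the authors' implementation and not the actual UNER hierarchy.

```python
import re

# Hypothetical toy data: link target -> set of DBpedia ontology classes.
# A real pipeline would query DBpedia instead of using a hard-coded table.
DBPEDIA_CLASSES = {
    "Berlin": {"dbo:City", "dbo:Place"},
    "Angela Merkel": {"dbo:Person", "dbo:Politician"},
}

# Hypothetical mapping from DBpedia classes to placeholder labels,
# ordered from most to least specific; the first match wins.
LABEL_MAP = [
    ("dbo:City", "CITY"),
    ("dbo:Person", "PERSON"),
    ("dbo:Place", "LOCATION"),
]

# Matches [[Target]] and [[Target|surface form]] wiki links.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def extract_entities(wikitext):
    """Step 1: collect (surface form, link target) pairs from wiki links."""
    return [(m.group(2) or m.group(1), m.group(1))
            for m in WIKILINK.finditer(wikitext)]


def link_to_dbpedia(target):
    """Step 2: look up the class set for a link target (empty if unknown)."""
    return DBPEDIA_CLASSES.get(target, set())


def map_to_label(classes):
    """Step 3: map a set of classes to a single placeholder label."""
    for dbpedia_class, label in LABEL_MAP:
        if dbpedia_class in classes:
            return label
    return None


def annotate(wikitext):
    """Run the three steps and keep only entities that received a label."""
    annotated = []
    for surface, target in extract_entities(wikitext):
        label = map_to_label(link_to_dbpedia(target))
        if label is not None:
            annotated.append((surface, label))
    return annotated


result = annotate("[[Angela Merkel|Merkel]] visited [[Berlin]] and [[Unknown Page]].")
print(result)
```

In this sketch, links whose target has no entry in the lookup table are simply dropped; the post-processing step mentioned in the abstract is not modeled here.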

ASJC Scopus subject areas

Computer Science (all)
General Computer Science

Cite this

Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia. / Alves, Diego; Thakkar, Gaurish; Amaral, Gabriel et al.
Conference on Digital Curation Technologies: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021). 2021. (CEUR Workshop Proceedings; Vol. 2836).

Publication: Chapter in book/report/conference proceedings › Conference contribution › Research › Peer-reviewed

Alves, D, Thakkar, G, Amaral, G, Kuculo, T & Tadic, M 2021, Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia. in Conference on Digital Curation Technologies: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021). CEUR Workshop Proceedings, vol. 2836, 2nd International Conference on Digital Curation Technologies, Qurator 2021, Berlin, Germany, 8 Feb. 2021. <https://ceur-ws.org/Vol-2836/qurator2021_paper_17.pdf>

Alves, D., Thakkar, G., Amaral, G., Kuculo, T., & Tadic, M. (2021). Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia. In Conference on Digital Curation Technologies: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021) (CEUR Workshop Proceedings; Vol. 2836). https://ceur-ws.org/Vol-2836/qurator2021_paper_17.pdf

Alves D, Thakkar G, Amaral G, Kuculo T, Tadic M. Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia. in Conference on Digital Curation Technologies: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021). 2021. (CEUR Workshop Proceedings).

Alves, Diego ; Thakkar, Gaurish ; Amaral, Gabriel et al. / Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia. Conference on Digital Curation Technologies: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021). 2021. (CEUR Workshop Proceedings).


@inproceedings{fd6106a968cd44d8a08f574c0d517322,

title = "Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia",

abstract = "With the ever-growing popularity of the field of NLP, the demand for datasets in low-resource languages follows suit. Following a previously established framework, in this paper we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.",

keywords = "Data extraction, Multilingualism, Named-entity",

author = "Diego Alves and Gaurish Thakkar and Gabriel Amaral and Tin Kuculo and Marko Tadic",

note = "Funding Information: The work presented in this paper has received funding from the European Union{\textquoteright}s Horizon 2020 research and innovation program under the Marie Sk{\l}odowska-Curie grant agreement no. 812997 and under the name CLEOPATRA (Cross-lingual Event-centric Open Analytics Research Academy).; 2nd International Conference on Digital Curation Technologies, Qurator 2021 ; Conference date: 08-02-2021 Through 12-02-2021",

year = "2021",

language = "English",

series = "CEUR Workshop Proceedings",

publisher = "CEUR Workshop Proceedings",

booktitle = "Conference on Digital Curation Technologies",

}


TY - GEN

T1 - Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia

AU - Alves, Diego

AU - Thakkar, Gaurish

AU - Amaral, Gabriel

AU - Kuculo, Tin

AU - Tadic, Marko

N1 - Funding Information: The work presented in this paper has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement no. 812997 and under the name CLEOPATRA (Cross-lingual Event-centric Open Analytics Research Academy).

PY - 2021

Y1 - 2021

N2 - With the ever-growing popularity of the field of NLP, the demand for datasets in low-resource languages follows suit. Following a previously established framework, in this paper we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.

AB - With the ever-growing popularity of the field of NLP, the demand for datasets in low-resource languages follows suit. Following a previously established framework, in this paper we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.

KW - Data extraction

KW - Multilingualism

KW - Named-entity

UR - http://www.scopus.com/inward/record.url?scp=85103263626&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85103263626

T3 - CEUR Workshop Proceedings

BT - Conference on Digital Curation Technologies

T2 - 2nd International Conference on Digital Curation Technologies, Qurator 2021

Y2 - 8 February 2021 through 12 February 2021

ER -
