Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia

Diego Alves; Gaurish Thakkar; Gabriel Amaral; Tin Kuculo; Marko Tadic

Details

Original language	English
Title of host publication	Conference on Digital Curation Technologies
Subtitle of host publication	Proceedings of the Conference on Digital Curation Technologies (Qurator 2021)
Number of pages	11
Publication status	Published - 2021
Event	2nd International Conference on Digital Curation Technologies, Qurator 2021 - Berlin, Germany
Duration	8 Feb 2021 → 12 Feb 2021

Publication series

Name	CEUR Workshop Proceedings
Publisher	CEUR Workshop Proceedings
Volume	2836
ISSN (Print)	1613-0073

Abstract

With the ever-growing popularity of the field of NLP, the demand for datasets in low-resourced languages follows suit. Following a previously established framework, in this paper we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.
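
The three-step procedure named in the abstract (extract, link, map) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the in-memory type table, the class-to-label table, the label names, and all function names below are hypothetical placeholders. A real pipeline would query DBpedia for entity types and would use the actual UNER tag set, and the post-processing step is omitted here.

```python
import re

# Hypothetical stand-in for DBpedia type lookups; a real pipeline
# would query DBpedia instead of using a hard-coded table.
TOY_DBPEDIA_TYPES = {
    "http://dbpedia.org/resource/Berlin": {"dbo:City", "dbo:Place"},
    "http://dbpedia.org/resource/Marie_Curie": {"dbo:Scientist", "dbo:Person"},
}

# Hypothetical class-to-label table, ordered from most to least specific.
# The labels are illustrative placeholders, not the actual UNER tags.
TOY_CLASS_TO_LABEL = [
    ("dbo:City", "LOCATION-CITY"),
    ("dbo:Place", "LOCATION"),
    ("dbo:Person", "PERSON"),
]

# Matches [[Target]] and [[Target|surface text]] wiki links.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def extract_links(wikitext):
    """Step 1: return (surface_text, target_title) for each wiki link."""
    return [(m.group(2) or m.group(1), m.group(1))
            for m in WIKILINK.finditer(wikitext)]


def to_dbpedia_uri(title):
    """Step 2: build a DBpedia resource URI from a Wikipedia title."""
    return "http://dbpedia.org/resource/" + title.strip().replace(" ", "_")


def to_label(classes):
    """Step 3: return the first label whose class is present, else None."""
    for cls, label in TOY_CLASS_TO_LABEL:
        if cls in classes:
            return label
    return None


def annotate(wikitext):
    """Run the three steps; links with no known type are skipped."""
    results = []
    for surface, title in extract_links(wikitext):
        uri = to_dbpedia_uri(title)
        label = to_label(TOY_DBPEDIA_TYPES.get(uri, set()))
        if label is not None:
            results.append((surface, uri, label))
    return results
```

Calling `annotate("[[Marie Curie]] visited [[Berlin|the German capital]].")` yields one tuple per linked mention, pairing the surface text with its resource URI and a placeholder label. Ordering the mapping table from specific to general is one simple way to resolve an entity that carries several DBpedia classes.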

Keywords

Data extraction, Multilingualism, Named-entity

ASJC Scopus subject areas

Computer Science(all)
General Computer Science

Cite this

Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia. / Alves, Diego; Thakkar, Gaurish; Amaral, Gabriel et al.
Conference on Digital Curation Technologies: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021). 2021. (CEUR Workshop Proceedings; Vol. 2836).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Alves, D, Thakkar, G, Amaral, G, Kuculo, T & Tadic, M 2021, Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia. in Conference on Digital Curation Technologies: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021). CEUR Workshop Proceedings, vol. 2836, 2nd International Conference on Digital Curation Technologies, Qurator 2021, Berlin, Germany, 8 Feb 2021. <https://ceur-ws.org/Vol-2836/qurator2021_paper_17.pdf>

Alves, D., Thakkar, G., Amaral, G., Kuculo, T., & Tadic, M. (2021). Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia. In Conference on Digital Curation Technologies: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021) (CEUR Workshop Proceedings; Vol. 2836). https://ceur-ws.org/Vol-2836/qurator2021_paper_17.pdf

Alves D, Thakkar G, Amaral G, Kuculo T, Tadic M. Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia. In Conference on Digital Curation Technologies: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021). 2021. (CEUR Workshop Proceedings).

Alves, Diego ; Thakkar, Gaurish ; Amaral, Gabriel et al. / Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia. Conference on Digital Curation Technologies: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021). 2021. (CEUR Workshop Proceedings).

@inproceedings{fd6106a968cd44d8a08f574c0d517322,

title = "Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia",

abstract = "With the ever-growing popularity of the field of NLP, the demand for datasets in low-resourced languages follows suit. Following a previously established framework, in this paper we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.",

keywords = "Data extraction, Multilingualism, Named-entity",

author = "Diego Alves and Gaurish Thakkar and Gabriel Amaral and Tin Kuculo and Marko Tadic",

note = "Funding Information: The work presented in this paper has received funding from the European Union{\textquoteright}s Horizon 2020 research and innovation program under the Marie Sk{\l}odowska-Curie grant agreement no. 812997 and under the name CLEOPATRA (Cross-lingual Event-centric Open Analytics Research Academy).; 2nd International Conference on Digital Curation Technologies, Qurator 2021 ; Conference date: 08-02-2021 Through 12-02-2021",

year = "2021",

language = "English",

series = "CEUR Workshop Proceedings",

publisher = "CEUR Workshop Proceedings",

booktitle = "Conference on Digital Curation Technologies",

}

TY - GEN

T1 - Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia

AU - Alves, Diego

AU - Thakkar, Gaurish

AU - Amaral, Gabriel

AU - Kuculo, Tin

AU - Tadic, Marko

N1 - Funding Information: The work presented in this paper has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement no. 812997 and under the name CLEOPATRA (Cross-lingual Event-centric Open Analytics Research Academy).

PY - 2021

Y1 - 2021

N2 - With the ever-growing popularity of the field of NLP, the demand for datasets in low-resourced languages follows suit. Following a previously established framework, in this paper we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.

AB - With the ever-growing popularity of the field of NLP, the demand for datasets in low-resourced languages follows suit. Following a previously established framework, in this paper we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.

KW - Data extraction

KW - Multilingualism

KW - Named-entity

UR - http://www.scopus.com/inward/record.url?scp=85103263626&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85103263626

T3 - CEUR Workshop Proceedings

BT - Conference on Digital Curation Technologies

T2 - 2nd International Conference on Digital Curation Technologies, Qurator 2021

Y2 - 8 February 2021 through 12 February 2021

ER -
