Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia

Publication: Contribution to book/report/anthology/conference proceedings › Conference paper › Research › Peer-reviewed

Authors

  • Diego Alves
  • Gaurish Thakkar
  • Gabriel Amaral
  • Tin Kuculo
  • Marko Tadic

Organisational units

External organisations

  • University of Zagreb
  • King's College London

Details

Original language: English
Title of host publication: Conference on Digital Curation Technologies
Subtitle: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021)
Number of pages: 11
Publication status: Published - 2021
Event: 2nd International Conference on Digital Curation Technologies, Qurator 2021 - Berlin, Germany
Duration: 8 Feb 2021 to 12 Feb 2021

Publication series

Name: CEUR Workshop Proceedings
Publisher: CEUR Workshop Proceedings
Volume: 2836
ISSN (Print): 1613-0073

Abstract

With the ever-growing popularity of the field of NLP, the demand for datasets in low-resource languages follows suit. Following a previously established framework, in this paper we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.
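
The three-step procedure named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the link-markup regular expression, the in-memory `TOY_DBPEDIA` lookup, the `TOY_LABEL_RULES` table, and the label names are all hypothetical stand-ins chosen for illustration, and the post-processing step is not covered.

```python
import re

# Wikipedia-style link markup: [[Target]] or [[Target|anchor text]].
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

# Hypothetical stand-in for a DBpedia lookup (page title -> set of classes).
TOY_DBPEDIA = {
    "Zagreb": {"dbo:Place", "dbo:City"},
    "Marie Curie": {"dbo:Person", "dbo:Scientist"},
}

# Hypothetical class-to-label rules, checked in order (most specific first).
# These labels are placeholders, not the real UNER hierarchy.
TOY_LABEL_RULES = [
    ("dbo:City", "LOCATION/CITY"),
    ("dbo:Place", "LOCATION"),
    ("dbo:Person", "PERSON"),
]


def extract_links(wikitext):
    """Step 1: return (surface text, target title) for every link."""
    pairs = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        surface = (match.group(2) or target).strip()
        pairs.append((surface, target))
    return pairs


def map_classes(classes):
    """Step 3: map a set of classes to the first matching label, else None."""
    for cls, label in TOY_LABEL_RULES:
        if cls in classes:
            return label
    return None


def annotate(wikitext):
    """Run extraction, lookup (step 2), and label mapping on one text."""
    annotated = []
    for surface, target in extract_links(wikitext):
        classes = TOY_DBPEDIA.get(target, set())  # step 2: link to the toy lookup
        annotated.append((surface, map_classes(classes)))
    return annotated


if __name__ == "__main__":
    print(annotate("[[Marie Curie|Curie]] visited [[Zagreb]] and [[Atlantis]]."))
```

Entities whose target is missing from the lookup receive no label in this sketch; how the paper handles such cases is not stated in this record.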

Cite this

Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia. / Alves, Diego; Thakkar, Gaurish; Amaral, Gabriel et al.
Conference on Digital Curation Technologies: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021). 2021. (CEUR Workshop Proceedings; vol. 2836).

Alves, D, Thakkar, G, Amaral, G, Kuculo, T & Tadic, M 2021, Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia. in Conference on Digital Curation Technologies: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021). CEUR Workshop Proceedings, vol. 2836, 2nd International Conference on Digital Curation Technologies, Qurator 2021, Berlin, Germany, 8 Feb. 2021. <https://ceur-ws.org/Vol-2836/qurator2021_paper_17.pdf>
Alves, D., Thakkar, G., Amaral, G., Kuculo, T., & Tadic, M. (2021). Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia. In Conference on Digital Curation Technologies: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021) (CEUR Workshop Proceedings; vol. 2836). https://ceur-ws.org/Vol-2836/qurator2021_paper_17.pdf
Alves D, Thakkar G, Amaral G, Kuculo T, Tadic M. Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia. in Conference on Digital Curation Technologies: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021). 2021. (CEUR Workshop Proceedings).
Alves, Diego ; Thakkar, Gaurish ; Amaral, Gabriel et al. / Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia. Conference on Digital Curation Technologies: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021). 2021. (CEUR Workshop Proceedings).
@inproceedings{fd6106a968cd44d8a08f574c0d517322,
title = "Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia",
abstract = "With the ever-growing popularity of the field of NLP, the demand for datasets in low-resource languages follows suit. Following a previously established framework, in this paper we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.",
keywords = "Data extraction, Multilingualism, Named-entity",
author = "Diego Alves and Gaurish Thakkar and Gabriel Amaral and Tin Kuculo and Marko Tadic",
note = "Funding Information: The work presented in this paper has received funding from the European Union{\textquoteright}s Horizon 2020 research and innovation program under the Marie Sk{\l}odowska-Curie grant agreement no. 812997 and under the name CLEOPATRA (Cross-lingual Event-centric Open Analytics Research Academy).; 2nd International Conference on Digital Curation Technologies, Qurator 2021 ; Conference date: 08-02-2021 Through 12-02-2021",
year = "2021",
language = "English",
series = "CEUR Workshop Proceedings",
publisher = "CEUR Workshop Proceedings",
booktitle = "Conference on Digital Curation Technologies",

}

TY - GEN

T1 - Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia

AU - Alves, Diego

AU - Thakkar, Gaurish

AU - Amaral, Gabriel

AU - Kuculo, Tin

AU - Tadic, Marko

N1 - Funding Information: The work presented in this paper has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement no. 812997 and under the name CLEOPATRA (Cross-lingual Event-centric Open Analytics Research Academy).

PY - 2021

Y1 - 2021

AB - With the ever-growing popularity of the field of NLP, the demand for datasets in low-resource languages follows suit. Following a previously established framework, in this paper we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.

KW - Data extraction

KW - Multilingualism

KW - Named-entity

UR - http://www.scopus.com/inward/record.url?scp=85103263626&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85103263626

T3 - CEUR Workshop Proceedings

BT - Conference on Digital Curation Technologies

T2 - 2nd International Conference on Digital Curation Technologies, Qurator 2021

Y2 - 8 February 2021 through 12 February 2021

ER -