Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Authors

  • Diego Alves
  • Gaurish Thakkar
  • Gabriel Amaral
  • Tin Kuculo
  • Marko Tadic

External Research Organisations

  • University of Zagreb
  • King's College London

Details

Original language: English
Title of host publication: Conference on Digital Curation Technologies
Subtitle of host publication: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021)
Number of pages: 11
Publication status: Published - 2021
Event: 2nd International Conference on Digital Curation Technologies, Qurator 2021 - Berlin, Germany
Duration: 8 Feb 2021 to 12 Feb 2021

Publication series

Name: CEUR Workshop Proceedings
Publisher: CEUR Workshop Proceedings
Volume: 2836
ISSN (Print): 1613-0073

Abstract

With the ever-growing popularity of the field of NLP, the demand for datasets in low-resourced languages follows suit. Following a previously established framework, in this paper we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.
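The three-step procedure named in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the in-memory `TOY_DBPEDIA_CLASSES` lookup (standing in for a real DBpedia query), and the label strings in `CLASS_TO_LABEL` are all hypothetical, and the labels are not the actual UNER tag set. The post-processing step is not covered.

```python
import re
from typing import Dict, List, Optional, Set, Tuple

# Wikitext links look like [[Target]] or [[Target|surface text]];
# an optional "#section" part after the target is skipped.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]+))?\]\]")

# ASSUMPTION: toy stand-in for DBpedia class data; a real pipeline
# would obtain the classes of each resource from DBpedia itself.
TOY_DBPEDIA_CLASSES: Dict[str, Set[str]] = {
    "http://dbpedia.org/resource/Zagreb": {"dbo:Place", "dbo:City"},
    "http://dbpedia.org/resource/Marie_Curie": {"dbo:Person", "dbo:Scientist"},
}

# ASSUMPTION: hypothetical class-to-label rules, most specific first.
# These strings are placeholders, not the real UNER labels.
CLASS_TO_LABEL: List[Tuple[str, str]] = [
    ("dbo:City", "Location/City"),
    ("dbo:Place", "Location"),
    ("dbo:Person", "Person"),
]


def extract_entities(wikitext: str) -> List[Tuple[str, str]]:
    """Step 1: collect (surface form, article title) pairs from wiki links."""
    pairs = []
    for match in WIKILINK.finditer(wikitext):
        title = match.group(1).strip()
        surface = (match.group(2) or title).strip()
        pairs.append((surface, title))
    return pairs


def to_dbpedia_uri(title: str) -> str:
    """Step 2: derive the DBpedia resource URI from the article title."""
    return "http://dbpedia.org/resource/" + title.replace(" ", "_")


def map_to_label(classes: Set[str]) -> Optional[str]:
    """Step 3: map a set of DBpedia classes to one label, or None."""
    for dbpedia_class, label in CLASS_TO_LABEL:
        if dbpedia_class in classes:
            return label
    return None


def annotate(wikitext: str) -> List[Tuple[str, str, Optional[str]]]:
    """Run the three steps and return (surface, URI, label) triples."""
    results = []
    for surface, title in extract_entities(wikitext):
        uri = to_dbpedia_uri(title)
        classes = TOY_DBPEDIA_CLASSES.get(uri, set())
        results.append((surface, uri, map_to_label(classes)))
    return results


if __name__ == "__main__":
    sample = "[[Marie Curie|Curie]] visited [[Zagreb]] and [[Atlantis]]."
    for triple in annotate(sample):
        print(triple)
```

Entities whose classes match no rule come back with a `None` label in this sketch, so they can be kept for later inspection or dropped.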

Keywords

    Data extraction, Multilingualism, Named-entity

Cite this

Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia. / Alves, Diego; Thakkar, Gaurish; Amaral, Gabriel et al.
Conference on Digital Curation Technologies: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021). 2021. (CEUR Workshop Proceedings; Vol. 2836).

Alves, D, Thakkar, G, Amaral, G, Kuculo, T & Tadic, M 2021, Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia. in Conference on Digital Curation Technologies: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021). CEUR Workshop Proceedings, vol. 2836, 2nd International Conference on Digital Curation Technologies, Qurator 2021, Berlin, Germany, 8 Feb 2021. <https://ceur-ws.org/Vol-2836/qurator2021_paper_17.pdf>
Alves, D., Thakkar, G., Amaral, G., Kuculo, T., & Tadic, M. (2021). Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia. In Conference on Digital Curation Technologies: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021) (CEUR Workshop Proceedings; Vol. 2836). https://ceur-ws.org/Vol-2836/qurator2021_paper_17.pdf
Alves D, Thakkar G, Amaral G, Kuculo T, Tadic M. Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia. In Conference on Digital Curation Technologies: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021). 2021. (CEUR Workshop Proceedings).
Alves, Diego ; Thakkar, Gaurish ; Amaral, Gabriel et al. / Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia. Conference on Digital Curation Technologies: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021). 2021. (CEUR Workshop Proceedings).
@inproceedings{fd6106a968cd44d8a08f574c0d517322,
title = "Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia",
abstract = "With the ever-growing popularity of the field of NLP, the demand for datasets in low-resourced languages follows suit. Following a previously established framework, in this paper we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.",
keywords = "Data extraction, Multilingualism, Named-entity",
author = "Diego Alves and Gaurish Thakkar and Gabriel Amaral and Tin Kuculo and Marko Tadic",
note = "Funding Information: The work presented in this paper has received funding from the European Union{\textquoteright}s Horizon 2020 research and innovation program under the Marie Sk{\l}odowska-Curie grant agreement no. 812997 and under the name CLEOPATRA (Cross-lingual Event-centric Open Analytics Research Academy).; 2nd International Conference on Digital Curation Technologies, Qurator 2021 ; Conference date: 08-02-2021 Through 12-02-2021",
year = "2021",
language = "English",
series = "CEUR Workshop Proceedings",
publisher = "CEUR Workshop Proceedings",
booktitle = "Conference on Digital Curation Technologies",

}

TY - GEN

T1 - Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia

AU - Alves, Diego

AU - Thakkar, Gaurish

AU - Amaral, Gabriel

AU - Kuculo, Tin

AU - Tadic, Marko

N1 - Funding Information: The work presented in this paper has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement no. 812997 and under the name CLEOPATRA (Cross-lingual Event-centric Open Analytics Research Academy).

PY - 2021

Y1 - 2021

AB - With the ever-growing popularity of the field of NLP, the demand for datasets in low-resourced languages follows suit. Following a previously established framework, in this paper we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.

KW - Data extraction

KW - Multilingualism

KW - Named-entity

UR - http://www.scopus.com/inward/record.url?scp=85103263626&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85103263626

T3 - CEUR Workshop Proceedings

BT - Conference on Digital Curation Technologies

T2 - 2nd International Conference on Digital Curation Technologies, Qurator 2021

Y2 - 8 February 2021 through 12 February 2021

ER -