Details
Original language | English |
---|---|
Title of host publication | Conference on Digital Curation Technologies |
Subtitle of host publication | Proceedings of the Conference on Digital Curation Technologies (Qurator 2021) |
Number of pages | 11 |
Publication status | Published - 2021 |
Event | 2nd International Conference on Digital Curation Technologies, Qurator 2021 - Berlin, Germany; Duration: 8 Feb 2021 → 12 Feb 2021 |
Publication series
Name | CEUR Workshop Proceedings |
---|---|
Publisher | CEUR Workshop Proceedings |
Volume | 2836 |
ISSN (Print) | 1613-0073 |
Abstract
With the ever-growing popularity of the field of NLP, the demand for datasets in low-resourced languages follows suit. Following a previously established framework, in this paper we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.
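The three-step procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the wiki-link regular expression, the in-memory `TOY_CLASSES` table (standing in for a real DBpedia lookup), the simplified URI construction, and the placeholder labels in `CLASS_TO_LABEL` (which are not the actual UNER labels) are all assumptions made for illustration.

```python
import re

# Step 1 (assumption): entities are taken from wiki links of the form
# [[Target title|surface form]] or [[Target title]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|([^\]]+))?\]\]")


def extract_links(wikitext):
    """Return (surface form, target title) pairs for each wiki link."""
    pairs = []
    for match in WIKILINK.finditer(wikitext):
        title = match.group(1).strip()
        surface = (match.group(2) or title).strip()
        pairs.append((surface, title))
    return pairs


def dbpedia_uri(title):
    """Step 2: build an English DBpedia resource URI from a Wikipedia title.

    Simplified: real titles may also need percent-encoding.
    """
    return "http://dbpedia.org/resource/" + title.replace(" ", "_")


# Hypothetical stand-in for querying DBpedia for the classes of a resource.
TOY_CLASSES = {
    "http://dbpedia.org/resource/Berlin": {"dbo:City", "dbo:Place"},
    "http://dbpedia.org/resource/Angela_Merkel": {"dbo:Person", "dbo:Politician"},
}

# Step 3: placeholder class-to-label table, most specific class first.
# These labels are illustrative only, not the UNER hierarchy.
CLASS_TO_LABEL = [
    ("dbo:Person", "PERSON"),
    ("dbo:City", "CITY"),
    ("dbo:Place", "LOCATION"),
]


def map_label(classes):
    """Return the first label whose DBpedia class is in the set, else None."""
    for dbpedia_class, label in CLASS_TO_LABEL:
        if dbpedia_class in classes:
            return label
    return None


def annotate(wikitext):
    """Run the three steps and return (surface form, label) pairs."""
    return [
        (surface, map_label(TOY_CLASSES.get(dbpedia_uri(title), set())))
        for surface, title in extract_links(wikitext)
    ]


if __name__ == "__main__":
    print(annotate("[[Angela Merkel|Merkel]] visited [[Berlin]]."))
```

Links whose target has no entry in the lookup table receive the label `None`; entities missed in this way are the kind of gap that a later post-processing pass would try to recover.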
Keywords
- Data extraction, Multilingualism, Named-entity
ASJC Scopus subject areas
- Computer Science(all)
- General Computer Science
Cite this
Conference on Digital Curation Technologies: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021). 2021. (CEUR Workshop Proceedings; Vol. 2836).
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia
AU - Alves, Diego
AU - Thakkar, Gaurish
AU - Amaral, Gabriel
AU - Kuculo, Tin
AU - Tadic, Marko
N1 - Funding Information: The work presented in this paper has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement no. 812997 and under the name CLEOPATRA (Cross-lingual Event-centric Open Analytics Research Academy).
PY - 2021
Y1 - 2021
KW - Data extraction
KW - Multilingualism
KW - Named-entity
UR - http://www.scopus.com/inward/record.url?scp=85103263626&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85103263626
T3 - CEUR Workshop Proceedings
BT - Conference on Digital Curation Technologies
T2 - 2nd International Conference on Digital Curation Technologies, Qurator 2021
Y2 - 8 February 2021 through 12 February 2021
ER -