Details
Original language | English |
---|---|
Title of host publication | Conference on Digital Curation Technologies |
Subtitle | Proceedings of the Conference on Digital Curation Technologies (Qurator 2021) |
Number of pages | 11 |
Publication status | Published - 2021 |
Event | 2nd International Conference on Digital Curation Technologies, Qurator 2021 - Berlin, Germany. Duration: 8 Feb 2021 → 12 Feb 2021 |
Publication series
Name | CEUR Workshop Proceedings |
---|---|
Publisher | CEUR Workshop Proceedings |
Volume | 2836 |
ISSN (Print) | 1613-0073 |
Abstract
With the ever-growing popularity of the field of NLP, the demand for datasets in low-resourced languages follows suit. Following a previously established framework, in this paper we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.
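The three-step procedure named in the abstract can be illustrated with a small, self-contained Python sketch. It is not the authors' implementation: the in-memory `DBPEDIA_CLASSES` table stands in for a real DBpedia lookup, the `CLASS_TO_LABEL` table and its label names are hypothetical placeholders rather than the actual UNER tag set, and all function names are assumptions made for illustration. The post-processing step is not covered.

```python
import re

# Matches wikitext links such as [[Berlin]] or [[Berlin|the German capital]].
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

# Hypothetical stand-in for a DBpedia lookup: article title -> ontology classes.
DBPEDIA_CLASSES = {
    "Berlin": {"dbo:Place", "dbo:City"},
    "Marie Curie": {"dbo:Person", "dbo:Scientist"},
}

# Hypothetical class-to-label table, ordered from most to least specific.
# The label names are placeholders, not the real UNER hierarchy.
CLASS_TO_LABEL = [
    ("dbo:City", "LOCATION/CITY"),
    ("dbo:Place", "LOCATION"),
    ("dbo:Person", "PERSON"),
]


def extract_entities(wikitext):
    """Step 1: return (surface form, target title) pairs for each wikilink."""
    pairs = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        surface = (match.group(2) or target).strip()
        pairs.append((surface, target))
    return pairs


def link_to_classes(title):
    """Step 2: look up the classes for a title; empty set if unknown."""
    return DBPEDIA_CLASSES.get(title, set())


def map_to_label(classes):
    """Step 3: return the first matching label in table order, or None."""
    for cls, label in CLASS_TO_LABEL:
        if cls in classes:
            return label
    return None


def annotate(wikitext):
    """Run the three steps and keep only entities that received a label."""
    annotated = []
    for surface, target in extract_entities(wikitext):
        label = map_to_label(link_to_classes(target))
        if label is not None:
            annotated.append((surface, label))
    return annotated
```

Calling `annotate("[[Marie Curie]] visited [[Berlin|the German capital]].")` pairs each linked surface form with a placeholder label. Links whose targets are missing from the lookup table are dropped, which is the kind of gap a post-processing pass would aim to reduce.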
ASJC Scopus subject areas
- Computer Science (all)
- General Computer Science
Cite this
Conference on Digital Curation Technologies: Proceedings of the Conference on Digital Curation Technologies (Qurator 2021). 2021. (CEUR Workshop Proceedings; Vol. 2836).
Publication: Contribution to book/report/anthology/conference proceedings › Conference contribution › Research › Peer-reviewed
TY - GEN
T1 - Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia
AU - Alves, Diego
AU - Thakkar, Gaurish
AU - Amaral, Gabriel
AU - Kuculo, Tin
AU - Tadic, Marko
N1 - Funding Information: The work presented in this paper has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement no. 812997 and under the name CLEOPATRA (Cross-lingual Event-centric Open Analytics Research Academy).
PY - 2021
Y1 - 2021
AB - With the ever-growing popularity of the field of NLP, the demand for datasets in low-resourced languages follows suit. Following a previously established framework, in this paper we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.
KW - Data extraction
KW - Multilingualism
KW - Named-entity
UR - http://www.scopus.com/inward/record.url?scp=85103263626&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85103263626
T3 - CEUR Workshop Proceedings
BT - Conference on Digital Curation Technologies
T2 - 2nd International Conference on Digital Curation Technologies, Qurator 2021
Y2 - 8 February 2021 through 12 February 2021
ER -