TIGQA: An Expert-Annotated Question-Answering Dataset in Tigrinya

Publication: Contribution to book/report/anthology/conference proceedings, Conference paper, Research, Peer-reviewed

Authors

  • Hailay Kidu Teklehaymanot
  • Dren Fazlija
  • Niloy Ganguly
  • Gourab K. Patro
  • Wolfgang Nejdl

External organisations

  • Indian Institute of Technology Kharagpur (IITKGP)

Details

Original language: English
Title of host publication: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Subtitle: LREC-COLING 2024
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Pages: 16142-16161
Number of pages: 20
ISBN (electronic): 9782493814104
Publication status: Published - 2024
Event: Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024 - Hybrid, Torino, Italy
Duration: 20 May 2024 to 25 May 2024

Abstract

The absence of accessible annotated datasets tailored explicitly to educational purposes is a notable obstacle for NLP tasks in low-resource languages. This study first explores the feasibility of using machine translation (MT) to convert an existing dataset into a Tigrinya dataset in SQuAD format. As a result, we present TIGQA, an expert-annotated dataset containing 2,685 question-answer pairs covering 122 diverse topics such as climate, water, and traffic. These pairs are drawn from 537 context paragraphs in publicly accessible Tigrinya and Biology books. Through comprehensive analyses, we demonstrate that the TIGQA dataset requires skills beyond simple word matching, namely both single-sentence and multiple-sentence inference. We conduct experiments using state-of-the-art machine reading comprehension (MRC) methods, marking the first exploration of such models on TIGQA. Additionally, we estimate human performance on the dataset and compare it with the results obtained from pre-trained models. The notable gap between human performance and the best model performance underscores the potential for future improvements on TIGQA through continued research. Our dataset is freely accessible via the provided link to encourage the research community to address the challenges of Tigrinya MRC.
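For readers unfamiliar with the SQuAD format mentioned in the abstract, the following is a minimal sketch of how such a file is typically laid out and traversed. It assumes the public SQuAD v1.1 JSON layout (`data`, `paragraphs`, `context`, `qas`, `answers`, `answer_start`); the abstract does not say which SQuAD version or exact field set TIGQA uses, so that is an assumption. The sample record, its English placeholder text, and the helper names `count_pairs` and `spans_are_consistent` are hypothetical and not taken from the dataset.

```python
from typing import Any

# Hypothetical record in the public SQuAD v1.1 layout. The text is an English
# placeholder for illustration only; it is NOT taken from TIGQA.
sample: dict[str, Any] = {
    "version": "1.1",
    "data": [
        {
            "title": "water",
            "paragraphs": [
                {
                    "context": "Water boils at 100 degrees Celsius at sea level.",
                    "qas": [
                        {
                            "id": "demo-0001",
                            "question": "At what temperature does water boil at sea level?",
                            "answers": [
                                {"text": "100 degrees Celsius", "answer_start": 15}
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}


def count_pairs(dataset: dict[str, Any]) -> tuple[int, int]:
    """Return (number of context paragraphs, number of question-answer pairs)."""
    paragraphs = 0
    questions = 0
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            paragraphs += 1
            questions += len(paragraph["qas"])
    return paragraphs, questions


def spans_are_consistent(dataset: dict[str, Any]) -> bool:
    """Check that every answer text occurs in its context at answer_start."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    if context[start:end] != answer["text"]:
                        return False
    return True


if __name__ == "__main__":
    print(count_pairs(sample))
    print(spans_are_consistent(sample))
    # Figures reported in the abstract: 2,685 pairs over 537 paragraphs.
    print(2685 / 537)
```

Applied to the figures in the abstract, the same count gives 2,685 / 537 = 5 question-answer pairs per context paragraph on average.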

Cite this

TIGQA: An Expert-Annotated Question-Answering Dataset in Tigrinya. / Teklehaymanot, Hailay Kidu; Fazlija, Dren; Ganguly, Niloy et al.
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation : LREC-COLING 2024. Eds. / Nicoletta Calzolari; Min-Yen Kan; Veronique Hoste; Alessandro Lenci; Sakriani Sakti; Nianwen Xue. 2024. pp. 16142-16161.

Teklehaymanot, HK, Fazlija, D, Ganguly, N, Patro, GK & Nejdl, W 2024, TIGQA: An Expert-Annotated Question-Answering Dataset in Tigrinya. in N Calzolari, M-Y Kan, V Hoste, A Lenci, S Sakti & N Xue (eds), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation : LREC-COLING 2024. pp. 16142-16161, Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, 20 May 2024. <https://aclanthology.org/2024.lrec-main.1404/>
Teklehaymanot, H. K., Fazlija, D., Ganguly, N., Patro, G. K., & Nejdl, W. (2024). TIGQA: An Expert-Annotated Question-Answering Dataset in Tigrinya. In N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, & N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation : LREC-COLING 2024 (pp. 16142-16161) https://aclanthology.org/2024.lrec-main.1404/
Teklehaymanot HK, Fazlija D, Ganguly N, Patro GK, Nejdl W. TIGQA: An Expert-Annotated Question-Answering Dataset in Tigrinya. in Calzolari N, Kan MY, Hoste V, Lenci A, Sakti S, Xue N, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation : LREC-COLING 2024. 2024. p. 16142-16161
Teklehaymanot, Hailay Kidu ; Fazlija, Dren ; Ganguly, Niloy et al. / TIGQA : An Expert-Annotated Question-Answering Dataset in Tigrinya. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation : LREC-COLING 2024. Eds. / Nicoletta Calzolari ; Min-Yen Kan ; Veronique Hoste ; Alessandro Lenci ; Sakriani Sakti ; Nianwen Xue. 2024. pp. 16142-16161
BibTeX
@inproceedings{32c465bea8bc42cebac647c6472db91e,
title = "TIGQA: An Expert-Annotated Question-Answering Dataset in Tigrinya",
keywords = "domain specific QA, Low resource QA dataset, Tigrinya QA dataset",
author = "Teklehaymanot, {Hailay Kidu} and Dren Fazlija and Niloy Ganguly and Patro, {Gourab K.} and Wolfgang Nejdl",
note = "Publisher Copyright: {\textcopyright} 2024 ELRA Language Resource Association: CC BY-NC 4.0.; Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024 ; Conference date: 20-05-2024 Through 25-05-2024",
year = "2024",
language = "English",
pages = "16142--16161",
editor = "Nicoletta Calzolari and Min-Yen Kan and Veronique Hoste and Alessandro Lenci and Sakriani Sakti and Nianwen Xue",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation",

}

RIS

TY - GEN

T1 - TIGQA

T2 - Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024

AU - Teklehaymanot, Hailay Kidu

AU - Fazlija, Dren

AU - Ganguly, Niloy

AU - Patro, Gourab K.

AU - Nejdl, Wolfgang

N1 - Publisher Copyright: © 2024 ELRA Language Resource Association: CC BY-NC 4.0.

PY - 2024

Y1 - 2024

KW - domain specific QA

KW - Low resource QA dataset

KW - Tigrinya QA dataset

UR - http://www.scopus.com/inward/record.url?scp=85195903895&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85195903895

SP - 16142

EP - 16161

BT - Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation

A2 - Calzolari, Nicoletta

A2 - Kan, Min-Yen

A2 - Hoste, Veronique

A2 - Lenci, Alessandro

A2 - Sakti, Sakriani

A2 - Xue, Nianwen

Y2 - 20 May 2024 through 25 May 2024

ER -
