Details
| Original language | English |
| --- | --- |
| Title of host publication | Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation |
| Subtitle of host publication | LREC-COLING 2024 |
| Editors | Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue |
| Pages | 16142-16161 |
| Number of pages | 20 |
| ISBN (electronic) | 9782493814104 |
| Publication status | Published - 2024 |
| Event | Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024 - Hybrid, Torino, Italy. Duration: 20 May 2024 → 25 May 2024 |
Abstract
The absence of explicitly tailored, accessible annotated datasets for educational purposes presents a notable obstacle for NLP tasks in languages with limited resources. This study first explores the feasibility of using machine translation (MT) to convert an existing dataset into a Tigrinya dataset in SQuAD format. As a result, we present TIGQA, an expert-annotated dataset containing 2,685 question-answer pairs covering 122 diverse topics such as climate, water, and traffic. These pairs are drawn from 537 context paragraphs in publicly accessible Tigrinya and Biology books. Through comprehensive analyses, we demonstrate that the TIGQA dataset demands skills beyond simple word matching, requiring both single-sentence and multiple-sentence inference abilities. We conduct experiments using state-of-the-art machine reading comprehension (MRC) methods, marking the first exploration of such models on TIGQA. Additionally, we estimate human performance on the dataset and compare it with the results obtained from pre-trained models. The notable gap between human performance and the best model performance underscores the potential for future improvements to TIGQA through continued research. Our dataset is freely accessible via the provided link to encourage the research community to address the challenges of Tigrinya MRC.
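Since the abstract describes TIGQA as a dataset "in SQuAD format", the minimal sketch below shows what a single record in that format typically looks like. The field names (`title`, `paragraphs`, `context`, `qas`, `question`, `answers`, `text`, `answer_start`) follow the public SQuAD v1.1 schema; the topic, the ID, and all placeholder strings are purely illustrative assumptions, not content taken from TIGQA.

```python
import json

# Minimal sketch of one SQuAD-v1.1-style record, as referenced in the abstract.
# Field names follow the public SQuAD schema; the ID and placeholder strings
# are hypothetical and are not taken from the actual TIGQA dataset.
tigqa_like_record = {
    "title": "Climate",  # one of the 122 topics (e.g. climate, water, traffic)
    "paragraphs": [
        {
            "context": "<Tigrinya paragraph from one of the source books>",
            "qas": [
                {
                    "id": "tigqa-0001",  # hypothetical identifier
                    "question": "<Tigrinya question about the paragraph>",
                    "answers": [
                        {
                            "text": "<answer span copied from the context>",
                            "answer_start": 42,  # character offset of the span in the context
                        }
                    ],
                }
            ],
        }
    ],
}

# A full SQuAD-format file wraps such records in {"version": ..., "data": [...]}.
print(json.dumps(tigqa_like_record, ensure_ascii=False, indent=2))
```

In this layout each answer is a literal span of its context paragraph, which is what makes a SQuAD-format dataset directly usable with extractive MRC models and span-based evaluation.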
Keywords
- Domain-specific QA
- Low-resource QA dataset
- Tigrinya QA dataset
ASJC Scopus subject areas
- Mathematics (all)
- Theoretical Computer Science
- Computer Science (all)
- Computational Theory and Mathematics
- Computer Science Applications
Cite this
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation : LREC-COLING 2024. ed. / Nicoletta Calzolari; Min-Yen Kan; Veronique Hoste; Alessandro Lenci; Sakriani Sakti; Nianwen Xue. 2024. p. 16142-16161.
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - TIGQA
T2 - Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024
AU - Teklehaymanot, Hailay Kidu
AU - Fazlija, Dren
AU - Ganguly, Niloy
AU - Patro, Gourab K.
AU - Nejdl, Wolfgang
N1 - Publisher Copyright: © 2024 ELRA Language Resource Association: CC BY-NC 4.0.
PY - 2024
Y1 - 2024
N2 - The absence of explicitly tailored, accessible annotated datasets for educational purposes presents a notable obstacle for NLP tasks in languages with limited resources. This study initially explores the feasibility of using machine translation (MT) to convert an existing dataset into a Tigrinya dataset in SQuAD format. As a result, we present TIGQA, an expert-annotated dataset containing 2,685 question-answer pairs covering 122 diverse topics such as climate, water, and traffic. These pairs are from 537 context paragraphs in publicly accessible Tigrinya and Biology books. Through comprehensive analyses, we demonstrate that the TIGQA dataset requires skills beyond simple word matching, requiring both single-sentence and multiple-sentence inference abilities. We conduct experiments using state-of-the-art MRC methods, marking the first exploration of such models on TIGQA. Additionally, we estimate human performance on the dataset and juxtapose it with the results obtained from pre-trained models. The notable disparities between human performance and the best model performance underscore the potential for future enhancements to TIGQA through continued research. Our dataset is freely accessible via the provided link to encourage the research community to address the challenges in the Tigrinya MRC.
AB - The absence of explicitly tailored, accessible annotated datasets for educational purposes presents a notable obstacle for NLP tasks in languages with limited resources. This study initially explores the feasibility of using machine translation (MT) to convert an existing dataset into a Tigrinya dataset in SQuAD format. As a result, we present TIGQA, an expert-annotated dataset containing 2,685 question-answer pairs covering 122 diverse topics such as climate, water, and traffic. These pairs are from 537 context paragraphs in publicly accessible Tigrinya and Biology books. Through comprehensive analyses, we demonstrate that the TIGQA dataset requires skills beyond simple word matching, requiring both single-sentence and multiple-sentence inference abilities. We conduct experiments using state-of-the-art MRC methods, marking the first exploration of such models on TIGQA. Additionally, we estimate human performance on the dataset and juxtapose it with the results obtained from pre-trained models. The notable disparities between human performance and the best model performance underscore the potential for future enhancements to TIGQA through continued research. Our dataset is freely accessible via the provided link to encourage the research community to address the challenges in the Tigrinya MRC.
KW - domain specific QA
KW - Low resource QA dataset
KW - Tigrinya QA dataset
UR - http://www.scopus.com/inward/record.url?scp=85195903895&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85195903895
SP - 16142
EP - 16161
BT - Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation
A2 - Calzolari, Nicoletta
A2 - Kan, Min-Yen
A2 - Hoste, Veronique
A2 - Lenci, Alessandro
A2 - Sakti, Sakriani
A2 - Xue, Nianwen
Y2 - 20 May 2024 through 25 May 2024
ER -