Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Publication: Contribution to book/report/anthology/conference proceedings › Conference paper › Research › Peer-reviewed

Authors

  • Soumyadeep Roy
  • Aparup Khatua
  • Fatemeh Ghoochani
  • Uwe Hadler
  • Wolfgang Nejdl
  • Niloy Ganguly

Organisational units

External organisations

  • Indian Institute of Technology Kharagpur (IITKGP)
  • University of Michigan

Details

Original language: English
Title of host publication: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval
Pages: 1073-1082
Number of pages: 10
ISBN (electronic): 9798400704314
Publication status: Published - 11 July 2024
Event: 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024 - Washington, United States
Duration: 14 July 2024 to 18 July 2024

Abstract

GPT-4 demonstrates high accuracy in medical QA tasks, leading with an accuracy of 86.70%, followed by Med-PaLM 2 at 86.50%. However, an error rate of around 14% remains. Additionally, current works use GPT-4 only to predict the correct option without providing any explanation, and thus offer no insight into the thinking process and reasoning used by GPT-4 or other LLMs. Therefore, we introduce a new domain-specific error taxonomy derived from collaboration with medical students. Our GPT-4 USMLE Error (G4UE) dataset comprises 4153 correct and 919 incorrect GPT-4 responses to United States Medical Licensing Examination (USMLE) questions. These responses are quite long (258 words on average) and contain detailed explanations from GPT-4 justifying the selected option. We then launch a large-scale annotation study using the Potato annotation platform and recruit 44 medical experts through Prolific, a well-known crowdsourcing platform. We annotated 300 of these 919 incorrect data points at a granular level, assigning error classes and multi-label span annotations that identify the reasons behind the error. In our annotated dataset, a substantial portion of GPT-4's incorrect responses is categorized by annotators as a "Reasonable response by GPT-4." This sheds light on the challenge of discerning explanations that may lead to incorrect options, even for trained medical professionals. We also provide medical concepts and medical semantic predications extracted using the SemRep tool for every data point. We believe these resources will aid in evaluating the ability of LLMs to answer complex medical questions. We make the resources available at https://github.com/roysoumya/usmle-gpt4-error-taxonomy.
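
As a quick orientation to the dataset described above, the following minimal Python sketch restates the reported split in code and shows one hypothetical record layout for an annotated error. It is an illustration only: the field names, example values, and record layout are assumptions made here, not the published G4UE schema; consult the linked GitHub repository for the actual data files.

# Minimal sketch, not taken from the paper's repository. The counts come from
# the abstract above; the record layout is a hypothetical illustration of a
# multi-label, span-annotated error entry.

from dataclasses import dataclass, field

N_CORRECT = 4153     # GPT-4 responses that selected the correct option
N_INCORRECT = 919    # GPT-4 responses that selected a wrong option
N_ANNOTATED = 300    # incorrect responses annotated by the medical experts

total = N_CORRECT + N_INCORRECT
print(f"Share of incorrect responses in G4UE: {N_INCORRECT / total:.1%}")       # ~18.1%
print(f"Share of errors that were annotated:  {N_ANNOTATED / N_INCORRECT:.1%}")  # ~32.6%

@dataclass
class AnnotatedError:
    """Hypothetical shape of one annotated incorrect GPT-4 response."""
    question_id: str
    gpt4_explanation: str                          # long free-text justification (~258 words on average)
    selected_option: str                           # option chosen by GPT-4
    correct_option: str                            # gold USMLE answer key
    error_labels: list[str] = field(default_factory=list)             # multi-label error classes from the taxonomy
    error_spans: list[tuple[int, int]] = field(default_factory=list)  # character spans in the explanation marking the error

example = AnnotatedError(
    question_id="usmle-0001",
    gpt4_explanation="...",
    selected_option="B",
    correct_option="D",
    error_labels=["Reasonable response by GPT-4"],  # one of the classes mentioned in the abstract
)
print(example.error_labels)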

ASJC Scopus subject areas

Cite

Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions. / Roy, Soumyadeep; Khatua, Aparup; Ghoochani, Fatemeh et al.
Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2024. pp. 1073-1082.

Publication: Contribution to book/report/anthology/conference proceedings › Conference paper › Research › Peer-reviewed

Roy, S, Khatua, A, Ghoochani, F, Hadler, U, Nejdl, W & Ganguly, N 2024, Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions. in Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 1073-1082, 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024, Washington, United States, 14 July 2024. https://doi.org/10.48550/arXiv.2404.13307, https://doi.org/10.1145/3626772.3657882
Roy, S., Khatua, A., Ghoochani, F., Hadler, U., Nejdl, W., & Ganguly, N. (2024). Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1073-1082). https://doi.org/10.48550/arXiv.2404.13307, https://doi.org/10.1145/3626772.3657882
Roy S, Khatua A, Ghoochani F, Hadler U, Nejdl W, Ganguly N. Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2024. p. 1073-1082. doi: 10.48550/arXiv.2404.13307, 10.1145/3626772.3657882
Roy, Soumyadeep ; Khatua, Aparup ; Ghoochani, Fatemeh et al. / Beyond Accuracy : Investigating Error Types in GPT-4 Responses to USMLE Questions. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2024. pp. 1073-1082
Download (BibTeX)
@inproceedings{5591867fa11347d882999a6c95ef2c06,
title = "Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions",
abstract = "GPT-4 demonstrates high accuracy in medical QA tasks, leading with an accuracy of 86.70%, followed by Med-PaLM 2 at 86.50%. However, around 14% of errors remain. Additionally, current works use GPT-4 to only predict the correct option without providing any explanation and thus do not provide any insight into the thinking process and reasoning used by GPT-4 or other LLMs. Therefore, we introduce a new domain-specific error taxonomy derived from collaboration with medical students. Our GPT-4 USMLE Error (G4UE) dataset comprises 4153 GPT-4 correct responses and 919 incorrect responses to the United States Medical Licensing Examination (USMLE) respectively. These responses are quite long (258 words on average), containing detailed explanations from GPT-4 justifying the selected option. We then launch a large-scale annotation study using the Potato annotation platform and recruit 44 medical experts through Prolific, a well-known crowdsourcing platform. We annotated 300 out of these 919 incorrect data points at a granular level for different classes and created a multi-label span to identify the reasons behind the error. In our annotated dataset, a substantial portion of GPT-4's incorrect responses is categorized as a {"}Reasonable response by GPT-4,{"}by annotators. This sheds light on the challenge of discerning explanations that may lead to incorrect options, even among trained medical professionals. We also provide medical concepts and medical semantic predications extracted using the SemRep tool for every data point. We believe that it will aid in evaluating the ability of LLMs to answer complex medical questions. We make the resources available at https://github.com/roysoumya/usmle-gpt4-error-taxonomy.",
keywords = "gpt-4, medical qa, multi-label dataset, usmle error taxonomy",
author = "Soumyadeep Roy and Aparup Khatua and Fatemeh Ghoochani and Uwe Hadler and Wolfgang Nejdl and Niloy Ganguly",
note = "Publisher Copyright: {\textcopyright} 2024 Owner/Author.; 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024 ; Conference date: 14-07-2024 Through 18-07-2024",
year = "2024",
month = jul,
day = "11",
doi = "10.48550/arXiv.2404.13307",
language = "English",
pages = "1073--1082",
booktitle = "Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval",

}

Download (RIS)

TY - GEN

T1 - Beyond Accuracy

T2 - 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024

AU - Roy, Soumyadeep

AU - Khatua, Aparup

AU - Ghoochani, Fatemeh

AU - Hadler, Uwe

AU - Nejdl, Wolfgang

AU - Ganguly, Niloy

N1 - Publisher Copyright: © 2024 Owner/Author.

PY - 2024/7/11

Y1 - 2024/7/11

N2 - GPT-4 demonstrates high accuracy in medical QA tasks, leading with an accuracy of 86.70%, followed by Med-PaLM 2 at 86.50%. However, an error rate of around 14% remains. Additionally, current works use GPT-4 only to predict the correct option without providing any explanation, and thus offer no insight into the thinking process and reasoning used by GPT-4 or other LLMs. Therefore, we introduce a new domain-specific error taxonomy derived from collaboration with medical students. Our GPT-4 USMLE Error (G4UE) dataset comprises 4153 correct and 919 incorrect GPT-4 responses to United States Medical Licensing Examination (USMLE) questions. These responses are quite long (258 words on average) and contain detailed explanations from GPT-4 justifying the selected option. We then launch a large-scale annotation study using the Potato annotation platform and recruit 44 medical experts through Prolific, a well-known crowdsourcing platform. We annotated 300 of these 919 incorrect data points at a granular level, assigning error classes and multi-label span annotations that identify the reasons behind the error. In our annotated dataset, a substantial portion of GPT-4's incorrect responses is categorized by annotators as a "Reasonable response by GPT-4." This sheds light on the challenge of discerning explanations that may lead to incorrect options, even for trained medical professionals. We also provide medical concepts and medical semantic predications extracted using the SemRep tool for every data point. We believe these resources will aid in evaluating the ability of LLMs to answer complex medical questions. We make the resources available at https://github.com/roysoumya/usmle-gpt4-error-taxonomy.

AB - GPT-4 demonstrates high accuracy in medical QA tasks, leading with an accuracy of 86.70%, followed by Med-PaLM 2 at 86.50%. However, an error rate of around 14% remains. Additionally, current works use GPT-4 only to predict the correct option without providing any explanation, and thus offer no insight into the thinking process and reasoning used by GPT-4 or other LLMs. Therefore, we introduce a new domain-specific error taxonomy derived from collaboration with medical students. Our GPT-4 USMLE Error (G4UE) dataset comprises 4153 correct and 919 incorrect GPT-4 responses to United States Medical Licensing Examination (USMLE) questions. These responses are quite long (258 words on average) and contain detailed explanations from GPT-4 justifying the selected option. We then launch a large-scale annotation study using the Potato annotation platform and recruit 44 medical experts through Prolific, a well-known crowdsourcing platform. We annotated 300 of these 919 incorrect data points at a granular level, assigning error classes and multi-label span annotations that identify the reasons behind the error. In our annotated dataset, a substantial portion of GPT-4's incorrect responses is categorized by annotators as a "Reasonable response by GPT-4." This sheds light on the challenge of discerning explanations that may lead to incorrect options, even for trained medical professionals. We also provide medical concepts and medical semantic predications extracted using the SemRep tool for every data point. We believe these resources will aid in evaluating the ability of LLMs to answer complex medical questions. We make the resources available at https://github.com/roysoumya/usmle-gpt4-error-taxonomy.

KW - gpt-4

KW - medical qa

KW - multi-label dataset

KW - usmle error taxonomy

UR - http://www.scopus.com/inward/record.url?scp=85199188807&partnerID=8YFLogxK

U2 - 10.48550/arXiv.2404.13307

DO - 10.48550/arXiv.2404.13307

M3 - Conference contribution

AN - SCOPUS:85199188807

SP - 1073

EP - 1082

BT - Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval

Y2 - 14 July 2024 through 18 July 2024

ER -
