Details
Original language | English |
---|---|
Title of host publication | ICTIR '23 |
Subtitle of host publication | Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval |
Pages | 39-50 |
Number of pages | 12 |
ISBN (electronic) | 9798400700736 |
Publication status | Published - 9 Aug 2023 |
Event | 9th ACM SIGIR International Conference on the Theory of Information Retrieval: ICTIR 2023 - Taipei, Taiwan. Duration: 23 Jul 2023 → 23 Jul 2023 |
Abstract
When asked, large language models (LLMs) like ChatGPT claim that they can assist with relevance judgments, but it is not clear whether automated judgments can reliably be used in evaluations of retrieval systems. In this perspectives paper, we discuss possible ways for LLMs to support relevance judgments along with concerns and issues that arise. We devise a human-machine collaboration spectrum that allows us to categorize different relevance judgment strategies, based on how much humans rely on machines. For the extreme point of 'fully automated judgments', we further include a pilot experiment on whether LLM-based relevance judgments correlate with judgments from trained human assessors. We conclude the paper by providing opposing perspectives for and against the use of LLMs for automatic relevance judgments, as well as a compromise perspective, informed by our analyses of the literature, our preliminary experimental evidence, and our experience as IR researchers.
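The pilot experiment asks whether LLM-based relevance judgments correlate with those of trained human assessors. One common way to quantify such agreement between two annotators is Cohen's kappa, which corrects raw agreement for chance. The sketch below is purely illustrative and not taken from the paper; the label lists are hypothetical binary relevance judgments (1 = relevant, 0 = not relevant).

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each annotator's label marginals.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical judgments from a human assessor and an LLM on 8 documents.
human = [1, 0, 1, 1, 0, 0, 1, 0]
llm   = [1, 0, 1, 0, 0, 0, 1, 1]
print(cohens_kappa(human, llm))  # → 0.5
```

A kappa of 0 indicates chance-level agreement and 1 perfect agreement; studies of this kind typically report kappa (or rank correlations over system scores) rather than raw accuracy, since relevance labels are often heavily imbalanced.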
Keywords
- automatic test collections, human-machine collaboration, large language models, relevance judgments
ASJC Scopus subject areas
- Computer Science (all)
- Computer Science (miscellaneous)
- Information Systems
Cite this
Faggioli, G., Clarke, C. L. A., Demartini, G., Hagen, M., Hauff, C., Kando, N., Kanoulas, E., Potthast, M., Stein, B., Wachsmuth, H., & Dietz, L. (2023). Perspectives on Large Language Models for Relevance Judgment. In ICTIR '23: Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval (pp. 39-50).
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Perspectives on Large Language Models for Relevance Judgment
AU - Faggioli, Guglielmo
AU - Clarke, Charles L.A.
AU - Demartini, Gianluca
AU - Hagen, Matthias
AU - Hauff, Claudia
AU - Kando, Noriko
AU - Kanoulas, Evangelos
AU - Potthast, Martin
AU - Stein, Benno
AU - Wachsmuth, Henning
AU - Dietz, Laura
N1 - Funding Information: This material is based upon work supported by the National Science Foundation under Grant No. 1846017. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
PY - 2023/8/9
Y1 - 2023/8/9
N2 - When asked, large language models (LLMs) like ChatGPT claim that they can assist with relevance judgments, but it is not clear whether automated judgments can reliably be used in evaluations of retrieval systems. In this perspectives paper, we discuss possible ways for LLMs to support relevance judgments along with concerns and issues that arise. We devise a human-machine collaboration spectrum that allows us to categorize different relevance judgment strategies, based on how much humans rely on machines. For the extreme point of 'fully automated judgments', we further include a pilot experiment on whether LLM-based relevance judgments correlate with judgments from trained human assessors. We conclude the paper by providing opposing perspectives for and against the use of LLMs for automatic relevance judgments, as well as a compromise perspective, informed by our analyses of the literature, our preliminary experimental evidence, and our experience as IR researchers.
AB - When asked, large language models (LLMs) like ChatGPT claim that they can assist with relevance judgments, but it is not clear whether automated judgments can reliably be used in evaluations of retrieval systems. In this perspectives paper, we discuss possible ways for LLMs to support relevance judgments along with concerns and issues that arise. We devise a human-machine collaboration spectrum that allows us to categorize different relevance judgment strategies, based on how much humans rely on machines. For the extreme point of 'fully automated judgments', we further include a pilot experiment on whether LLM-based relevance judgments correlate with judgments from trained human assessors. We conclude the paper by providing opposing perspectives for and against the use of LLMs for automatic relevance judgments, as well as a compromise perspective, informed by our analyses of the literature, our preliminary experimental evidence, and our experience as IR researchers.
KW - automatic test collections
KW - human-machine collaboration
KW - large language models
KW - relevance judgments
UR - http://www.scopus.com/inward/record.url?scp=85171444604&partnerID=8YFLogxK
U2 - 10.48550/arXiv.2304.09161
DO - 10.48550/arXiv.2304.09161
M3 - Conference contribution
AN - SCOPUS:85171444604
SP - 39
EP - 50
BT - ICTIR '23
T2 - 9th ACM SIGIR International Conference on the Theory of Information Retrieval
Y2 - 23 July 2023 through 23 July 2023
ER -