Perspectives on Large Language Models for Relevance Judgment

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Authors

  • Guglielmo Faggioli
  • Charles L.A. Clarke
  • Gianluca Demartini
  • Matthias Hagen
  • Claudia Hauff
  • Noriko Kando
  • Evangelos Kanoulas
  • Martin Potthast
  • Benno Stein
  • Henning Wachsmuth
  • Laura Dietz


External Research Organisations

  • University of Padova
  • University of Waterloo
  • University of Queensland
  • Friedrich Schiller University Jena
  • Spotify
  • Research Organization of Information and Systems, National Institute of Informatics
  • University of Amsterdam
  • Leipzig University
  • Bauhaus-Universität Weimar
  • University of New Hampshire

Details

Original language: English
Title of host publication: ICTIR '23
Subtitle of host publication: Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval
Pages: 39-50
Number of pages: 12
ISBN (electronic): 9798400700736
Publication status: Published - 9 Aug 2023
Event: 9th ACM SIGIR International Conference on the Theory of Information Retrieval: ICTIR 2023 - Taipei, Taiwan
Duration: 23 Jul 2023 - 23 Jul 2023

Abstract

When asked, large language models (LLMs) like ChatGPT claim that they can assist with relevance judgments, but it is not clear whether automated judgments can reliably be used in evaluations of retrieval systems. In this perspectives paper, we discuss possible ways for LLMs to support relevance judgments along with concerns and issues that arise. We devise a human-machine collaboration spectrum that allows us to categorize different relevance judgment strategies, based on how much humans rely on machines. For the extreme point of 'fully automated judgments', we further include a pilot experiment on whether LLM-based relevance judgments correlate with judgments from trained human assessors. We conclude the paper by providing opposing perspectives for and against the use of LLMs for automatic relevance judgments, and a compromise perspective, informed by our analyses of the literature, our preliminary experimental evidence, and our experience as IR researchers.
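
This record does not include the pilot experiment's materials, so the following minimal Python sketch only illustrates what a 'fully automated judgments' setup could look like: it elicits a graded relevance label from an LLM for each query-document pair and measures agreement with labels from trained human assessors. The prompt wording, the 0-3 grade scale, the ask_llm() stub, and the agreement measures (Cohen's kappa, Kendall's tau) are illustrative assumptions, not the authors' setup.

# Illustrative sketch only, not the paper's code: label query-document pairs
# with an LLM and compare against human assessor labels for the same pairs.
# ask_llm() is a hypothetical stub; wire it to a chat-completion API of your
# choice. Prompt, grade scale, and agreement measures are all assumptions.

from scipy.stats import kendalltau
from sklearn.metrics import cohen_kappa_score

PROMPT = (
    "Rate the relevance of the document to the query from 0 (not relevant) "
    "to 3 (perfectly relevant). Reply with a single digit.\n"
    "Query: {query}\nDocument: {document}\nRelevance:"
)

def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with an actual chat-completion request."""
    raise NotImplementedError

def llm_relevance(query: str, document: str) -> int:
    """Parse one graded relevance label (0-3) out of the model's reply."""
    reply = ask_llm(PROMPT.format(query=query, document=document))
    digits = [c for c in reply if c in "0123"]
    return int(digits[0]) if digits else 0  # fall back to 'not relevant'

def agreement(human: list[int], machine: list[int]) -> dict[str, float]:
    """Agreement between human qrels and LLM labels on aligned pairs."""
    tau, _pvalue = kendalltau(human, machine)
    return {
        "cohen_kappa": cohen_kappa_score(human, machine),
        "kendall_tau": tau,
    }

Given aligned label lists for the same (query, document) pairs, agreement(human, machine) returns both measures; low agreement in such a pilot would caution against replacing trained assessors with fully automated judgments.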

Keywords

    automatic test collections, human-machine collaboration, large language models, relevance judgments

Cite this

Perspectives on Large Language Models for Relevance Judgment. / Faggioli, Guglielmo; Clarke, Charles L.A.; Demartini, Gianluca et al.
ICTIR '23: Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval. 2023. p. 39-50.


Faggioli, G, Clarke, CLA, Demartini, G, Hagen, M, Hauff, C, Kando, N, Kanoulas, E, Potthast, M, Stein, B, Wachsmuth, H & Dietz, L 2023, Perspectives on Large Language Models for Relevance Judgment. in ICTIR '23: Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval. pp. 39-50, 9th ACM SIGIR International Conference on the Theory of Information Retrieval, Taipei, Taiwan, 23 Jul 2023. https://doi.org/10.48550/arXiv.2304.09161, https://doi.org/10.1145/3578337.3605136
Faggioli, G., Clarke, C. L. A., Demartini, G., Hagen, M., Hauff, C., Kando, N., Kanoulas, E., Potthast, M., Stein, B., Wachsmuth, H., & Dietz, L. (2023). Perspectives on Large Language Models for Relevance Judgment. In ICTIR '23: Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval (pp. 39-50) https://doi.org/10.48550/arXiv.2304.09161, https://doi.org/10.1145/3578337.3605136
Faggioli G, Clarke CLA, Demartini G, Hagen M, Hauff C, Kando N et al. Perspectives on Large Language Models for Relevance Judgment. In ICTIR '23: Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval. 2023. p. 39-50 doi: 10.48550/arXiv.2304.09161, 10.1145/3578337.3605136
Faggioli, Guglielmo ; Clarke, Charles L.A. ; Demartini, Gianluca et al. / Perspectives on Large Language Models for Relevance Judgment. ICTIR '23: Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval. 2023. pp. 39-50
@inproceedings{7582791698b742b2a1dffcd353510682,
title = "Perspectives on Large Language Models for Relevance Judgment",
abstract = "When asked, large language models∼(LLMs) like ChatGPT claim that they can assist with relevance judgments but it is not clear whether automated judgments can reliably be used in evaluations of retrieval systems. In this perspectives paper, we discuss possible ways for∼LLMs to support relevance judgments along with concerns and issues that arise. We devise a human - machine collaboration spectrum that allows to categorize different relevance judgment strategies, based on how much humans rely on machines. For the extreme point of 'fully automated judgments', we further include a pilot experiment on whether LLM-based relevance judgments correlate with judgments from trained human assessors. We conclude the paper by providing opposing perspectives for and against the use of∼LLMs for automatic relevance judgments, and a compromise perspective, informed by our analyses of the literature, our preliminary experimental evidence, and our experience as IR∼researchers.",
keywords = "automatic test collections, human - machine collaboration, large language models, relevance judgments",
author = "Guglielmo Faggioli and Clarke, {Charles L.A.} and Gianluca Demartini and Matthias Hagen and Claudia Hauff and Noriko Kando and Evangelos Kanoulas and Martin Potthast and Benno Stein and Henning Wachsmuth and Laura Dietz",
note = "Funding Information: This material is based upon work supported by the National Science Foundation under Grant No. 1846017. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. ; 9th ACM SIGIR International Conference on the Theory of Information Retrieval : ICTIR 2023 ; Conference date: 23-07-2023 Through 23-07-2023",
year = "2023",
month = aug,
day = "9",
doi = "10.48550/arXiv.2304.09161",
language = "English",
pages = "39--50",
booktitle = "ICTIR '23",

}


TY - GEN

T1 - Perspectives on Large Language Models for Relevance Judgment

AU - Faggioli, Guglielmo

AU - Clarke, Charles L.A.

AU - Demartini, Gianluca

AU - Hagen, Matthias

AU - Hauff, Claudia

AU - Kando, Noriko

AU - Kanoulas, Evangelos

AU - Potthast, Martin

AU - Stein, Benno

AU - Wachsmuth, Henning

AU - Dietz, Laura

N1 - Funding Information: This material is based upon work supported by the National Science Foundation under Grant No. 1846017. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

PY - 2023/8/9

Y1 - 2023/8/9

N2 - When asked, large language models (LLMs) like ChatGPT claim that they can assist with relevance judgments, but it is not clear whether automated judgments can reliably be used in evaluations of retrieval systems. In this perspectives paper, we discuss possible ways for LLMs to support relevance judgments along with concerns and issues that arise. We devise a human-machine collaboration spectrum that allows us to categorize different relevance judgment strategies, based on how much humans rely on machines. For the extreme point of 'fully automated judgments', we further include a pilot experiment on whether LLM-based relevance judgments correlate with judgments from trained human assessors. We conclude the paper by providing opposing perspectives for and against the use of LLMs for automatic relevance judgments, and a compromise perspective, informed by our analyses of the literature, our preliminary experimental evidence, and our experience as IR researchers.

AB - When asked, large language models (LLMs) like ChatGPT claim that they can assist with relevance judgments, but it is not clear whether automated judgments can reliably be used in evaluations of retrieval systems. In this perspectives paper, we discuss possible ways for LLMs to support relevance judgments along with concerns and issues that arise. We devise a human-machine collaboration spectrum that allows us to categorize different relevance judgment strategies, based on how much humans rely on machines. For the extreme point of 'fully automated judgments', we further include a pilot experiment on whether LLM-based relevance judgments correlate with judgments from trained human assessors. We conclude the paper by providing opposing perspectives for and against the use of LLMs for automatic relevance judgments, and a compromise perspective, informed by our analyses of the literature, our preliminary experimental evidence, and our experience as IR researchers.

KW - automatic test collections

KW - human-machine collaboration

KW - large language models

KW - relevance judgments

UR - http://www.scopus.com/inward/record.url?scp=85171444604&partnerID=8YFLogxK

U2 - 10.48550/arXiv.2304.09161

DO - 10.48550/arXiv.2304.09161

M3 - Conference contribution

AN - SCOPUS:85171444604

SP - 39

EP - 50

BT - ICTIR '23

T2 - 9th ACM SIGIR International Conference on the Theory of Information Retrieval

Y2 - 23 July 2023 through 23 July 2023

ER -
