Details
Original language | English |
---|---|
Title of host publication | Proceedings of the 17th ACM International Conference on Web Search and Data Mining |
Subtitle of host publication | WSDM ’24 |
Pages | 683-692 |
Number of pages | 10 |
ISBN (electronic) | 9798400703713 |
Publication status | Published - 4 Mar 2024 |
Event | 17th ACM International Conference on Web Search and Data Mining, WSDM 2024 - Merida, Mexico; Duration: 4 Mar 2024 → 8 Mar 2024 |
Abstract
Large language models (LLMs) have recently gained significant attention due to their unparalleled zero-shot performance on various natural language processing tasks. However, the pre-training data utilized in LLMs is often confined to a specific corpus, resulting in inherent freshness and temporal scope limitations. Consequently, this raises concerns regarding the effectiveness of LLMs for tasks involving temporal intents. In this study, we aim to investigate the underlying limitations of general-purpose LLMs when deployed for tasks that require a temporal understanding. We pay particular attention to handling factual temporal knowledge through three popular temporal QA datasets. Specifically, we observe low performance on detailed questions about the past and, surprisingly, for rather new information. In manual and automatic testing, we find multiple temporal errors and characterize the conditions under which QA performance deteriorates. Our analysis contributes to understanding LLM limitations and offers valuable insights into developing future models that can better cater to the demands of temporally-oriented tasks. The code is available at https://github.com/jwallat/temporalblindspots.
Keywords
- large language models, question answering, temporal information retrieval, temporal query intents
ASJC Scopus subject areas
- Computer Science(all)
- Computer Networks and Communications
- Computer Science Applications
- Software
Cite this
Proceedings of the 17th ACM International Conference on Web Search and Data Mining: WSDM ’24. 2024. p. 683-692.
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Temporal Blind Spots in Large Language Models
AU - Wallat, Jonas
AU - Jatowt, Adam
AU - Anand, Avishek
N1 - Funding Information: This research was partially funded by the Federal Ministry of Education and Research (BMBF), Germany under the project LeibnizKILabor with grant No. 01DD20003 and Cubra with grant No. 13N16052
PY - 2024/3/4
Y1 - 2024/3/4
AB - Large language models (LLMs) have recently gained significant attention due to their unparalleled zero-shot performance on various natural language processing tasks. However, the pre-training data utilized in LLMs is often confined to a specific corpus, resulting in inherent freshness and temporal scope limitations. Consequently, this raises concerns regarding the effectiveness of LLMs for tasks involving temporal intents. In this study, we aim to investigate the underlying limitations of general-purpose LLMs when deployed for tasks that require a temporal understanding. We pay particular attention to handling factual temporal knowledge through three popular temporal QA datasets. Specifically, we observe low performance on detailed questions about the past and, surprisingly, for rather new information. In manual and automatic testing, we find multiple temporal errors and characterize the conditions under which QA performance deteriorates. Our analysis contributes to understanding LLM limitations and offers valuable insights into developing future models that can better cater to the demands of temporally-oriented tasks. The code is available at https://github.com/jwallat/temporalblindspots.
KW - large language models
KW - question answering
KW - temporal information retrieval
KW - temporal query intents
UR - http://www.scopus.com/inward/record.url?scp=85191716137&partnerID=8YFLogxK
U2 - 10.48550/arXiv.2401.12078
DO - 10.48550/arXiv.2401.12078
M3 - Conference contribution
AN - SCOPUS:85191716137
SP - 683
EP - 692
BT - Proceedings of the 17th ACM International Conference on Web Search and Data Mining
T2 - 17th ACM International Conference on Web Search and Data Mining, WSDM 2024
Y2 - 4 March 2024 through 8 March 2024
ER -