Details
Original language | English |
---|---|
Title of host publication | Advances in Information Retrieval |
Subtitle | 46th European Conference on Information Retrieval, ECIR 2024 |
Editors | Nazli Goharian, Nicola Tonellotto, Yulan He, Aldo Lipani, Graham McDonald, Craig Macdonald, Iadh Ounis |
Publisher | Springer Science and Business Media Deutschland GmbH |
Pages | 108-122 |
Number of pages | 15 |
ISBN (electronic) | 978-3-031-56027-9 |
ISBN (print) | 978-3-031-56026-2 |
Publication status | Published - 20 Mar 2024 |
Event | 46th European Conference on Information Retrieval, ECIR 2024 - Glasgow, United Kingdom. Duration: 24 Mar 2024 → 28 Mar 2024 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 14608 LNCS |
ISSN (print) | 0302-9743 |
ISSN (electronic) | 1611-3349 |
Abstract
Off-policy evaluation (OPE) methods allow us to compute the expected reward of a policy by using the logged data collected by a different policy. However, when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have a high or even infinite variance. Saito and Joachims [13] propose marginalized IPS (MIPS) that uses action embeddings instead, which reduces the variance of IPS in large action spaces. MIPS assumes that good action embeddings can be defined by the practitioner, which is difficult to do in many real-world applications. In this work, we explore learning action embeddings from logged data. In particular, we use intermediate outputs of a trained reward model to define action embeddings for MIPS. This approach extends MIPS to more applications, and in our experiments improves upon MIPS with pre-defined embeddings, as well as standard baselines, both on synthetic and real-world data. Our method does not make assumptions about the reward model class, and supports using additional action information to further improve the estimates. The proposed approach presents an appealing alternative to DR for combining the low variance of DM with the low bias of IPS.
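To make the estimators mentioned in the abstract concrete, the following is a minimal, self-contained sketch (not taken from the paper) contrasting vanilla IPS with a MIPS-style estimator on a synthetic logged-bandit problem. The deterministic `action_to_emb` mapping is a hand-coded stand-in for the learned action embeddings described above, and all policy and reward choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (illustrative only): many actions, each mapped to one of a
# small number of embedding ids; the reward depends only on the embedding.
n, n_actions, n_clusters = 10_000, 100, 10
action_to_emb = rng.integers(n_clusters, size=n_actions)   # stand-in for learned embeddings
base_reward = rng.normal(size=n_clusters)[action_to_emb]   # expected reward per action

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

pi_0 = softmax(rng.normal(size=n_actions))  # logging policy (context-free for brevity)
pi_e = softmax(base_reward)                 # target policy to evaluate

# Log data under the logging policy pi_0
actions = rng.choice(n_actions, size=n, p=pi_0)
rewards = base_reward[actions] + rng.normal(scale=1.0, size=n)

# Vanilla IPS: per-action importance weights, high variance when n_actions is large
w_ips = pi_e[actions] / pi_0[actions]
v_ips = np.mean(w_ips * rewards)

# MIPS-style estimate: marginalise both policies over the embedding ids, so the
# importance weight lives in the (much smaller) embedding space
p_emb_e = np.bincount(action_to_emb, weights=pi_e, minlength=n_clusters)
p_emb_0 = np.bincount(action_to_emb, weights=pi_0, minlength=n_clusters)
w_mips = p_emb_e[action_to_emb[actions]] / p_emb_0[action_to_emb[actions]]
v_mips = np.mean(w_mips * rewards)

true_value = float(pi_e @ base_reward)
print(f"true value {true_value:.3f} | IPS {v_ips:.3f} | MIPS-style {v_mips:.3f}")
```

In the paper the embeddings are obtained from intermediate outputs of a trained reward model rather than being hand-coded, but the marginalisation step that reduces the variance of the importance weights is the same.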
ASJC Scopus subject areas
- Mathematics (all)
- Theoretical Computer Science
- Computer Science (all)
- General Computer Science
Cite
Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024. Ed. / Nazli Goharian; Nicola Tonellotto; Yulan He; Aldo Lipani; Graham McDonald; Craig Macdonald; Iadh Ounis. Springer Science and Business Media Deutschland GmbH, 2024. p. 108-122 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 14608 LNCS).
Publication: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › Peer-reviewed
TY - GEN
T1 - Learning Action Embeddings for Off-Policy Evaluation
AU - Cief, Matej
AU - Golebiowski, Jacek
AU - Schmidt, Philipp
AU - Abedjan, Ziawasch
AU - Bekasov, Artur
N1 - Funding Information: The research conducted by Matej Cief (also with slovak.AI) was partially supported by TAILOR, a project funded by EU Horizon 2020 under GA No. 952215, https://doi.org/10.3030/952215.
PY - 2024/3/20
Y1 - 2024/3/20
N2 - Off-policy evaluation (OPE) methods allow us to compute the expected reward of a policy by using the logged data collected by a different policy. However, when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have a high or even infinite variance. Saito and Joachims [13] propose marginalized IPS (MIPS) that uses action embeddings instead, which reduces the variance of IPS in large action spaces. MIPS assumes that good action embeddings can be defined by the practitioner, which is difficult to do in many real-world applications. In this work, we explore learning action embeddings from logged data. In particular, we use intermediate outputs of a trained reward model to define action embeddings for MIPS. This approach extends MIPS to more applications, and in our experiments improves upon MIPS with pre-defined embeddings, as well as standard baselines, both on synthetic and real-world data. Our method does not make assumptions about the reward model class, and supports using additional action information to further improve the estimates. The proposed approach presents an appealing alternative to DR for combining the low variance of DM with the low bias of IPS.
AB - Off-policy evaluation (OPE) methods allow us to compute the expected reward of a policy by using the logged data collected by a different policy. However, when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have a high or even infinite variance. Saito and Joachims [13] propose marginalized IPS (MIPS) that uses action embeddings instead, which reduces the variance of IPS in large action spaces. MIPS assumes that good action embeddings can be defined by the practitioner, which is difficult to do in many real-world applications. In this work, we explore learning action embeddings from logged data. In particular, we use intermediate outputs of a trained reward model to define action embeddings for MIPS. This approach extends MIPS to more applications, and in our experiments improves upon MIPS with pre-defined embeddings, as well as standard baselines, both on synthetic and real-world data. Our method does not make assumptions about the reward model class, and supports using additional action information to further improve the estimates. The proposed approach presents an appealing alternative to DR for combining the low variance of DM with the low bias of IPS.
KW - large action space
KW - multi-armed bandits
KW - off-policy evaluation
KW - recommender systems
KW - representation learning
UR - http://www.scopus.com/inward/record.url?scp=85189744882&partnerID=8YFLogxK
U2 - 10.48550/arXiv.2305.03954
DO - 10.48550/arXiv.2305.03954
M3 - Conference contribution
AN - SCOPUS:85189744882
SN - 9783031560262
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 108
EP - 122
BT - Advances in Information Retrieval
A2 - Goharian, Nazli
A2 - Tonellotto, Nicola
A2 - He, Yulan
A2 - Lipani, Aldo
A2 - McDonald, Graham
A2 - Macdonald, Craig
A2 - Ounis, Iadh
PB - Springer Science and Business Media Deutschland GmbH
T2 - 46th European Conference on Information Retrieval, ECIR 2024
Y2 - 24 March 2024 through 28 March 2024
ER -