Details
Original language | English |
---|---|
Journal | Transactions on Machine Learning Research |
Volume | 2023 |
Issue number | 4 |
Publication status | Published - Apr 2023 |
Abstract
Keywords
- cs.LG, cs.RO
In: Transactions on Machine Learning Research, Vol. 2023, No. 4, 04.2023.
Research output: Contribution to journal › Article › Research › peer review
TY - JOUR
T1 - POLTER
T2 - Policy Trajectory Ensemble Regularization for Unsupervised Reinforcement Learning
AU - Schubert, Frederik
AU - Benjamins, Carolin
AU - Döhler, Sebastian
AU - Rosenhahn, Bodo
AU - Lindauer, Marius
PY - 2023/4
Y1 - 2023/4
AB - The goal of Unsupervised Reinforcement Learning (URL) is to find a reward-agnostic prior policy on a task domain, such that the sample-efficiency on supervised downstream tasks is improved. Although agents initialized with such a prior policy can achieve a significantly higher reward with fewer samples when finetuned on the downstream task, it is still an open question how an optimal pretrained prior policy can be achieved in practice. In this work, we present POLTER (Policy Trajectory Ensemble Regularization) – a general method to regularize the pretraining that can be applied to any URL algorithm and is especially useful on data- and knowledge-based URL algorithms. It utilizes an ensemble of policies that are discovered during pretraining and moves the policy of the URL algorithm closer to its optimal prior. Our method is based on a theoretical framework, and we analyze its practical effects on a white-box benchmark, allowing us to study POLTER with full control. In our main experiments, we evaluate POLTER on the Unsupervised Reinforcement Learning Benchmark (URLB), which consists of 12 tasks in 3 domains. We demonstrate the generality of our approach by improving the performance of a diverse set of data- and knowledge-based URL algorithms by 19% on average and up to 40% in the best case. Under a fair comparison with tuned baselines and tuned POLTER, we establish a new state-of-the-art for model-free methods on the URLB.
KW - cs.LG
KW - cs.RO
U2 - 10.48550/arXiv.2205.11357
DO - 10.48550/arXiv.2205.11357
M3 - Article
VL - 2023
JO - Transactions on Machine Learning Research
JF - Transactions on Machine Learning Research
SN - 2835-8856
IS - 4
ER -