POLTER: Policy Trajectory Ensemble Regularization for Unsupervised Reinforcement Learning

Frederik Schubert; Carolin Benjamins; Sebastian Döhler; Bodo Rosenhahn; Marius Lindauer

doi:10.48550/arXiv.2205.11357

Details

Original language	English
Journal	Transactions on Machine Learning Research
Volume	2023
Issue number	4
Publication status	Published - Apr 2023

Abstract

The goal of Unsupervised Reinforcement Learning (URL) is to find a reward-agnostic prior policy on a task domain, such that the sample-efficiency on supervised downstream tasks is improved. Although agents initialized with such a prior policy can achieve a significantly higher reward with fewer samples when finetuned on the downstream task, it is still an open question how an optimal pretrained prior policy can be achieved in practice. In this work, we present POLTER (Policy Trajectory Ensemble Regularization), a general method to regularize the pretraining that can be applied to any URL algorithm and is especially useful on data- and knowledge-based URL algorithms. It utilizes an ensemble of policies that are discovered during pretraining and moves the policy of the URL algorithm closer to its optimal prior. Our method is based on a theoretical framework, and we analyze its practical effects on a white-box benchmark, allowing us to study POLTER with full control. In our main experiments, we evaluate POLTER on the Unsupervised Reinforcement Learning Benchmark (URLB), which consists of 12 tasks in 3 domains. We demonstrate the generality of our approach by improving the performance of a diverse set of data- and knowledge-based URL algorithms by 19% on average and up to 40% in the best case. Under a fair comparison with tuned baselines and tuned POLTER, we establish a new state-of-the-art for model-free methods on the URLB.

Keywords

cs.LG, cs.RO

Cite this

POLTER: Policy Trajectory Ensemble Regularization for Unsupervised Reinforcement Learning. / Schubert, Frederik; Benjamins, Carolin; Döhler, Sebastian et al.
In: Transactions on Machine Learning Research, Vol. 2023, No. 4, 04.2023.

Research output: Contribution to journal › Article › Research › peer review

Schubert, F, Benjamins, C, Döhler, S, Rosenhahn, B & Lindauer, M 2023, 'POLTER: Policy Trajectory Ensemble Regularization for Unsupervised Reinforcement Learning', Transactions on Machine Learning Research, vol. 2023, no. 4. https://doi.org/10.48550/arXiv.2205.11357

Schubert, F., Benjamins, C., Döhler, S., Rosenhahn, B., & Lindauer, M. (2023). POLTER: Policy Trajectory Ensemble Regularization for Unsupervised Reinforcement Learning. Transactions on Machine Learning Research, 2023(4). https://doi.org/10.48550/arXiv.2205.11357

Schubert F, Benjamins C, Döhler S, Rosenhahn B, Lindauer M. POLTER: Policy Trajectory Ensemble Regularization for Unsupervised Reinforcement Learning. Transactions on Machine Learning Research. 2023 Apr;2023(4). doi: 10.48550/arXiv.2205.11357

Schubert, Frederik; Benjamins, Carolin; Döhler, Sebastian et al. / POLTER: Policy Trajectory Ensemble Regularization for Unsupervised Reinforcement Learning. In: Transactions on Machine Learning Research. 2023; Vol. 2023, No. 4.

BibTeX

@article{94b4cdfd591745eb96f2700c5cf0e7a5,

title = "POLTER: Policy Trajectory Ensemble Regularization for Unsupervised Reinforcement Learning",

abstract = "The goal of Unsupervised Reinforcement Learning (URL) is to find a reward-agnostic prior policy on a task domain, such that the sample-efficiency on supervised downstream tasks is improved. Although agents initialized with such a prior policy can achieve a significantly higher reward with fewer samples when finetuned on the downstream task, it is still an open question how an optimal pretrained prior policy can be achieved in practice. In this work, we present POLTER (Policy Trajectory Ensemble Regularization), a general method to regularize the pretraining that can be applied to any URL algorithm and is especially useful on data- and knowledge-based URL algorithms. It utilizes an ensemble of policies that are discovered during pretraining and moves the policy of the URL algorithm closer to its optimal prior. Our method is based on a theoretical framework, and we analyze its practical effects on a white-box benchmark, allowing us to study POLTER with full control. In our main experiments, we evaluate POLTER on the Unsupervised Reinforcement Learning Benchmark (URLB), which consists of 12 tasks in 3 domains. We demonstrate the generality of our approach by improving the performance of a diverse set of data- and knowledge-based URL algorithms by 19\% on average and up to 40\% in the best case. Under a fair comparison with tuned baselines and tuned POLTER, we establish a new state-of-the-art for model-free methods on the URLB.",

keywords = "cs.LG, cs.RO",

author = "Frederik Schubert and Carolin Benjamins and Sebastian D{\"o}hler and Bodo Rosenhahn and Marius Lindauer",

year = "2023",

month = apr,

doi = "10.48550/arXiv.2205.11357",

journal = "Transactions on Machine Learning Research",

language = "English",

volume = "2023",

number = "4",

}

RIS

TY - JOUR

T1 - POLTER

T2 - Policy Trajectory Ensemble Regularization for Unsupervised Reinforcement Learning

AU - Schubert, Frederik

AU - Benjamins, Carolin

AU - Döhler, Sebastian

AU - Rosenhahn, Bodo

AU - Lindauer, Marius

PY - 2023/4

Y1 - 2023/4

N2 - The goal of Unsupervised Reinforcement Learning (URL) is to find a reward-agnostic prior policy on a task domain, such that the sample-efficiency on supervised downstream tasks is improved. Although agents initialized with such a prior policy can achieve a significantly higher reward with fewer samples when finetuned on the downstream task, it is still an open question how an optimal pretrained prior policy can be achieved in practice. In this work, we present POLTER (Policy Trajectory Ensemble Regularization), a general method to regularize the pretraining that can be applied to any URL algorithm and is especially useful on data- and knowledge-based URL algorithms. It utilizes an ensemble of policies that are discovered during pretraining and moves the policy of the URL algorithm closer to its optimal prior. Our method is based on a theoretical framework, and we analyze its practical effects on a white-box benchmark, allowing us to study POLTER with full control. In our main experiments, we evaluate POLTER on the Unsupervised Reinforcement Learning Benchmark (URLB), which consists of 12 tasks in 3 domains. We demonstrate the generality of our approach by improving the performance of a diverse set of data- and knowledge-based URL algorithms by 19% on average and up to 40% in the best case. Under a fair comparison with tuned baselines and tuned POLTER, we establish a new state-of-the-art for model-free methods on the URLB.

AB - The goal of Unsupervised Reinforcement Learning (URL) is to find a reward-agnostic prior policy on a task domain, such that the sample-efficiency on supervised downstream tasks is improved. Although agents initialized with such a prior policy can achieve a significantly higher reward with fewer samples when finetuned on the downstream task, it is still an open question how an optimal pretrained prior policy can be achieved in practice. In this work, we present POLTER (Policy Trajectory Ensemble Regularization), a general method to regularize the pretraining that can be applied to any URL algorithm and is especially useful on data- and knowledge-based URL algorithms. It utilizes an ensemble of policies that are discovered during pretraining and moves the policy of the URL algorithm closer to its optimal prior. Our method is based on a theoretical framework, and we analyze its practical effects on a white-box benchmark, allowing us to study POLTER with full control. In our main experiments, we evaluate POLTER on the Unsupervised Reinforcement Learning Benchmark (URLB), which consists of 12 tasks in 3 domains. We demonstrate the generality of our approach by improving the performance of a diverse set of data- and knowledge-based URL algorithms by 19% on average and up to 40% in the best case. Under a fair comparison with tuned baselines and tuned POLTER, we establish a new state-of-the-art for model-free methods on the URLB.

KW - cs.LG

KW - cs.RO

U2 - 10.48550/arXiv.2205.11357

DO - 10.48550/arXiv.2205.11357

M3 - Article

VL - 2023

JO - Transactions on Machine Learning Research

JF - Transactions on Machine Learning Research

SN - 2835-8856

IS - 4

ER -
