
Moments Matter: Stabilizing Policy Optimization using Return Distributions

Research output: Chapter in book/report/conference proceeding › Conference abstract › Research › peer review

Details

Original language: English
Title of host publication: 2025 Multi-disciplinary Conference on Reinforcement Learning and Decision Making (RLDM 2025)
Publication status: Accepted/In press - 15 Feb 2025

Abstract

Deep Reinforcement Learning (RL) agents often learn policies that achieve the same episodic return yet behave very differently, due to a combination of environmental (random transitions, initial conditions, reward noise) and algorithmic (minibatch selection, exploration noise) factors. In continuous control tasks, even small parameter shifts can produce unstable gaits, complicating both algorithm comparison and real-world transfer. Previous work has shown that such instability arises when policy updates traverse noisy neighborhoods and that the negative tail of the post-update return distribution R(θ) – obtained by repeatedly sampling minibatches, updating θ, and measuring final returns – is a useful indicator of this noise. Although explicitly constraining the policy to maintain a narrow R(θ) can improve stability, directly estimating R(θ) is computationally expensive in high-dimensional settings. We propose an alternative that takes advantage of environmental stochasticity to mitigate update-induced variability. Specifically, we model the state-action return distribution via a distributional critic and then bias the advantage function of PPO using higher-order moments (skewness and kurtosis) of this distribution. By penalizing extreme tail behaviors, our method discourages policies from entering parameter regimes prone to instability. We hypothesize that in environments where post-update critic values align poorly with post-update returns, standard PPO struggles to produce a narrow R(θ). In such cases, our moment-based correction significantly narrows R(θ), improving stability by up to 75% in Walker2D while preserving comparable evaluation returns.
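
The abstract does not spell out the exact form of the moment-based advantage correction, so the snippet below is only a minimal sketch of the general idea: estimate skewness and excess kurtosis from the quantiles of a distributional critic and subtract a tail penalty from the standard GAE advantage. The function name moment_biased_advantage, the penalty form, and the coefficients alpha and beta are illustrative assumptions, not details taken from the paper.

import numpy as np

def moment_biased_advantage(advantages, quantiles, alpha=0.1, beta=0.1):
    # advantages: (T,) GAE advantages from the standard PPO rollout
    # quantiles:  (T, N) quantile estimates of the state-action return
    #             distribution produced by a distributional critic
    # alpha, beta: illustrative penalty weights (assumed, not from the paper)
    mean = quantiles.mean(axis=1, keepdims=True)
    std = quantiles.std(axis=1, keepdims=True) + 1e-8
    z = (quantiles - mean) / std
    skew = (z ** 3).mean(axis=1)        # third standardized moment
    kurt = (z ** 4).mean(axis=1) - 3.0  # excess kurtosis
    # Penalize left-skewed and heavy-tailed return distributions so the
    # policy update is biased away from instability-prone parameter regions.
    penalty = alpha * np.clip(-skew, 0.0, None) + beta * np.clip(kurt, 0.0, None)
    return advantages - penalty

# Example usage on dummy rollout data
adv = np.random.randn(2048)                     # GAE advantages for one rollout
q = np.sort(np.random.randn(2048, 32), axis=1)  # 32 return quantiles per step
adjusted = moment_biased_advantage(adv, q)      # fed into the PPO policy loss

In such a sketch the corrected advantages would simply replace the standard ones in the clipped PPO objective; how the paper actually combines the moments with the advantage is not specified in the abstract.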

Cite this

Moments Matter: Stabilizing Policy Optimization using Return Distributions. / Jabs, Dennis; Mohan, Aditya; Lindauer, Marius.
2025 Multi-disciplinary Conference on Reinforcement Learning and Decision Making (RLDM 2025). 2025.


Jabs, D, Mohan, A & Lindauer, M 2025, Moments Matter: Stabilizing Policy Optimization using Return Distributions. in 2025 Multi-disciplinary Conference on Reinforcement Learning and Decision Making (RLDM 2025).
Jabs, D., Mohan, A., & Lindauer, M. (Accepted/in press). Moments Matter: Stabilizing Policy Optimization using Return Distributions. In 2025 Multi-disciplinary Conference on Reinforcement Learning and Decision Making (RLDM 2025)
Jabs D, Mohan A, Lindauer M. Moments Matter: Stabilizing Policy Optimization using Return Distributions. In 2025 Multi-disciplinary Conference on Reinforcement Learning and Decision Making (RLDM 2025). 2025
Jabs, Dennis ; Mohan, Aditya ; Lindauer, Marius. / Moments Matter: Stabilizing Policy Optimization using Return Distributions. 2025 Multi-disciplinary Conference on Reinforcement Learning and Decision Making (RLDM 2025). 2025.
BibTeX
@inbook{ae4b6b45499a440da00ed80cb6f6516e,
title = "Moments Matter: Stabilizing Policy Optimization using Return Distributions",
abstract = "Deep Reinforcement Learning (RL) agents often learn policies that achieve the same episodic return yet behave very differently, due to a combination of environmental (random transitions, initial conditions, reward noise) and algorithmic (minibatch selection, exploration noise) factors. In continuous control tasks, even small parameter shifts can produce unstable gaits, complicating both algorithm comparison and real-world transfer. Previous work has shown that such instability arises when policy updates traverse noisy neighborhoods and that the negative tail of the post-update return distribution R(θ) – obtained by repeatedly sampling minibatches, updating θ, and measuring final returns – is a useful indicator of this noise. Although explicitly constraining the policy to maintain a narrow R(θ) can improve stability, directly estimating R(θ) is computationally expensive in high-dimensional settings. We propose an alternative that takes advantage of environmental stochasticity to mitigate update-induced variability. Specifically, we model the state-action return distribution via a distributional critic and then bias the advantage function of PPO using higher-order moments (skewness and kurtosis) of this distribution. By penalizing extreme tail behaviors, our method discourages policies from entering parameter regimes prone to instability. We hypothesize that in environments where post-update critic values align poorly with post-update returns, standard PPO struggles to produce a narrow R(θ). In such cases, our moment-based correction significantly narrows R(θ), improving stability by up to 75% in Walker2D while preserving comparable evaluation returns.",
author = "Dennis Jabs and Aditya Mohan and Marius Lindauer",
year = "2025",
month = feb,
day = "15",
language = "English",
booktitle = "2025 Multi-disciplinary Conference on Reinforcement Learning and Decision Making (RLDM 2025)",

}

RIS

TY - CHAP

T1 - Moments Matter: Stabilizing Policy Optimization using Return Distributions

AU - Jabs, Dennis

AU - Mohan, Aditya

AU - Lindauer, Marius

PY - 2025/2/15

Y1 - 2025/2/15

N2 - Deep Reinforcement Learning (RL) agents often learn policies that achieve the same episodic return yet behave very differently, due to a combination of environmental (random transitions, initial conditions, reward noise) and algorithmic (minibatch selection, exploration noise) factors. In continuous control tasks, even small parameter shifts can produce unstable gaits, complicating both algorithm comparison and real-world transfer. Previous work has shown that such instability arises when policy updates traverse noisy neighborhoods and that the negative tail of the post-update return distribution R(θ) – obtained by repeatedly sampling minibatches, updating θ, and measuring final returns – is a useful indicator of this noise. Although explicitly constraining the policy to maintain a narrow R(θ) can improve stability, directly estimating R(θ) is computationally expensive in high-dimensional settings. We propose an alternative that takes advantage of environmental stochasticity to mitigate update-induced variability. Specifically, we model the state-action return distribution via a distributional critic and then bias the advantage function of PPO using higher-order moments (skewness and kurtosis) of this distribution. By penalizing extreme tail behaviors, our method discourages policies from entering parameter regimes prone to instability. We hypothesize that in environments where post-update critic values align poorly with post-update returns, standard PPO struggles to produce a narrow R(θ). In such cases, our moment-based correction significantly narrows R(θ), improving stability by up to 75% in Walker2D while preserving comparable evaluation returns.

AB - Deep Reinforcement Learning (RL) agents often learn policies that achieve the same episodic return yet behave very differently, due to a combination of environmental (random transitions, initial conditions, reward noise) and algorithmic (minibatch selection, exploration noise) factors. In continuous control tasks, even small parameter shifts can produce unstable gaits, complicating both algorithm comparison and real-world transfer. Previous work has shown that such instability arises when policy updates traverse noisy neighborhoods and that the negative tail of the post-update return distribution R(θ) – obtained by repeatedly sampling minibatches, updating θ, and measuring final returns – is a useful indicator of this noise. Although explicitly constraining the policy to maintain a narrow R(θ) can improve stability, directly estimating R(θ) is computationally expensive in high-dimensional settings. We propose an alternative that takes advantage of environmental stochasticity to mitigate update-induced variability. Specifically, we model the state-action return distribution via a distributional critic and then bias the advantage function of PPO using higher-order moments (skewness and kurtosis) of this distribution. By penalizing extreme tail behaviors, our method discourages policies from entering parameter regimes prone to instability. We hypothesize that in environments where post-update critic values align poorly with post-update returns, standard PPO struggles to produce a narrow R(θ). In such cases, our moment-based correction significantly narrows R(θ), improving stability by up to 75% in Walker2D while preserving comparable evaluation returns.

M3 - Conference abstract

BT - 2025 Multi-disciplinary Conference on Reinforcement Learning and Decision Making (RLDM 2025)

ER -
