
[Question] How to pass a varying gamma to DQN or PPO during training? #1889

Open
rariss opened this issue Apr 10, 2024 · 6 comments
Labels: question (Further information is requested)

rariss commented Apr 10, 2024

❓ Question

Reinforcement learning, and the SB3 implementations, apply a constant gamma when discounting future values during learning. That is fine for discrete-time environments where every action advances time by the same amount, so each step's future value is discounted by the same constant factor.

I have a custom gym environment where my environment steps in discrete decision epochs, but each action takes a different amount of time. Discounting future values at a constant rate is therefore incorrect. What I need to do is discount future values by a gamma that is a function of the time it takes to conduct the action in the environment.

Is there any way to pass in gamma as a function, or as tensors that map to each (s, a, s’, r) tuple, during learning? Maybe this is possible with existing features or callbacks? I’d like to avoid forking the repository if possible.

Any input would be appreciated as I’ve been stuck on this for some time. Thanks in advance!
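
For concreteness, here is a minimal sketch of the setup being described. The names `GAMMA_PER_UNIT_TIME`, `step_gamma`, and `info["dt"]` are illustrative, not SB3 API: the environment reports how long each action took, and the per-step discount is derived from that duration.

```python
GAMMA_PER_UNIT_TIME = 0.99  # hypothetical discount per unit of elapsed time


def step_gamma(dt: float) -> float:
    """Discount factor for a decision epoch that lasted `dt` time units."""
    return GAMMA_PER_UNIT_TIME ** dt


# Inside a custom gym.Env.step(), the duration could be exposed via `info`
# so the agent side can compute the matching discount:
#     info["dt"] = time_spent_on_this_action
#     return obs, reward, terminated, truncated, info
```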

araffin (Member) commented Apr 10, 2024

Hello,
in your case, the best option is to fork SB3 and adapt the rollout buffer/PPO.
This is too custom to be solved by callbacks or subclassing.
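
For reference, a rough sketch of the kind of fork being suggested: a rollout buffer that also stores one gamma per collected step. This assumes the layout of SB3's `RolloutBuffer` (attribute names and the `add()` signature vary between releases); the `gamma` keyword, and the code in a forked `collect_rollouts()` that would supply it, are assumptions rather than existing SB3 API.

```python
import numpy as np

from stable_baselines3.common.buffers import RolloutBuffer


class VaryingGammaRolloutBuffer(RolloutBuffer):
    """RolloutBuffer that additionally records a discount factor per step."""

    def reset(self) -> None:
        super().reset()
        # One gamma per (timestep, env); filled in during rollout collection.
        self.gammas = np.ones((self.buffer_size, self.n_envs), dtype=np.float32)

    def add(self, *args, gamma=None, **kwargs) -> None:
        # Record the per-step gamma at the current write position before the
        # base class advances self.pos. The forked collect_rollouts() would
        # compute `gamma` (e.g. from info["dt"]) and pass it here.
        if gamma is not None:
            self.gammas[self.pos] = gamma
        super().add(*args, **kwargs)
```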

rariss (Author) commented Apr 10, 2024

Thanks for your quick response.

So if I understand, you’re suggesting augmenting the replay buffer to collect time-varying gammas with each rollout, and then using those gammas from the buffer in the PPO loss function?

araffin (Member) commented Apr 10, 2024

> you’re suggesting augmenting the replay buffer to collect time-varying gammas with each rollout, and then using those gammas from the buffer in the PPO loss function?

Correct. To make it simpler to use (and to make it work with VecEnv, for instance), that would actually be one gamma per timestep, and you would need to use that value every time gamma is used (notably in the GAE computation).
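
In code terms, the change being described is to replace the scalar `self.gamma` with the stored per-step value wherever it appears. Below is a loose sketch of the overridden GAE computation for the `VaryingGammaRolloutBuffer` idea above; it only approximately mirrors SB3's `compute_returns_and_advantage`, so compare it against the exact release being forked.

```python
    def compute_returns_and_advantage(self, last_values, dones) -> None:
        # Per-step-gamma variant of SB3's GAE(lambda) computation (sketch).
        last_values = last_values.clone().cpu().numpy().flatten()
        last_gae_lam = 0.0
        for step in reversed(range(self.buffer_size)):
            if step == self.buffer_size - 1:
                next_non_terminal = 1.0 - dones
                next_values = last_values
            else:
                next_non_terminal = 1.0 - self.episode_starts[step + 1]
                next_values = self.values[step + 1]
            # Per-step discount instead of the scalar self.gamma.
            gamma_t = self.gammas[step]
            delta = self.rewards[step] + gamma_t * next_values * next_non_terminal - self.values[step]
            last_gae_lam = delta + gamma_t * self.gae_lambda * next_non_terminal * last_gae_lam
            self.advantages[step] = last_gae_lam
        self.returns = self.advantages + self.values
```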

rariss (Author) commented Apr 10, 2024

Got it. Yes, absolutely: that gamma would be the discount factor for a “step”, i.e. a discrete decision epoch.

rariss (Author) commented Apr 10, 2024

If I augment the replay buffer, does all of its content get passed to the learn function? Meaning, I don’t need to modify the inputs to the training update functions; I just need to extract the gammas from the batch of buffered step data and use them in the GAE discounting?

araffin (Member) commented May 10, 2024

> Meaning, I don’t need to modify the inputs to the training update functions;

You need to modify the named tuple that represents a transition and update the GAE computation accordingly, yes.
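
A hypothetical shape of that change, modeled on the `RolloutBufferSamples` named tuple in `stable_baselines3.common.type_aliases` (the field list may differ between releases); the buffer's `_get_samples()` would then also need to append `self.gammas[batch_inds]` when building the tuple.

```python
from typing import NamedTuple

import torch as th


class VaryingGammaRolloutBufferSamples(NamedTuple):
    observations: th.Tensor
    actions: th.Tensor
    old_values: th.Tensor
    old_log_prob: th.Tensor
    advantages: th.Tensor
    returns: th.Tensor
    gammas: th.Tensor  # new field: one discount factor per sampled transition
```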
