
Prioritized experience replay #1622

Open: wants to merge 25 commits into master

Conversation

@AlexPasqua (Contributor) commented Jul 23, 2023

Description

Implementation of a prioritized experience replay (PER) buffer for DQN.
Closes #1242
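
For reference, a minimal NumPy sketch of the proportional prioritization and importance-sampling weights described in the PER paper (Schaul et al., 2015); this is an illustration, not the code in this PR:

```python
import numpy as np

def per_probabilities_and_weights(td_errors, alpha=0.6, beta=0.4, eps=1e-6):
    """Proportional prioritization: P(i) = p_i^alpha / sum_k p_k^alpha,
    with importance-sampling weights w_i = (N * P(i))^(-beta), normalized by their max."""
    priorities = np.abs(td_errors) + eps                  # p_i = |delta_i| + eps
    probs = priorities**alpha / np.sum(priorities**alpha)
    weights = (len(td_errors) * probs) ** (-beta)         # corrects the sampling bias
    return probs, weights / weights.max()                 # normalized for stability
```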

Motivation and Context

  • I have raised an issue to propose this change (required for new features and bug fixes)

In accordance with #1242

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)

Checklist

  • I've read the CONTRIBUTION guide (required)
  • I have updated the changelog accordingly (required).
  • My change requires a change to the documentation.
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have updated the documentation accordingly.
  • I have opened an associated PR on the SB3-Contrib repository (if necessary)
  • I have opened an associated PR on the RL-Zoo3 repository (if necessary)
  • I have reformatted the code using make format (required)
  • I have checked the codestyle using make check-codestyle and make lint (required)
  • I have ensured make pytest and make type both pass. (required)
  • I have checked that the documentation builds using make doc (required)

Note: You can run most of the checks using make commit-checks.

Note: we are using a maximum length of 127 characters per line

@AlexPasqua (Contributor, Author)

@araffin could you (or anyone) please have a look at the 2 pytype errors? I don't quite understand how to fix them.

@araffin added the "Maintainers on vacation" label on Aug 10, 2023
@araffin removed the "Maintainers on vacation" label on Sep 4, 2023
@araffin araffin self-requested a review September 4, 2023 08:52
@AlexPasqua AlexPasqua marked this pull request as ready for review September 29, 2023 09:58
@AlexPasqua (Contributor, Author)

Thanks @araffin !
Out of curiosity, may I ask why the switch between torch and numpy for the backend?

@araffin (Member) commented Sep 29, 2023

> Thanks @araffin! Out of curiosity, may I ask why the switch between torch and numpy for the backend?

To be consistent with the rest of the buffers, and because PyTorch is not needed here (no GPU computation is needed).

@AlexPasqua (Contributor, Author) commented Sep 30, 2023

Hello @araffin,
Since you moved the code to "common", I suppose you plan to make it usable in algorithms other than DQN. At this point, wouldn't it be clearer to put the code into common/buffers.py? Let me know and, if so, I will move it there.

AlexPasqua and others added 5 commits September 30, 2023 19:40
@araffin (Member) commented Oct 2, 2023

> At this point, wouldn't it be clearer to put the code into common/buffers.py?

Yes, probably, but the most important thing for now is to test the implementation (performance test, checking that we can reproduce the results from the paper), document it, and add additional tests/docs (for the sum tree, for instance).
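
For readers unfamiliar with the data structure mentioned above: a sum tree stores one priority per leaf and partial sums in the internal nodes, so sampling proportionally to priority and updating a priority are both O(log n). A minimal sketch (not the code from this PR):

```python
import numpy as np

class SumTree:
    """Binary tree whose leaves hold priorities and whose internal nodes hold
    the sum of their children, enabling O(log n) proportional sampling."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)  # internal nodes followed by leaves

    @property
    def total(self) -> float:
        return self.tree[0]  # root = sum of all priorities

    def update(self, leaf_index: int, priority: float) -> None:
        # Write the new priority and propagate the change up to the root
        tree_index = leaf_index + self.capacity - 1
        change = priority - self.tree[tree_index]
        self.tree[tree_index] = priority
        while tree_index > 0:
            tree_index = (tree_index - 1) // 2
            self.tree[tree_index] += change

    def retrieve(self, value: float) -> int:
        # Descend from the root, choosing the child whose cumulative sum contains `value`
        index = 0
        while 2 * index + 1 < len(self.tree):  # stop at a leaf
            left = 2 * index + 1
            if value <= self.tree[left]:
                index = left
            else:
                value -= self.tree[left]
                index = left + 1
        return index - (self.capacity - 1)  # leaf position in [0, capacity)
```

Sampling then amounts to drawing a value uniformly in [0, total) and calling retrieve(value).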

@araffin (Member) commented Oct 4, 2023

> performance test, check we can reproduce the results from the paper

After some initial tests on Breakout following the hyperparameters from the paper, the run neither improved nor worsened DQN performance so far...
I will try other envs (it would be nice if you could help).

@AlexPasqua (Contributor, Author)

> After some initial tests on Breakout following the hyperparameters from the paper, the run neither improved nor worsened DQN performance so far... I will try other envs (it would be nice if you could help).

Thanks for starting to test it!
These days I'm travelling, and also writing a paper after work, but I'll try to squeeze some tests in.

@AlexPasqua (Contributor, Author) commented Nov 2, 2023

@araffin I've also done some initial tests and it looks like PER might lead to slightly faster convergence, for example on CartPole, but nothing super evident, unfortunately.
Next, I'd like to properly reproduce some of the paper's experiments, but computational power could become a bit of an issue for me.

Comment on lines +212 to +225
# Special case when using PrioritizedReplayBuffer (PER)
if isinstance(self.replay_buffer, PrioritizedReplayBuffer):
    # TD error in absolute value
    td_error = th.abs(current_q_values - target_q_values)
    # Weighted Huber loss using importance sampling weights
    loss = (replay_data.weights * th.where(td_error < 1.0, 0.5 * td_error**2, td_error - 0.5)).mean()
    # Update priorities, they will be proportional to the TD error
    assert replay_data.leaf_nodes_indices is not None, "No leaf node indices provided"
    self.replay_buffer.update_priorities(
        replay_data.leaf_nodes_indices, td_error, self._current_progress_remaining
    )
else:
    # Compute Huber loss (less sensitive to outliers)
    loss = F.smooth_l1_loss(current_q_values, target_q_values)
Collaborator:

@AlexPasqua Ideally, we'd like to be able to associate it with all off-policy algos without adaptation, but I don't see a simple way of doing it at this stage.
Also related: we had discussed not modifying DQN: Stable-Baselines-Team/stable-baselines3-contrib#127 (comment)


I'm interested in this PR. Since every algo-specific train method includes a replay_buffer.sample line, couldn't we just additionally add a replay_buffer.update line? The update function could take in the current and target q values whenever a value function is present or maybe even all the local variables. It would do nothing for the vanilla replay buffer. Would this be an acceptable modification?
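
A rough sketch of what that hook could look like (the class layout, attribute, and method names here are simplified stand-ins and hypothetical, not SB3's actual API):

```python
import torch as th

class ReplayBuffer:
    """Simplified stand-in for illustration: the post-sample update hook is a no-op."""

    def update(self, current_q_values: th.Tensor, target_q_values: th.Tensor) -> None:
        pass

class PrioritizedReplayBuffer(ReplayBuffer):
    """Simplified stand-in: refresh the priorities of the batch that was sampled last."""

    def __init__(self, buffer_size: int):
        self.priorities = th.ones(buffer_size)
        self.last_sampled_indices = th.zeros(0, dtype=th.long)  # filled in by sample()

    def update(self, current_q_values: th.Tensor, target_q_values: th.Tensor) -> None:
        # New priorities proportional to the absolute TD error of the sampled batch
        td_error = th.abs(current_q_values - target_q_values).detach().flatten()
        self.priorities[self.last_sampled_indices] = td_error

# In each algorithm's train() loop, right after the Q-values are computed:
#     self.replay_buffer.update(current_q_values, target_q_values)
```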

Collaborator:

Thanks for your comment!
How do you handle the loss in your proposal?


If we want this to work for general off-policy algorithms, we could update the ReplayBufferSample-like classes to additionally include an importance_sampling_weight attribute, which would be updated from the replay_buffer.update method.

Then I see two ways to handle the loss under this interface:

  1. Estimate the TD error from the loss, for example:

     losses = loss_fn(current_q_values, target_q_values, reduction="none")
     # e.g. if the loss is L2, the TD error is basically th.sqrt(losses); if the loss is L1, td_error = losses
     td_error = importance_sampling_weight * function_to_approx_td_error(losses)
     loss = losses.mean()

     The obvious downside is that this requires hand-engineering for the different types of loss functions or priority metrics.

  2. Make any value-based train method "TD-error centric", in the sense that we always compute td_error = importance_sampling_weight * th.abs(current_q_values - target_q_values) first, and then the loss as loss = loss_fn(td_error). The downside of this approach is that we can't use the PyTorch API for computing the loss and would have to write those functions ourselves.

Either approach requires computing a td_error variable, which unfortunately requires somewhat intrusive code changes. What do you think?
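
As a sketch of the second, TD-error-centric option (helper and variable names are hypothetical, not code from this PR or from SB3):

```python
import torch as th

def weighted_huber_loss(td_error: th.Tensor, delta: float = 1.0) -> th.Tensor:
    # Huber loss written directly in terms of the (already weighted) TD error
    abs_err = th.abs(td_error)
    quadratic = 0.5 * abs_err**2
    linear = delta * (abs_err - 0.5 * delta)
    return th.where(abs_err < delta, quadratic, linear).mean()

# Inside a value-based train() step (illustrative names):
#     td_error = importance_sampling_weight * (current_q_values - target_q_values)
#     loss = weighted_huber_loss(td_error)
#     replay_buffer.update_priorities(indices, td_error.abs().detach())
```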

@araffin (Member) commented May 6, 2024

Maybe to make things clearer: my plan is not to have PER for all algorithms, mainly for two reasons:

  1. Keep the code concise (in fact, I would like to have RAINBOW and keep vanilla DQN, see [Feature Request] RAINBOW #622)
  2. I don't think it works for entropy-RL algorithms (SAC and derivatives), so it would be limited to DQN/QR-DQN and TD3

If users really want PER in other algos, they can take inspiration from a reference implementation in SB3 and integrate it themselves (the same way we don't provide maskable + recurrent PPO at the same time).

Member:

"just" yes, I would be happy to receive such PR =)
The main thing is to benchmark the implementation and reproduce the published results.
This PR is also still open because I was not satisfied with the results of DQN + PER (I couldn't see a significant difference with respect to DQN).

Member:

One thing I had in mind was to implement CNN support for SBX (https://github.com/araffin/sbx) in order to iterate faster and check the PER, but I have not had time to do so until now...


Why don't we implement the toy environment from Figure 1 of https://arxiv.org/pdf/1511.05952 as the PER benchmark? It would be a simpler initial check for correctness than the Atari environments.
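
For reference, a minimal Gymnasium-style sketch of that toy environment ("Blind Cliffwalk"): the agent has to take the 'right' action n times in a row to get a reward of 1, and any 'wrong' action terminates the episode with reward 0. Which action counts as 'right', and other details, are assumptions made for this sketch:

```python
import gymnasium as gym
from gymnasium import spaces

class BlindCliffwalk(gym.Env):
    """Toy chain environment in the spirit of Figure 1 of the PER paper.
    Action 1 is assumed to be the 'right' action in every state."""

    def __init__(self, n_states: int = 16):
        self.n_states = n_states
        self.observation_space = spaces.Discrete(n_states)
        self.action_space = spaces.Discrete(2)
        self.state = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.state = 0
        return self.state, {}

    def step(self, action):
        if action != 1:
            # 'Wrong' action: episode ends with no reward
            return self.state, 0.0, True, False, {}
        self.state += 1
        terminated = self.state >= self.n_states
        reward = 1.0 if terminated else 0.0
        obs = min(self.state, self.n_states - 1)  # keep the observation in bounds
        return obs, reward, terminated, False, {}
```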

@araffin (Member) commented May 6, 2024

The toy environment can be a start for fast iteration and debugging, but what we learned in the past is that subtle bugs only show up when doing more complex tasks (see #48 and #47, where we found bugs such as the PyTorch and TF RMSProp implementations not being the same).


I see, I will definitely work towards it!

@richardjozsa

Just a comment: I've tested this implementation with QR-DQN using a VecEnv with multiple environments, but it fails because of the missing part.

But good job to start the work on it! I hope it will be merged soon! 👍

@araffin mentioned this pull request on May 24, 2024
Successfully merging this pull request may close these issues: Prioritized Experience Replay for DQN

6 participants