PPO / Reinforce Trainers #1540
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks so much for this contribution. I've been hoping to experiment with REINFORCE on transformers for a while now, but didn't have the time to roll my own implementation.
This is a great foundation in terms of functionality. I'll be playing around with it soon.
I think we should reduce the repetition, and use inheritance of existing classes so we can take advantage of the great infrastructure built out by huggingface/transformers and huggingface/trl.
Happy to help if you're interested in collaborating, let me know.
Epic work on adding these new RL trainers @vwxyzjn ! I've left some high level feedback on the RLOO trainer for now and will do a more fine-grained review when we've iterated a bit on the design.
Overall looks super clean
Hi @lapp0, thanks for the review! I will look into these comments more closely. I started running some experiments and noticed the KL of RLOO was orders of magnitude higher than that of the new PPO trainer. I'm not exactly sure of the reason but will investigate further. The PPO / vanilla PG trainer actually seems quite stable now, with the RLHF reward going up and the model getting good scores and reasonable completions. There were some implementation details I found particularly helpful, such as truncating at the EOS token. Right now I am a bit focused on a Zephyr PPO / vanilla PG recipe for these couple of days, and will look into RLOO right after.
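The EOS-truncation detail mentioned above can be sketched roughly like this (a minimal illustration only, not the PR's actual implementation; `truncate_at_eos` is a made-up helper name):

```python
def truncate_at_eos(tokens, eos_id):
    """Drop everything sampled after the first EOS token."""
    if eos_id in tokens:
        return tokens[: tokens.index(eos_id) + 1]
    return tokens

print(truncate_at_eos([5, 8, 2, 9, 2], eos_id=2))  # [5, 8, 2]
```

This keeps the reward model from scoring garbage tokens sampled after the model has already finished its completion.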
I have a WIP fork. The main difference is that it lets `transformers.Trainer` set everything up (batch creation, accelerate / deepspeed, etc.) and instead overrides `training_step`.

https://github.com/lapp0/trl/blob/onpolicy/trl/trainer/rloo_trainer.py

The main behavioral difference is that it generates once per batch and runs for `num_train_epochs`, rather than generating once per update and running for `num_train_epochs * num_updates`. Have you experimented with updating once per batch, and if so, does this harm stability? Is it important that I retain the ability to update once and run multiple epochs based on the model outputs from the start of the generation?
It's possible that generating once per batch instead of per update would improve KL now that you mention it.
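The two generation schedules being compared can be sketched abstractly as follows (a hypothetical illustration with placeholder `generate` / `train_step` functions, not code from either branch):

```python
def generate(batch):
    # stand-in for sampling completions from the policy
    return [f"completion-for-{x}" for x in batch]

def train_step(batch, completions):
    # stand-in for one optimizer step; returns a dummy "loss"
    return len(completions)

batches = [[1, 2], [3, 4]]
num_train_epochs = 2
num_updates = 3

# (a) generate once per update, then reuse those completions for
# num_train_epochs * num_updates gradient steps (the PR's style)
for batch in batches:
    completions = generate(batch)
    for _ in range(num_train_epochs * num_updates):
        train_step(batch, completions)

# (b) generate fresh completions once per batch, one step each,
# looping over the data for num_train_epochs (the fork's Trainer style)
for _ in range(num_train_epochs):
    for batch in batches:
        completions = generate(batch)
        train_step(batch, completions)
```

Schedule (b) keeps the training data closer to on-policy, which is one plausible reason it could reduce KL drift.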
To show it works, per Arash's suggestion, I also ran experiments on TL;DR. Should have more results in https://wandb.ai/costa-huang/huggingface/reports/ppo-rloo-tldr--Vmlldzo3ODUzNDEx tomorrow.
Thanks a lot for this huge work! I left some comments! I think the new classes should also be exposed in TRL's main init. LMK wdyt about my suggestions below 🙏
trl/trainer/ppov2_trainer.py
Outdated
```python
def masked_mean(values, mask, axis=None):
    """Compute the mean of a tensor over masked values."""
    if axis is not None:
        return (values * mask).sum(axis=axis) / mask.sum(axis=axis)
    else:
        return (values * mask).sum() / mask.sum()


def masked_var(values, mask, unbiased=True):
    """Compute the variance of a tensor over masked values."""
    mean = masked_mean(values, mask)
    centered_values = values - mean
    variance = masked_mean(centered_values**2, mask)
    if unbiased:
        mask_sum = mask.sum()
        if mask_sum == 0:
            raise ValueError(
                "The sum of the mask is zero, which can happen when `mini_batch_size=1`; "
                "try increasing the `mini_batch_size` or `gradient_accumulation_steps`"
            )
        # note that if mask_sum == 1, there is a division-by-zero issue;
        # to avoid it you just need to use a larger mini_batch_size
        bessel_correction = mask_sum / (mask_sum - 1)
        variance = variance * bessel_correction
    return variance


def masked_whiten(values, mask, shift_mean=True):
    """Whiten values over masked values."""
    mean, var = masked_mean(values, mask), masked_var(values, mask, False)
    whitened = (values - mean) * torch.rsqrt(var + 1e-8)
    if not shift_mean:
        whitened += mean
    return whitened
```
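As a quick numerical sanity check of the masked statistics above, here is the same math re-implemented on plain Python lists (a toy illustration only; the `_list` names are made up and there is no torch dependency):

```python
def masked_mean_list(values, mask):
    """Mean over positions where mask == 1."""
    return sum(v * m for v, m in zip(values, mask)) / sum(mask)

def masked_var_list(values, mask, unbiased=True):
    """Variance over positions where mask == 1, with optional Bessel correction."""
    mean = masked_mean_list(values, mask)
    var = masked_mean_list([(v - mean) ** 2 for v in values], mask)
    if unbiased:
        n = sum(mask)
        var *= n / (n - 1)  # Bessel's correction, as in the trainer helper
    return var

# mask drops the last value, so the statistics are those of [1, 2, 3]
print(masked_mean_list([1.0, 2.0, 3.0, 4.0], [1, 1, 1, 0]))  # 2.0
print(masked_var_list([1.0, 2.0, 3.0, 4.0], [1, 1, 1, 0]))   # 1.0
```

The unbiased variance of [1, 2, 3] is indeed 1.0, which is what makes the Bessel correction (and the `mask_sum == 1` edge case it introduces) worth the extra branch.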
Those look the same as in:

Line 152 in 3b4c249

```python
def masked_mean(values: torch.Tensor, mask: torch.Tensor, axis: Optional[bool] = None) -> torch.Tensor:
```
trl/trainer/ppov2_trainer.py
Outdated
```python
def get_reward(model, query_responses, tokenizer, context_length):
```
Can you move this method to `trl.core`?
trl/trainer/ppov2_trainer.py
Outdated
```python
def get_reward(model, query_responses, tokenizer, context_length):
    attention_mask = query_responses != tokenizer.pad_token_id
    # position_ids = attention_mask.cumsum(1) - attention_mask.long()  # exclusive cumsum
```
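For context on the commented-out "exclusive cumsum" trick: it produces position ids that stay at 0 across left padding and only start counting at the first real token. A toy plain-Python version (illustrative only, mirroring `attention_mask.cumsum(1) - attention_mask.long()` on a single row):

```python
# attention mask for a left-padded sequence: two pad tokens, then three real ones
mask = [0, 0, 1, 1, 1]

# inclusive cumulative sum of the mask
inclusive = []
total = 0
for m in mask:
    total += m
    inclusive.append(total)

# exclusive cumsum = inclusive cumsum - mask
position_ids = [c - m for c, m in zip(inclusive, mask)]
print(position_ids)  # [0, 0, 0, 1, 2]
```

The real tokens get positions 0, 1, 2 regardless of how much left padding precedes them.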
examples/scripts/minimal/ppo.py
Outdated
Should we move all ppo-related minimal scripts under a new `ppo/` dir and the rloo ones under an `rloo/` dir? What do you think?
Thanks for iterating on this epic PR @vwxyzjn! Overall it's looking quite close to being finished, and I think the main remaining points to address are splitting off the configs into their own modules and seeing if we can hide config variables like `world_size` from the end user.
examples/scripts/minimal/ppo.py
Outdated
```python
parser = HfArgumentParser((PPOConfig, ModelConfig))
config, model_config = parser.parse_args_into_dataclasses()
# remove output_dir if exists
shutil.rmtree(config.output_dir, ignore_errors=True)
```
FYI you can set `overwrite_output_dir` in `PPOConfig` (via `TrainingArguments`)
```diff
@@ -150,7 +150,7 @@ def unwrap_model_for_generation(
     if accelerator.state.deepspeed_plugin is not None and accelerator.state.deepspeed_plugin.zero_stage == 3:
         with deepspeed.zero.GatheredParameters(model.parameters()):
             remove_hooks(model)
-            yield model
+            yield accelerator.unwrap_model(model)
```
I believe models wrapped with the `DeepSpeedEngine` can still generate, so I'm curious why this is needed.
Ah, it causes issues for RLOO, with errors like "DeepSpeedEngine has no attribute `generate`", so we still need to unwrap it.
trl/trainer/ppov2_trainer.py
Outdated
```python
    """Whether to use deepspeed to train the model"""

    # various batch sizes
    world_size: Optional[int] = None
```
I see, but why do we only seem to need this for the RL trainers and not the other ones like `SFTTrainer`? In general, I'd like to avoid exposing this distributed stuff to the user if we can, because it might not be clear whether they should set the value manually or let `accelerate` handle it for them.
Thank you @lewtun @younesbelkada @lapp0 for the review. I have addressed most of the concerns and also added some docs and benchmarks. Let me know if there is anything else needed :D
Great work @vwxyzjn! Really impressed with the vLLM integration along with the other components you've introduced here. I'll be working on a follow-up PR for quantized training using ppo_v2 once Unsloth's numerical stability issue is resolved, and hopefully incorporate a few structural changes as well, so I don't have any further comments on structure right now. Did any of your RLOO runs result in improved benchmarks, or at least improved score metrics? I was able to reproduce improving scores with ppov2 in my refactor of your branch with BnB / peft support, but I never managed to do the same with RLOO. PPOv2 metrics:
@lapp0 Very nice to hear your great results with PPOv2 and peft! I was able to get good results with 1B RLOO on TL;DR summarization. See https://moon-ci-docs.huggingface.co/docs/trl/pr_1540/en/rloo_trainer#benchmark-experiments.
Great work @vwxyzjn, really exciting research and implementation you have put together. Feel free to ping me on any other PRs.
This PR supports the REINFORCE / RLOO trainers from https://arxiv.org/pdf/2402.14740.pdf.

Note that REINFORCE's loss is a special case of PPO: it matches the REINFORCE loss presented in the Cohere paper (where PPO uses the advantage estimate Â, but REINFORCE uses the full RLHF reward R(y, x)).
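This special-case relationship can be checked numerically with a toy finite-difference test on scalars (all function names here are made up for illustration): at the first update the probability ratio is 1, so the clipped PPO surrogate's gradient equals the REINFORCE gradient whenever the advantage equals the reward.

```python
import math

def ppo_loss(logprob, old_logprob, advantage, eps=0.2):
    # clipped PPO surrogate for a single action
    ratio = math.exp(logprob - old_logprob)
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)
    return -min(ratio * advantage, clipped_ratio * advantage)

def reinforce_loss(logprob, reward):
    return -logprob * reward

def grad(f, x, h=1e-6):
    # central finite difference
    return (f(x + h) - f(x - h)) / (2 * h)

# old_logprob == logprob at the first update, so ratio == 1
lp, R = -1.5, 0.7
g_ppo = grad(lambda x: ppo_loss(x, lp, R), lp)
g_reinforce = grad(lambda x: reinforce_loss(x, R), lp)
print(round(g_ppo, 4), round(g_reinforce, 4))  # -0.7 -0.7
```

Both gradients come out to -R, which is exactly the REINFORCE policy-gradient direction.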
We add the following files (among them `ppov2_trainer.py`), so feel free to do a file diff to see the changes (e.g., the diff shows how the RLOO loss is implemented). Two more examples show how the trainers work with dummy reward models.
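For reference, the leave-one-out baseline behind RLOO is simple to state: with k completions sampled per prompt, each sample's advantage is its reward minus the mean reward of the other k - 1 samples. A toy sketch (illustrative only; `rloo_advantages` is a made-up name, not the trainer's actual function):

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for k rewards from the same prompt."""
    k = len(rewards)
    total = sum(rewards)
    # baseline for sample i is the mean of the other k - 1 rewards
    return [r - (total - r) / (k - 1) for r in rewards]

print(rloo_advantages([1.0, 2.0, 3.0]))  # [-1.5, 0.0, 1.5]
```

The baseline is computed from the policy's own samples, so no learned value network is needed, which is what makes RLOO lighter than PPO.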