
Added Reward Backpropagation Support #1585

Open
wants to merge 13 commits into base: main

Conversation


@mihirp1998 commented Apr 25, 2024

Hi,

I have added support for AlignProp (https://align-prop.github.io/) for finetuning Stable Diffusion models using reward gradients.

AlignProp directly backpropagates gradients from the reward model into the diffusion weights, and is thus about 25x more sample- and compute-efficient than policy-gradient-based methods like DDPO.

The current implementation seems to train effectively, in about an hour on a single A100, using the Aesthetic reward model. Please find attached the loss and reward curves plus some qualitative results after training.

huggingface/diffusers#7312

Difference between DDPO and AlignProp:

  • DDPO uses PPO, a policy-gradient method, for aligning diffusion models. AlignProp doesn't use policy gradients; instead, it directly backpropagates gradients from the reward function through the diffusion denoising process to maximize reward.

  • AlignProp only works when the reward function is differentiable; DDPO, on the other hand, can handle non-differentiable reward functions, as it never backpropagates gradients through the reward function.

  • Because AlignProp takes advantage of the differentiability of the reward function when backpropagating gradients, it is significantly more sample efficient than DDPO.

  • The loss function in AlignProp is simply the negative of the reward value output by the reward function, while in DDPO it is the PPO loss (see the sketch after this list).

  • Because the reward function operates on RGB images, AlignProp has to run the full denoising chain from noise to RGB during training, while DDPO can instead sample random denoising timesteps, similar to diffusion training.

  • DDPO and AlignProp both use LoRA and gradient checkpointing.
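
To make the loss and the full-chain backpropagation concrete, here is a minimal toy sketch of the idea. All of the modules and numbers below are stand-ins made up for illustration (a real run would use the Stable Diffusion UNet, a scheduler, and a frozen aesthetic scorer); this is not the TRL implementation:

```python
import torch
import torch.nn as nn

# Toy stand-ins, chosen only so the example runs end to end.
denoiser = nn.Linear(16, 16)        # stands in for the UNet noise predictor (trainable)
reward_model = nn.Linear(16, 1)     # stands in for the aesthetic scorer (frozen)
reward_model.requires_grad_(False)  # frozen weights; gradients still flow *through* it
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

num_steps = 10        # full denoising chain, run with gradients enabled
truncate_before = 8   # truncated backprop: detach the graph at early steps

for _ in range(100):
    x = torch.randn(4, 16)                  # start from pure noise
    for i in range(num_steps):
        noise_pred = denoiser(x)
        if i < truncate_before:             # keep memory/compute bounded
            noise_pred = noise_pred.detach()
        x = x - 0.1 * noise_pred            # toy "denoising" update

    reward = reward_model(x).mean()         # reward evaluated on the final sample
    loss = -reward                          # AlignProp-style loss: maximize the reward
    optimizer.zero_grad()
    loss.backward()                         # gradients flow reward -> sample -> denoiser
    optimizer.step()
```

DDPO, by contrast, never differentiates through the reward model; it only uses the scalar reward to weight a policy-gradient (PPO) update.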

CC: @parthos86 @sayakpaul @lvwerra @younesbelkada

[W&B charts: reward and loss curves during training]

Image Generations post training:
[Screenshot: sample image generations after training]

@mihirp1998 changed the title from "Added AlignProp Support" to "Added Reward Backpropagation Support" on May 1, 2024
@younesbelkada (Collaborator) left a comment

Thanks a lot @mihirp1998 for your hard work! In principle this looks good!
I just have a few questions with respect to the differences between this method and DDPO. Could you clearly highlight, either in the documentation or in this PR, what the major differences between DDPO and this algorithm are? 🙏
I would also like to have a review from @sayakpaul if possible. What do you think of Reward Backpropagation?
Thanks!

Comment on lines 5 to 9
| Before | After finetuning |
| --- | --- |
| <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/pre_squirrel.png"/></div> | <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/post_squirrel.png"/></div> |
| <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/pre_crab.png"/></div> | <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/post_crab.png"/></div> |
| <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/pre_starfish.png"/></div> | <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/post_starfish.png"/></div> |
Collaborator

These images are the ones generated from DDPO, no?

Author

Yes, I wanted to update them, although I wasn't sure how to do it, as they link to a Hugging Face internal repository: https://huggingface.co/datasets/trl-internal-testing/

If you can guide me on how to do it, I can update them.

Member

You can open a PR to the https://huggingface.co/datasets/trl-internal-testing/ repository adding the resultant images you want.

library. A reason for stating this is that getting started requires a bit of familiarity with the `diffusers` library concepts, mainly two of them - pipelines and schedulers.
Right out of the box, the `diffusers` library provides neither a `Pipeline` nor a `Scheduler` instance that is suitable for finetuning with reinforcement learning. Some adjustments need to be made.

There is a pipeline interface that is provided by this library that is required to be implemented to be used with the `DDPOTrainer`, which is the main machinery for fine-tuning Stable Diffusion with reinforcement learning. **Note: Only the StableDiffusion architecture is supported at this point.**
Collaborator

Here it references the DDPO trainer.

Author

Thanks for pointing this out. I have fixed it.

@mihirp1998 (Author)

> Thanks a lot @mihirp1998 for your hard work! In principle this looks good! I just have a few questions with respect to the differences between this method and DDPO. Could you clearly highlight, either in the documentation or in this PR, what the major differences between DDPO and this algorithm are? 🙏 I would also like to have a review from @sayakpaul if possible. What do you think of Reward Backpropagation? Thanks!

I have added the differences to the pull request; let me know if you have any doubts or think something is missing.

@@ -0,0 +1,117 @@
# Aligning Text-to-Image Diffusion Models with Reward Backpropagation

## The why
Member

As a reader, I don't understand how the following table justifies the name of this section. Would you mind elaborating?

Author

Thanks, I added a better why statement.

| <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/pre_starfish.png"/></div> | <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/post_starfish.png"/></div> |


## Getting started with Stable Diffusion finetuning with reinforcement learning
Member

I don't think this is needed. We should strive to keep the API documentation lean and precise.

Author

Yes, I removed it.

```python
import torch
from trl import DefaultDDPOStableDiffusionPipeline
```
@sayakpaul (Member) commented May 29, 2024

Why do we have to use a non-`diffusers` pipeline here? Does `DiffusionPipeline` from `diffusers` not work here?

Author

Yes indeed, I changed it to `StableDiffusionPipeline` from `diffusers`.
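
For what it's worth, inference with the stock `diffusers` pipeline would then look roughly like this. This is only a sketch; the checkpoint id below is a placeholder, not the one used in this PR:

```python
import torch
from diffusers import StableDiffusionPipeline

# Placeholder checkpoint id; substitute the finetuned model repo.
pipeline = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipeline("a photo of a squirrel", num_inference_steps=50).images[0]
image.save("squirrel.png")
```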

Comment on lines 97 to 104
pipeline = DefaultDDPOStableDiffusionPipeline("metric-space/alignprop-finetuned-sd-model")

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# memory optimization
pipeline.vae.to(device, torch.float16)
pipeline.text_encoder.to(device, torch.float16)
pipeline.unet.to(device, torch.float16)
Member

These LoCs could be reduced if we do:

pipeline = DefaultDDPOStableDiffusionPipeline("metric-space/alignprop-finetuned-sd-model", torch_dtype=torch.float16)
pipeline = pipeline.to("cuda")

Member

Additionally, https://huggingface.co/metric-space/alignprop-finetuned-sd-model is not available. Let's make sure we're using the right checkpoint ids here.

Author

Yes, I reduced it and fixed the checkpoint ids.

Comment on lines 62 to 84
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(768, 1024),
            nn.Dropout(0.2),
            nn.Linear(1024, 128),
            nn.Dropout(0.2),
            nn.Linear(128, 64),
            nn.Dropout(0.1),
            nn.Linear(64, 16),
            nn.Linear(16, 1),
        )

    def forward(self, embed):
        return self.layers(embed)


class AestheticScorer(torch.nn.Module):
    """
    This model attempts to predict the aesthetic score of an image. The aesthetic score
    is a numerical approximation of how much a specific image is liked by humans on average.
    This is from https://github.com/christophschuhmann/improved-aesthetic-predictor
Member

Why are we copy-pasting these modules from the DDPO script?

@younesbelkada would it make sense to have a separate module for these (auxiliary_modules, perhaps)?

Author

They are not exactly copy-pasted, as DDPO had clamp and no_grad operations within them, which were preventing gradients from backpropagating.

Anyhow, I still transferred the above reward function code from alignprop.py to trl/models/auxiliary_modules.py, as you suggested.
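
To make the gradient point concrete, here is a toy illustration (not the code in this PR) of a reward head whose forward pass is not wrapped in `torch.no_grad()`, so the scalar reward stays differentiable with respect to its input:

```python
import torch
import torch.nn as nn

class ToyScorer(nn.Module):
    """Stand-in for an aesthetic-style reward head; any differentiable module works."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(768, 1)

    def forward(self, embed):
        # No torch.no_grad() wrapper here, so gradients can flow from the
        # scalar reward back into whatever produced `embed`.
        return self.head(embed)

scorer = ToyScorer()
embed = torch.randn(4, 768, requires_grad=True)  # pretend these are image embeddings
reward = scorer(embed).mean()
(-reward).backward()                             # AlignProp-style loss = -reward
print(embed.grad.shape)                          # torch.Size([4, 768])
```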

Comment on lines 715 to 722
if truncated_backprop:
    if truncated_backprop_rand:
        rand_timestep = random.randint(truncated_rand_backprop_minmax[0], truncated_rand_backprop_minmax[1])
        if i < rand_timestep:
            noise_pred = noise_pred.detach()
    else:
        if i < truncated_backprop_timestep:
            noise_pred = noise_pred.detach()
Member

We would want to supplement this code block with comments.

Author

Yes, I added comments.
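
For readers of the thread, here is roughly what the commented logic looks like, written as a standalone helper for illustration. The helper name and defaults are made up here, not the exact code in the PR:

```python
import random

def maybe_detach(noise_pred, i, truncated_backprop=True, truncated_backprop_rand=True,
                 truncated_rand_backprop_minmax=(0, 49), truncated_backprop_timestep=49):
    """Truncated backpropagation through the denoising chain.

    Detaching `noise_pred` at early timesteps cuts the computation graph there,
    so the reward gradient only flows through the later denoising steps, which
    keeps memory and compute bounded.
    """
    if truncated_backprop:
        if truncated_backprop_rand:
            # Randomized truncation: sample a fresh cutoff step on every call.
            rand_timestep = random.randint(*truncated_rand_backprop_minmax)
            if i < rand_timestep:
                noise_pred = noise_pred.detach()
        else:
            # Fixed truncation: every step before a fixed timestep is detached.
            if i < truncated_backprop_timestep:
                noise_pred = noise_pred.detach()
    return noise_pred
```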

@@ -527,6 +527,243 @@ def pipeline_step(

return DDPOPipelineOutput(image, all_latents, all_log_probs)

def pipeline_step_with_grad(
self,
@sayakpaul (Member) commented May 29, 2024

`self` could be replaced with `pipeline`, as that is what we're passing down the line, IIUC?

Author

Yes, I changed it to `pipeline`.


# {model_name}

This is a pipeline that finetunes a diffusion model with reward gradients. The model can be used for image generation conditioned on text.
Member

Not sure what the norm is within the library, but I think it could be nice to also include a link to the AlignProp paper here.

Author

Yes, I added it.

@sayakpaul (Member) left a comment

Thanks for your contributions. I left a couple of comments.

I would love to see some concrete comparisons to DDPO (training time, reward dynamics, convergence of the validation samples, etc.).

@mihirp1998 (Author) commented Jun 2, 2024

> Thanks for your contributions. I left a couple of comments.
>
> I would love to see some concrete comparisons to DDPO (training time, reward dynamics, convergence of the validation samples, etc.).

I have made concrete comparisons with DDPO here. I ran the default DDPO code in TRL with batch size 128, and AlignProp uses the same batch size. As can be seen, AlignProp is significantly more sample efficient; here I trained both models for a few hours. The x-axis is epochs and the y-axis is the reward achieved.

[W&B chart: reward vs. epochs, AlignProp vs. DDPO]

Although I ran the above experiments for a few hours, AlignProp only takes about 30 minutes to converge to a good solution, so I early-stopped it at the 8th training epoch. Below is the comparison with DDPO after training both models for 30 minutes. Here the x-axis is training time and the y-axis is the reward achieved.

[W&B chart: reward vs. training time, AlignProp vs. DDPO]

Both curves are similar to those in the AlignProp paper.

The above curves used the same set of prompts during training and testing. In the curve below I show AlignProp results on unseen prompts. As can be seen, there is not much gap in results between seen and unseen prompts. The dotted line is the unseen prompts, while the solid line is the seen prompts.

[W&B chart: reward on seen (solid) vs. unseen (dotted) prompts]

Finally, here are some generated images from AlignProp for seen and unseen animals after training.

[Generated images: llama, lion, squirrel]

CC: @sayakpaul @younesbelkada
