
[enhancement] Implement IRPO training custom loss #1611

Open
TheGhoul21 opened this issue May 1, 2024 · 2 comments

@TheGhoul21

I was wondering if I could try to implement the custom loss function from the paper Iterative Reasoning Preference Optimization.

[Screenshot from the paper: the combined loss, a length-normalized NLL term on the winning sequence plus the DPO loss weighted by a factor alpha]

Basically it looks like the DPO loss is weighted by some factor alpha (the value suggested in the paper is 1.0).

The first term is just the NLL of the winning sequence under the policy model, scaled by the length of the sequence itself. The paper talks about CoT and labels, but that could probably be left to the developer implementing it, since in the loss function the two always appear concatenated.
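
For reference, here is a minimal PyTorch sketch of the objective as I read it (the function and argument names are mine, not from the paper or trl; the log-probs are assumed to be summed per sequence):

```python
import torch.nn.functional as F

def irpo_loss(policy_chosen_logps, policy_rejected_logps,
              reference_chosen_logps, reference_rejected_logps,
              chosen_lengths, beta=0.1, alpha=1.0):
    # Standard DPO term on per-sequence (summed) log-probs,
    # policy model vs. frozen reference model.
    logits = (policy_chosen_logps - reference_chosen_logps) \
        - (policy_rejected_logps - reference_rejected_logps)
    dpo_term = -F.logsigmoid(beta * logits)

    # NLL of the winning sequence under the policy,
    # scaled (divided) by the sequence length.
    nll_term = -policy_chosen_logps / chosen_lengths

    # As described above: the DPO loss enters weighted by alpha.
    return (nll_term + alpha * dpo_term).mean()
```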

What I was thinking is to extend dpo_trainer (since it already does most of the heavy lifting) and add one more hyperparameter, alpha, defaulted to 1.0.

Then compute_loss is edited to add the first term.
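
Roughly something like this (just a sketch: `IRPODPOTrainer` and `alpha` are my names, and the batch keys and `compute_loss` return values may differ between trl versions, so treat it as pseudocode for where the extra term goes):

```python
import torch
from trl import DPOTrainer


class IRPODPOTrainer(DPOTrainer):
    def __init__(self, *args, alpha: float = 1.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.alpha = alpha  # weight on the DPO term, default 1.0 as in the paper

    def compute_loss(self, model, inputs, return_outputs=False):
        # DPO loss (and metrics) from the stock trainer.
        dpo_loss, metrics = super().compute_loss(model, inputs, return_outputs=True)

        # Length-normalized NLL of the winning (chosen) sequence under the policy.
        # Key names ("chosen_input_ids", "chosen_attention_mask", "chosen_labels")
        # assume the standard DPO data collator; adjust if your batch differs.
        outputs = model(
            input_ids=inputs["chosen_input_ids"],
            attention_mask=inputs["chosen_attention_mask"],
        )
        logits = outputs.logits[:, :-1, :]
        labels = inputs["chosen_labels"][:, 1:].clone()
        mask = labels != -100          # ignore prompt / padding positions
        labels[~mask] = 0              # any valid index; masked out below
        token_logps = torch.gather(
            logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)
        ).squeeze(2)
        nll = -(token_logps * mask).sum(-1) / mask.sum(-1)

        loss = nll.mean() + self.alpha * dpo_loss
        return (loss, metrics) if return_outputs else loss
```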

@TheGhoul21
Author

Turns out the paper was corrected in V2 and now it's clearly an NLL term.
I did a custom implementation based on DPO and fine-tuned Mistral on a different dataset, obtaining similarly good results, so I'll share my implementation in the next few days.

@younesbelkada
Collaborator

Very nice, thanks @TheGhoul21! Let @kashif and me know when your implementation is ready.
