[enhancement] Implement IRPO training custom loss #1611
I was wondering if I could try implementing the custom loss function from the paper
Iterative Reasoning Preference Optimization
Basically it looks like the DPO loss is weighted by some factor alpha (the paper suggests a value of 1.0), and the first term is just the NLL of the winning sequence under the policy model, scaled by the length of the sequence itself.
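Concretely, I read the combined objective as something like this (my notation, not the paper's; y_w is the winning sequence, |y_w| its token length; with the suggested alpha = 1.0 it makes no difference which of the two terms alpha multiplies):

```math
\mathcal{L}(\theta) \;=\; -\frac{\log \pi_\theta(y_w \mid x)}{|y_w|} \;+\; \alpha \, \mathcal{L}_{\mathrm{DPO}}(x,\, y_w,\, y_l)
```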
The paper talks about CoT and labels, but probably this could be left to the developer implementing it, since in the loss function those terms always appear concatenated.
What I was thinking is to extend dpo_trainer (because it already does most of the work, I guess...) and add one more hyperparameter, alpha, defaulting to 1.0.
Then compute_loss is edited to add the first term, something like the sketch below.
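A minimal sketch of the combined loss, assuming the per-sequence summed log-probs have already been computed the way DPOTrainer's forward pass produces them; the function name irpo_loss and the chosen_lengths argument are mine, not TRL's API:

```python
import torch
import torch.nn.functional as F

def irpo_loss(
    policy_chosen_logps: torch.Tensor,    # (batch,) summed log-probs of winning sequences
    policy_rejected_logps: torch.Tensor,  # (batch,) summed log-probs of losing sequences
    ref_chosen_logps: torch.Tensor,       # (batch,) same, under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    chosen_lengths: torch.Tensor,         # (batch,) token counts of the winning sequences
    beta: float = 0.1,
    alpha: float = 1.0,
) -> torch.Tensor:
    # Standard DPO term: negative log-sigmoid of the scaled reward margin.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    dpo_term = -F.logsigmoid(beta * (pi_logratios - ref_logratios))

    # NLL of the winning sequence under the policy, scaled by its length.
    nll_term = -policy_chosen_logps / chosen_lengths

    # Combined objective; alpha weights the DPO term as described above
    # (with the suggested alpha = 1.0 the placement makes no difference).
    return (nll_term + alpha * dpo_term).mean()
```

In a DPOTrainer subclass this would slot in wherever the vanilla DPO loss is computed today, with alpha surfaced as the one extra hyperparameter.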
Comments

Turns out the paper was corrected in V2 and now it's obviously an NLL.

Very nice, thanks @TheGhoul21! Let us know with @kashif when your implementation is ready.