[enhancement] Implement IRPO training custom loss #1611
I was wondering if I could try implementing the custom loss function from the paper
Iterative Reasoning Preference Optimization
Basically it looks like the DPO loss is weighted by some factor alpha (the paper suggests a value of 1.0), and the first term is just the NLL of the winning sequence under the policy model, scaled by the length of the sequence itself.
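Concretely, I read the combined objective as something like this (my notation, not the paper's; y_w is the winning sequence, |y_w| its token length; with the suggested alpha = 1.0 it makes no difference which of the two terms alpha multiplies):

```math
\mathcal{L}(\theta) \;=\; -\frac{\log \pi_\theta(y_w \mid x)}{|y_w|} \;+\; \alpha \, \mathcal{L}_{\mathrm{DPO}}(x,\, y_w,\, y_l)
```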
The paper talks about CoT and labels, but probably this could be left to the developer implementing it, since in the loss function those terms always appear concatenated.
What I was thinking is to extend dpo_trainer (because it already does most of the work, I guess...) and add one more hyperparameter, alpha, defaulting to 1.0.
Then compute_loss is edited to add the first term, something like the sketch below.
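A minimal sketch of the combined loss, assuming the per-sequence summed log-probs have already been computed the way DPOTrainer's forward pass produces them; the function name irpo_loss and the chosen_lengths argument are mine, not TRL's API:

```python
import torch
import torch.nn.functional as F

def irpo_loss(
    policy_chosen_logps: torch.Tensor,    # (batch,) summed log-probs of winning sequences
    policy_rejected_logps: torch.Tensor,  # (batch,) summed log-probs of losing sequences
    ref_chosen_logps: torch.Tensor,       # (batch,) same, under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    chosen_lengths: torch.Tensor,         # (batch,) token counts of the winning sequences
    beta: float = 0.1,
    alpha: float = 1.0,
) -> torch.Tensor:
    # Standard DPO term: negative log-sigmoid of the scaled reward margin.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    dpo_term = -F.logsigmoid(beta * (pi_logratios - ref_logratios))

    # NLL of the winning sequence under the policy, scaled by its length.
    nll_term = -policy_chosen_logps / chosen_lengths

    # Combined objective; alpha weights the DPO term as described above
    # (with the suggested alpha = 1.0 the placement makes no difference).
    return (nll_term + alpha * dpo_term).mean()
```

In a DPOTrainer subclass this would slot in wherever the vanilla DPO loss is computed today, with alpha surfaced as the one extra hyperparameter.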
Comments

Turns out the paper was corrected in V2 and now it's obviously an NLL.

Very nice, thanks @TheGhoul21! Let us know with @kashif when your implementation is ready.