
Questions about performance improvement in Open LLM leaderboard #21

Open

minstar opened this issue Mar 7, 2024 · 3 comments

@minstar

minstar commented Mar 7, 2024

Hi,
First of all, thank you for sharing your wonderful work!

I was looking for efficient ways of mining instructions for instruction-tuning LLMs.
While reading the manuscript and investigating the open-sourced 6k & 10k datasets you provide,
I could not intuitively understand why the SFT (6k) + DPO (10k) training method improves performance on
multiple-choice question answering tasks such as ARC-Challenge and MMLU.

In the datasets, the instances consist of conversations between humans and GPT that contain no obvious clues about how to solve multiple-choice QA problems.

Do you have any idea why it worked?

@VPeterV
Collaborator

VPeterV commented Mar 21, 2024

Hi, thanks for your interest!

This question is indeed interesting. We have a couple of speculations that might shed some light:

  1. Our top-performing model, trained with SFT (6k) followed by DPO (10k), originates from an intermediate SFT checkpoint, which serves as the basis for further DPO training. Our hypothesis is that an overly optimized SFT stage might impair the inherent capabilities of the LLM. Therefore, using a sub-optimal SFT checkpoint, followed by DPO training, which is specifically designed for alignment, appears to improve both academic benchmarks such as the Open LLM Leaderboard and alignment capabilities (a sketch of the DPO objective is given after this list). A similar finding is reported for Zephyr [1, 2].

  2. We observed that some questions the model answers incorrectly can be rectified through multiple sampling attempts, using strategies like majority voting or re-ranking (a toy voting sketch is also given after this list). This indicates that the model has the potential to answer correctly but struggles to do so consistently. Reinforcement learning techniques such as DPO can adjust the model's output preferences, increasing the likelihood of producing the correct answer in a single attempt [3, Section 5].
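For concreteness (this is not code from this repo, just the standard formulation), the DPO objective used in the second stage is a log-sigmoid of the reward margin between the chosen and the rejected response, measured relative to the frozen SFT reference model. A minimal PyTorch sketch, assuming the per-sequence log-probabilities have already been computed:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective (Rafailov et al., 2023).

    Each argument is a tensor of summed per-sequence log-probabilities.
    The loss pushes the policy to prefer the chosen response over the
    rejected one, relative to the frozen SFT reference model.
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (chosen_margin - rejected_margin)), averaged over the batch
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

And a toy illustration of the majority-voting idea from point 2; the sampled answer letters here are made up for illustration:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among several sampled completions
    (ties break toward the first answer seen)."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical example: five sampled answers to one multiple-choice question.
print(majority_vote(["B", "C", "B", "B", "D"]))  # -> B
```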

References

@minstar
Author

minstar commented Mar 21, 2024

Thanks for sharing your insights and thoughts on my question!

I also agree with the second point, that the model has the potential to answer correctly but does not do so consistently.
However, I still have a hard time interpreting what DPO could enhance through preference alignment.

@VPeterV
Collaborator

VPeterV commented Mar 21, 2024

A potential explanation might be the presence of STEM-related samples within the UltraFeedback dataset.
