Bad performance on PrOntoQA benchmark #28
PrOntoQA is a question-answering dataset that generates examples with chains of thought describing the reasoning required to answer each question correctly. The sentences in the examples are syntactically simple and amenable to semantic parsing, so the dataset can be used to formally analyze the chain-of-thought predictions of large language models.

I have tested DBRX-Base on the GSM8K, AQuA, and StrategyQA datasets using 4-shot CoT prompting, and its performance is satisfying compared to other models (GPT-4, Claude Opus, Llama 70B, etc.).

Nevertheless, when I test the model on PrOntoQA, its performance is much less satisfying: dbrx-instruct achieves only 24.2% accuracy, and dbrx-base is worse. Although some of the dbrx-base results may be affected by output-processing errors, dbrx-instruct has no problem with endless generation yet still fails to achieve good performance.

Therefore, I would like to know whether there is an official test result on PrOntoQA that others can use as a reference.

Thanks!

Comments

Hello @huskydoge, we have not tried PrOntoQA yet, but will let you know if we do!
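For context, a minimal sketch of the kind of evaluation described above: extract the final True/False verdict from a chain-of-thought completion and score accuracy against gold labels. The helper names and the stub "model" below are hypothetical illustrations, not the poster's actual harness; real PrOntoQA prompts and labels differ.

```python
import re

def extract_answer(completion: str):
    """Pull the last True/False verdict mentioned in a CoT completion."""
    matches = re.findall(r"\b(true|false)\b", completion.lower())
    return matches[-1] if matches else None

def accuracy(examples, generate):
    """examples: list of (prompt, gold_label) pairs; generate: model callable."""
    correct = 0
    for prompt, gold in examples:
        pred = extract_answer(generate(prompt))
        correct += int(pred == gold.lower())
    return correct / len(examples)

# Toy usage with a stub model that always concludes "True":
examples = [("Is Rex a mammal?", "True"), ("Is Rex cold-blooded?", "False")]
stub = lambda p: "Rex is a dog. Dogs are mammals. Therefore the answer is True."
print(accuracy(examples, stub))  # 0.5 — first example right, second wrong
```

Taking the *last* True/False token is a common heuristic for CoT outputs, since intermediate reasoning steps often mention both labels before the final verdict; fragile parsing here is one plausible source of the output-processing errors mentioned for dbrx-base.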