Bad performance on PrOntoQA benchmark #28
PrOntoQA is a question-answering dataset that generates examples with chains of thought describing the reasoning required to answer each question correctly. The sentences in the examples are syntactically simple and amenable to semantic parsing, so the dataset can be used to formally analyze the chain-of-thought predictions of large language models.

I have tested DBRX-Base on the GSM8K, AQuA, and StrategyQA datasets using 4-shot CoT prompting, and its performance is satisfying compared to other models (GPT-4, Claude Opus, Llama 70B, etc.).

Nevertheless, when I test the model on PrOntoQA, its performance is much less satisfying: dbrx-instruct achieves only 24.2% accuracy, and dbrx-base is worse. Although some of the dbrx-base results may be affected by output-processing errors, dbrx-instruct has no problem with endless generation yet still fails to achieve good performance.

Therefore, I would like to know whether there is an official test result on PrOntoQA that others can use as a reference.

Thanks!

Comments

Hello @huskydoge, we have not tried PrOntoQA yet, but will let you know if we do!
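For context, a minimal sketch of the kind of evaluation described above: extract the final True/False verdict from a chain-of-thought completion and score accuracy against gold labels. The helper names and the stub "model" below are hypothetical illustrations, not the poster's actual harness; real PrOntoQA prompts and labels differ.

```python
import re

def extract_answer(completion: str):
    """Pull the last True/False verdict mentioned in a CoT completion."""
    matches = re.findall(r"\b(true|false)\b", completion.lower())
    return matches[-1] if matches else None

def accuracy(examples, generate):
    """examples: list of (prompt, gold_label) pairs; generate: model callable."""
    correct = 0
    for prompt, gold in examples:
        pred = extract_answer(generate(prompt))
        correct += int(pred == gold.lower())
    return correct / len(examples)

# Toy usage with a stub model that always concludes "True":
examples = [("Is Rex a mammal?", "True"), ("Is Rex cold-blooded?", "False")]
stub = lambda p: "Rex is a dog. Dogs are mammals. Therefore the answer is True."
print(accuracy(examples, stub))  # 0.5 — first example right, second wrong
```

Taking the *last* True/False token is a common heuristic for CoT outputs, since intermediate reasoning steps often mention both labels before the final verdict; fragile parsing here is one plausible source of the output-processing errors mentioned for dbrx-base.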