Automated Custom Multiple-Choice Benchmark #7419
strawberrymelonpanda
started this conversation in
Show and tell
-
If any of you are members of the /r/LocalLlama subreddit, I've been trying to share this over there this weekend without success. The Automod removed it because my account was too new, and now it seems the general Reddit filters aren't letting me try again. If anyone cares to share this over there for anyone who may be interested, I'd appreciate it.
-
Sharing this with anyone interested.
If, like me, you've been irked by the lack of automated benchmark tools that don't require you to be a machine-learning practitioner with VERY specific data formats, this might be useful.
A Llama.cpp PR from a while back added the --binary-file and --multiple-choice flags, but you could only use a few common datasets like MMLU. I've made an encoder so that you can easily build your own custom datasets to test with.
You'll find it and instructions at this gist.
Encode.cpp will convert a JSON file to a .bin file usable by llama.cpp's --binary-file and --multiple-choice flags.
It expects a specific JSON format, with label 1 being the correct answer; this is the format you'll get if using the convert program here.
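The format listing itself didn't survive the copy here. Going by the description, where label 1 marks the correct answer, one plausible shape is sketched below; the field names are my guesses, so check the gist for the real schema:

```json
[
  {
    "question": "What is the capital of France?",
    "answers": [
      { "label": 1, "text": "Paris" },
      { "label": 0, "text": "London" },
      { "label": 0, "text": "Berlin" },
      { "label": 0, "text": "Madrid" }
    ]
  }
]
```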
As with that program, compile it and run it from the command line.
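A minimal sketch of the compile-and-run step; the file and binary names here are my assumptions, so check the gist for the exact commands:

```shell
# Hypothetical names; the gist has the exact invocation.
g++ -O2 -std=c++11 -o encode encode.cpp

# Convert the JSON question set to the binary format.
./encode questions.json questions.bin
```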
To further simplify the process, I'm including tojson.py, which turns a simpler plain-text format into the proper JSON, ready to be further converted to binary. It expects one question per block, with a blank line between question sets and the correct answer as the first option. (The answer order will be shuffled in the JSON.)
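Reconstructed from the description above (a hypothetical example, since the original listing isn't shown), a question set looks like the RAW string below. The function is a minimal Python sketch of the kind of conversion tojson.py performs; the output field names are assumptions matching the JSON shape guessed at earlier:

```python
import json
import random

# Hypothetical input: question line first, answers after it,
# correct answer as the first option, blank line between blocks.
RAW = """\
What is 2 + 2?
4
3
5
22

Which planet is closest to the Sun?
Mercury
Venus
Mars
"""

def to_json(text):
    tasks = []
    for block in text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        question, answers = lines[0], lines[1:]
        correct = answers[0]          # first option is the correct one
        random.shuffle(answers)       # answer order is shuffled in the JSON
        tasks.append({
            "question": question,     # field names are assumptions
            "answers": [
                {"label": 1 if a == correct else 0, "text": a}
                for a in answers
            ],
        })
    return tasks

print(json.dumps(to_json(RAW), indent=2))
```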
As an experiment, I've included a function addMultipleQuestions, which you can substitute for the two calls to addQuestion. It includes the same question once per possible answer position, with the correct answer shuffled to each position in turn. So far I've not noticed any strong difference in scoring, at the cost of extra processing time, so it's not the default.
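I won't reproduce the actual Encode.cpp source here, but the idea behind addMultipleQuestions can be sketched in Python: emit one copy of the question per answer slot, rotating the correct answer through every position. Names and fields here are illustrative only:

```python
def add_multiple_questions(question, answers):
    """Emit one variant of the question per answer slot, with the
    correct answer (answers[0]) placed at each position in turn."""
    correct, wrong = answers[0], answers[1:]
    variants = []
    for pos in range(len(answers)):
        # Insert the correct answer at position `pos` among the wrong ones.
        opts = wrong[:pos] + [correct] + wrong[pos:]
        variants.append({"question": question,
                         "answers": opts,
                         "correct_index": pos})
    return variants
```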
Then run the resulting .bin file through llama.cpp.
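The invocation itself isn't reproduced above. Assuming llama.cpp's perplexity tool (llama-perplexity in current builds, perplexity in older ones) and placeholder file names, it would look roughly like:

```shell
# model.gguf and questions.bin are placeholder names.
./llama-perplexity -m model.gguf \
    --binary-file questions.bin --multiple-choice
```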