Automated Custom Multiple-Choice Benchmark #7419

strawberrymelonpanda · 2024-05-20T19:14:08Z

strawberrymelonpanda
May 20, 2024

Sharing this with anyone interested.

If you're like me and the lack of automated benchmark tools that don't require you to be a machine learning practitioner with VERY specific data formats to use has irked you, this might be useful.

A Llama.cpp PR from awhile back allowed you to specify a --binary-file and --multiple-choice flag, but you could only use a few common datasets like MMLU. I've made an encoder so that you can easily make your own custom datasets to test with.

You'll find it and instructions at this gist.

Encode.cpp will convert a JSON file to a .bin file usable by llama.cpp's --binary-file and --multiple-choice flags.

It expects a format of:

[
  { 
    "multiple_correct": {"answers": [], "labels": [] },
    "question": "Question: \"QUESTION TEXT\" Answer:",
    "single_correct": {

    "answers": [
      "ANSWER TEXT 1",
      "ANSWER TEXT 2",
      "ANSWER TEXT 3",
      "ANSWER TEXT 4"
    ],
    "labels": [0,0,0,1]
  }
]

(Label 1 being the correct answer) This is the format you'll get if using the convert program here.

Like with that program, compile and use with:

g++ -o encode encode.cpp
./encode arc-easy-validation.json arc-easy-validation.bin
./perplexity -m model -bf arc-easy-validation.bin --multiple-choice

To further simplify the process, I'm including tojson.py, which can turn a simpler plan-text format into the proper JSON, to then be further converted to binary. It expects the following format:

Q:] QUESTION TEXT
A1:] CORRECT ANSWER TEXT
A2:] ANSWER
A3:] ANSWER
A4:] ANSWER

Q:] QUESTION TEXT
A1:] CORRECT ANSWER TEXT
A2:] ANSWER
A3:] ANSWER
A4:] ANSWER

With a newline between each question set, and the correct answer as the first option. (The answer order will be shuffled in the JSON)

As an experiment, I've included a function addMultipleQuestions which you can replace the two calls to addQuestion with - which will include the same question multiple times for each possible answer, with the correct answer shuffled to each possible position. So far I've not noticed any strong difference in scoring, at the expense of extra processing time. so it's not the default.

Use like:

python tojson.py custom-test.txt custom-test.json
./encode custom-test.json custom-test.bin
./perplexity -m model -bf custom-test.bin --multiple choice

strawberrymelonpanda · 2024-05-20T19:18:01Z

strawberrymelonpanda
May 20, 2024
Author

If any of you are a member of the /r/LocalLlama subreddit, I've been trying to share this over there this weekend without success. The Automod removed it because my account was too new, and now it seems the general Reddit filters aren't allowing me to try again.

If anyone cares to share this over there for anyone who may be interested, I'd appreciate it.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automated Custom Multiple-Choice Benchmark #7419

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment

{{title}}

Select a reply

Automated Custom Multiple-Choice Benchmark #7419

strawberrymelonpanda May 20, 2024

Replies: 1 comment

strawberrymelonpanda May 20, 2024 Author

strawberrymelonpanda
May 20, 2024

strawberrymelonpanda
May 20, 2024
Author