
C-Eval performance and script


Results on C-Eval

This project evaluated the performance of the related models on the recently released C-Eval benchmark. Its test set consists of 12.3K multiple-choice questions covering 52 subjects. Below are the average validation and test set results for some of the models. For the complete results, please refer to our technical report.

Model                     Valid (zero-shot)   Valid (5-shot)   Test (zero-shot)   Test (5-shot)
Chinese-Alpaca-33B        43.3                42.6             41.6               40.4
Chinese-LLaMA-33B         34.9                38.4             34.6               39.5
Chinese-Alpaca-Plus-13B   43.3                42.4             41.5               39.9
Chinese-LLaMA-Plus-13B    27.3                34.0             27.8               33.3
Chinese-Alpaca-Plus-7B    36.7                32.9             36.4               32.3
Chinese-LLaMA-Plus-7B     27.3                28.3             26.9               28.4

In the following, we introduce how to run predictions on the C-Eval dataset. Users can also refer to our Colab Notebook.

Data Preparation

Download the dataset from the official C-Eval repository and unzip the file into the data folder:

wget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip
unzip ceval-exam.zip -d data

Move the data folder to the scripts/ceval directory of this project.
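
Optionally, you can verify that the data are in place before running the evaluation. The sketch below assumes the official archive unpacks into dev/, val/ and test/ subfolders of per-subject CSV files; it is not part of this project's scripts.

import os

# Assumed layout of the unpacked C-Eval archive:
#   data/dev/<subject>_dev.csv, data/val/<subject>_val.csv, data/test/<subject>_test.csv
data_dir = "data"
for split in ("dev", "val", "test"):
    csv_files = [f for f in os.listdir(os.path.join(data_dir, split)) if f.endswith(".csv")]
    print(f"{split}: {len(csv_files)} subject files")  # 52 subjects are expected per split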

Running the Evaluation Script

Run the following script:

model_path=path/to/chinese_llama_or_alpaca
output_path=path/to/your_output_dir

cd scripts/ceval
python eval.py \
    --model_path ${model_path} \
    --cot False \
    --few_shot False \
    --with_prompt True \
    --constrained_decoding True \
    --temperature 0.2 \
    --n_times 1 \
    --ntrain 5 \
    --do_save_csv False \
    --do_test False \
    --output_dir ${output_path}

Arguments

  • model_path: Path to the model to be evaluated (the model merged with LoRA in HF format)

  • cot: Whether to use chain-of-thought

  • few_shot: Whether to use few-shot

  • ntrain: Specifies the number of few-shot demonstrations when few_shot=True (5-shot: ntrain=5); when few_shot=False, this argument has no effect

  • with_prompt: Whether the input to the model contains the instruction prompt used by Alpaca models

  • constrained_decoding: Since the standard answer format of C-Eval is a single option 'A'/'B'/'C'/'D', we provide two methods for extracting answers from the models' outputs (see the sketch after this argument list):

    • constrained_decoding=True: Compute the probability that the first token generated by the model is 'A', 'B', 'C' or 'D', and choose the one with the highest probability as the answer

    • constrained_decoding=False: Extract the answer token from the model's outputs with regular expressions

  • temperature: Temperature for decoding

  • n_times: The number of repeated evaluations. Folders will be generated under output_dir corresponding to the specified number of times

  • do_save_csv: Whether to save the model outputs, extracted answers, etc. in CSV files

  • output_dir: Output path of results

  • do_test: Whether to evaluate on the valid or test set: evaluate on the valid set when do_test=False and on the test set when do_test=True
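
Below is a minimal sketch of the two answer-extraction approaches, for illustration only: it is not the project's eval.py implementation, and the model loading, prompt handling and regular expression are simplified assumptions.

import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/chinese_llama_or_alpaca"  # placeholder: LoRA-merged model in HF format
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
model.eval()

choices = ["A", "B", "C", "D"]
# token id of each option letter (tokenization details may differ between models)
choice_ids = [tokenizer.encode(c, add_special_tokens=False)[-1] for c in choices]

def constrained_answer(prompt: str) -> str:
    # constrained_decoding=True: compare the model's next-token probabilities of 'A'/'B'/'C'/'D'
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(next_token_logits[choice_ids], dim=-1)
    return choices[int(probs.argmax())]

def regex_answer(generated_text: str) -> str:
    # constrained_decoding=False: pick the first option letter found in the free-form output
    # (the patterns used by the actual script are more elaborate)
    match = re.search(r"[ABCD]", generated_text)
    return match.group(0) if match else ""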

Evaluation Output

  • The evaluation script creates the directories outputs/take* when the evaluation finishes, where * is a number ranging from 0 to n_times-1, storing the results of the n_times repeated evaluations respectively.

  • In each outputs/take*, there will be a submission.json and a summary.json. If do_save_csv=True, there will also be 52 CSV files containing the model outputs, extracted answers, etc. for each subject.

  • submission.json stores the generated answers in the official submission format and can be submitted for evaluation:
{
    "computer_network": {
        "0": "A",
        "1": "B",
        ...
    },
    "marxism": {
        "0": "B",
        "1": "A",
        ...
    },
    ...
}
  • summary.json stores the model evaluation results over the 52 subjects, 4 broader categories, and the overall average. For instance, the 'All' key at the end of the JSON file shows the overall average score:

      "All": {
        "score": 0.36701337295690933,
        "num": 1346,
      "correct": 494.0
    }

where score is the overall accuracy, num is the total number of evaluation examples, and correct is the number of correct predictions.
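
As a quick consistency check, score should equal correct divided by num (here 494/1346 ≈ 0.367). A minimal sketch for reading the overall result, assuming the summary.json path below points at one of the generated take* directories:

import json

# placeholder path: adjust to your output_dir and the take* folder you want to inspect
with open("outputs/take0/summary.json", "r", encoding="utf-8") as f:
    summary = json.load(f)

overall = summary["All"]
print(f"accuracy: {overall['score']:.4f} ({overall['correct']:.0f}/{overall['num']})")
assert abs(overall["score"] - overall["correct"] / overall["num"]) < 1e-6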

⚠️ Note that when evaluating on the test set (do_test=True), score and correct are 0 since no labels are available. Obtaining test set results requires submitting the submission.json file to the C-Eval organizers; for detailed instructions, please refer to the official submission process provided by C-Eval.
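
Before submitting, it can be useful to sanity-check that submission.json is well formed, i.e. that every predicted answer is one of 'A'/'B'/'C'/'D' and all 52 subjects are present. A minimal sketch (the file path is a placeholder):

import json

# placeholder path: point this at the submission.json you plan to upload
with open("outputs/take0/submission.json", "r", encoding="utf-8") as f:
    submission = json.load(f)

valid_answers = {"A", "B", "C", "D"}
for subject, answers in submission.items():
    bad = {idx: ans for idx, ans in answers.items() if ans not in valid_answers}
    if bad:
        print(f"{subject}: unexpected answers {bad}")
print(f"checked {len(submission)} subjects")  # 52 subjects are expected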
