
C-Eval performance and script


Results on C-Eval

This project evaluated the performance of the related models on the recently released C-Eval benchmark. Its test set consists of 12.3K multiple-choice questions covering 52 subjects. Below are the average validation and test set results for some of the models. For the complete results, please refer to our technical report.

Model                     Valid (zero-shot)   Valid (5-shot)   Test (zero-shot)   Test (5-shot)
Chinese-Alpaca-33B        43.3                42.6             41.6               40.4
Chinese-LLaMA-33B         34.9                38.4             34.6               39.5
Chinese-Alpaca-Plus-13B   43.3                42.4             41.5               39.9
Chinese-LLaMA-Plus-13B    27.3                34.0             27.8               33.3
Chinese-Alpaca-Plus-7B    36.7                32.9             36.4               32.3
Chinese-LLaMA-Plus-7B     27.3                28.3             26.9               28.4

In the following, we introduce how to run predictions on the C-Eval dataset. Users can also refer to our Colab Notebook.

Data Preparation

Download the dataset from the official C-Eval repository and unzip the file into the data folder:

wget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip
unzip ceval-exam.zip -d data

Move the data folder to the scripts/ceval directory of this project.
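
Optionally, you can verify that the data are in place before running the evaluation. The sketch below assumes the official archive unpacks into dev/, val/ and test/ subfolders of per-subject CSV files; it is not part of this project's scripts.

import os

# Assumed layout of the unpacked C-Eval archive:
#   data/dev/<subject>_dev.csv, data/val/<subject>_val.csv, data/test/<subject>_test.csv
data_dir = "data"
for split in ("dev", "val", "test"):
    csv_files = [f for f in os.listdir(os.path.join(data_dir, split)) if f.endswith(".csv")]
    print(f"{split}: {len(csv_files)} subject files")  # 52 subjects are expected per split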

Running the Evaluation Script

Run the following script:

model_path=path/to/chinese_llama_or_alpaca
output_path=path/to/your_output_dir

cd scripts/ceval
python eval.py \
    --model_path ${model_path} \
    --cot False \
    --few_shot False \
    --with_prompt True \
    --constrained_decoding True \
    --temperature 0.2 \
    --n_times 1 \
    --ntrain 5 \
    --do_save_csv False \
    --do_test False \
    --output_dir ${output_path}

Arguments

  • model_path: Path to the model to be evaluated (the model merged with LoRA in HF format)

  • cot: Whether to use chain-of-thought

  • few_shot: Whether to use few-shot

  • ntrain: Specifies the number of few-shot demonstrations when few_shot=True (5-shot: ntrain=5); when few_shot=False, this argument has no effect

  • with_prompt: Whether the input to the model contains the instruction prompt used by Alpaca models

  • constrained_decoding: Since the standard answer format of C-Eval is a single option 'A'/'B'/'C'/'D', we provide two methods for extracting answers from the models' outputs (see the sketch after this argument list):

    • constrained_decoding=True: Compute the probability that the first token generated by the model is 'A', 'B', 'C' or 'D', and choose the one with the highest probability as the answer

    • constrained_decoding=False: Extract the answer token from the model's outputs with regular expressions

  • temperature: Temperature for decoding

  • n_times: The number of repeated evaluations. Folders will be generated under output_dir corresponding to the specified number of times

  • do_save_csv: Whether to save the model outputs, extracted answers, etc. in CSV files

  • output_dir: Output path of results

  • do_test: Whether to evaluate on the valid or test set: evaluate on the valid set when do_test=False and on the test set when do_test=True
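
Below is a minimal sketch of the two answer-extraction approaches, for illustration only: it is not the project's eval.py implementation, and the model loading, prompt handling and regular expression are simplified assumptions.

import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/chinese_llama_or_alpaca"  # placeholder: LoRA-merged model in HF format
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
model.eval()

choices = ["A", "B", "C", "D"]
# token id of each option letter (tokenization details may differ between models)
choice_ids = [tokenizer.encode(c, add_special_tokens=False)[-1] for c in choices]

def constrained_answer(prompt: str) -> str:
    # constrained_decoding=True: compare the model's next-token probabilities of 'A'/'B'/'C'/'D'
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(next_token_logits[choice_ids], dim=-1)
    return choices[int(probs.argmax())]

def regex_answer(generated_text: str) -> str:
    # constrained_decoding=False: pick the first option letter found in the free-form output
    # (the patterns used by the actual script are more elaborate)
    match = re.search(r"[ABCD]", generated_text)
    return match.group(0) if match else ""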

Evaluation Output

  • The evaluation script creates the directories outputs/take* when the evaluation finishes, where * is a number ranging from 0 to n_times-1, storing the results of the n_times repeated evaluations respectively.

  • In each outputs/take*, there will be a submission.json and a summary.json. If do_save_csv=True, there will also be 52 CSV files containing the model outputs, extracted answers, etc. for each subject.

  • submission.json stores the generated answers in the official submission format and can be submitted for evaluation:
{
    "computer_network": {
        "0": "A",
        "1": "B",
        ...
    },
    "marxism": {
        "0": "B",
        "1": "A",
        ...
    },
    ...
}
  • summary.json stores the model evaluation results over the 52 subjects, 4 broader categories, and the overall average. For instance, the 'All' key at the end of the JSON file shows the overall average score:

      "All": {
        "score": 0.36701337295690933,
        "num": 1346,
      "correct": 494.0
    }

where score is the overall accuracy, num is the total number of evaluation examples, and correct is the number of correct predictions.
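
As a quick consistency check, score should equal correct divided by num (here 494/1346 ≈ 0.367). A minimal sketch for reading the overall result, assuming the summary.json path below points at one of the generated take* directories:

import json

# placeholder path: adjust to your output_dir and the take* folder you want to inspect
with open("outputs/take0/summary.json", "r", encoding="utf-8") as f:
    summary = json.load(f)

overall = summary["All"]
print(f"accuracy: {overall['score']:.4f} ({overall['correct']:.0f}/{overall['num']})")
assert abs(overall["score"] - overall["correct"] / overall["num"]) < 1e-6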

⚠️ Note that when evaluating on the test set (do_test=True), score and correct are 0 since no labels are available. Obtaining test set results requires submitting the submission.json file to the C-Eval organizers; for detailed instructions, please refer to the official submission process provided by C-Eval.
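
Before submitting, it can be useful to sanity-check that submission.json is well formed, i.e. that every predicted answer is one of 'A'/'B'/'C'/'D' and all 52 subjects are present. A minimal sketch (the file path is a placeholder):

import json

# placeholder path: point this at the submission.json you plan to upload
with open("outputs/take0/submission.json", "r", encoding="utf-8") as f:
    submission = json.load(f)

valid_answers = {"A", "B", "C", "D"}
for subject, answers in submission.items():
    bad = {idx: ans for idx, ans in answers.items() if ans not in valid_answers}
    if bad:
        print(f"{subject}: unexpected answers {bad}")
print(f"checked {len(submission)} subjects")  # 52 subjects are expected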
