Why does llama2-7b-chat-hf deployed via the OpenAI API score far below the official numbers on the MMLU dataset? #1037
Comments
I don't understand this part:

    models = [
        dict(
            type=OpenAI,  # use the OpenAI model wrapper
            # `OpenAI` initialization parameters:
            path='/huggingface.co/meta-llama/Llama-2-7b-chat-hf',  # model identifier
            openai_api_base='http://localhost:8000/v1/chat/completions',
            key='-',  # OpenAI API key
            max_seq_len=2048,  # maximum input length
            # Parameters shared by all model types, not `OpenAI`-specific:
            abbr='Llama-2-7b',  # model abbreviation
            run_cfg=dict(num_gpus=0),  # resource requirements (no GPU needed)
            max_out_len=512,  # maximum generation length
            batch_size=8,  # batch size
        ),
    ]

Why do you use the OpenAI API with Llama? I think these are two totally different models. You should download the Llama model and use it locally. Please take the Llama config example for reference.
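For reference, a local-model config in that style might look roughly like this (a sketch modeled on OpenCompass's HuggingFace examples; the exact field values here are assumptions, not copied from the repo):

    from opencompass.models import HuggingFaceCausalLM

    models = [
        dict(
            type=HuggingFaceCausalLM,  # load the weights locally instead of calling an API
            abbr='llama-2-7b-chat-hf',
            path='meta-llama/Llama-2-7b-chat-hf',
            tokenizer_path='meta-llama/Llama-2-7b-chat-hf',
            max_seq_len=2048,
            max_out_len=512,
            batch_size=8,
            run_cfg=dict(num_gpus=1),  # a local 7B model needs at least one GPU
        ),
    ]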
I deployed Llama-2 with vLLM to expose an OpenAI-compatible API, and I would like OpenCompass to evaluate the model through that API rather than by loading the local model directly.
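For context, such a deployment is typically launched with vLLM's OpenAI-compatible server along these lines (a sketch; the exact command used in this setup is not shown in the issue):

    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-2-7b-chat-hf \
        --port 8000

This serves /v1/chat/completions on localhost:8000, matching the openai_api_base in the config above. vLLM also accepts an optional --chat-template flag to override the tokenizer's built-in chat template, which becomes relevant later in this thread.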
Oh, I see. Can you share the prediction result from your output?
This is a part of the prediction:

    "S_5.\nA. 8\nB. 2\nC. 24\nD. 120\nAnswer: ",
It seems the problem is that your prompt template is not taking effect. Here is my MMLU prediction; as you can see, the output is only the A/B/C/D options.
Do you know why this problem occurs? I inspected the configuration file and there seems to be no problem with the prompt template, but it just does not take effect.

    dict(abbr='lukaemon_mmlu_college_biology',
Maybe you need some tool to check. Try https://github.com/open-compass/opencompass/blob/main/tools/prompt_viewer.py to see the prompt fed to your model? (Actually I don't know how to inspect the prompt, but judging from its name this tool looks worth trying. I am just a user, not a developer.)
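For example, an invocation along these lines should print the prompts that would be sent to the model (a sketch; the -n non-interactive and -a all-combinations flags follow the tool's documented usage and may differ between versions):

    python tools/prompt_viewer.py configs/api_examples/eval_api_openai.py -n -a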
This happens because you are testing a "chat" model, which needs a specific "chat template" to work, not just the evaluation's prompt template. From the previous comments you can figure out the "<|im_start|>" markers; that part is important. You are using the chat-completion API from OpenAI, but you probably have not applied the correct chat template when configuring the inference service.
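To illustrate, here is a minimal sketch of rendering Llama-2's chat template with the Hugging Face tokenizer (it assumes transformers >= 4.34; note that Llama-2's template uses [INST] ... [/INST] markers, while <|im_start|> belongs to ChatML-style models, and the message content below is a stand-in rather than the actual MMLU prompt):

    from transformers import AutoTokenizer

    # The tokenizer that ships with the chat model carries the chat template
    # that the serving layer is expected to apply.
    tok = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')

    messages = [{'role': 'user', 'content': 'A. 8\nB. 2\nC. 24\nD. 120\nAnswer:'}]
    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    print(prompt)  # expected to look like "<s>[INST] ... [/INST]"

If the string the server actually feeds the model lacks these markers, the chat model falls back to plain text continuation and tends to generate more quiz text instead of a single A/B/C/D answer, which matches the near-random 26% MMLU score reported in this issue.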
Prerequisites
Type of issue
I am evaluating with officially supported tasks/models/datasets.
Environment
torch 2.1.2
vllm 0.4.0
opencompass 0.2.3
Reproducing the problem - code/configuration example
Reproducing the problem - command or script
python run.py /home/evllm/opencompass/configs/api_examples/eval_api_openai.py
Reproducing the problem - error message
The official accuracy is about 45%, but evaluating with opencompass I only get 26%, which is hardly better than random guessing. Is there a mistake somewhere in my config file, or is there some other cause?
20240411_014034
tabulate format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
dataset version metric mode Llama-2-7b
--------- Exam ---------    -    -    -    -
ceval - - - -
agieval - - - -
mmlu - naive_average gen 26.57
GaokaoBench - - - -
ARC-c - - - -
--------- Language ---------    -    -    -    -
WiC - - - -
summedits - - - -
chid-dev - - - -
afqmc-dev - - - -
bustm-dev - - - -
cluewsc-dev - - - -
WSC - - - -
winogrande - - - -
flores_100 - - - -
--------- Knowledge ---------    -    -    -    -
BoolQ - - - -
commonsense_qa - - - -
nq - - - -
triviaqa - - - -
--------- Reasoning ---------    -    -    -    -
cmnli - - - -
ocnli - - - -
ocnli_fc-dev - - - -
AX_b - - - -
AX_g - - - -
CB - - - -
RTE - - - -
story_cloze - - - -
COPA - - - -
ReCoRD - - - -
hellaswag - - - -
piqa - - - -
siqa - - - -
strategyqa - - - -
math - - - -
gsm8k - - - -
TheoremQA - - - -
openai_humaneval - - - -
mbpp - - - -
bbh - - - -
--------- Understanding ---------    -    -    -    -
C3 - - - -
CMRC_dev - - - -
DRCD_dev - - - -
MultiRC - - - -
race-middle - - - -
race-high - - - -
openbookqa_fact - - - -
csl_dev - - - -
lcsts - - - -
Xsum - - - -
eprstmt-dev - - - -
lambada - - - -
tnews-dev - - - -
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
-------------------------------------------------------------------------------------------------------------------------------- THIS IS A DIVIDER --------------------------------------------------------------------------------------------------------------------------------
raw format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Model: Llama-2-7b
lukaemon_mmlu_college_biology: {'accuracy': 17.36111111111111}
lukaemon_mmlu_college_chemistry: {'accuracy': 40.0}
lukaemon_mmlu_college_computer_science: {'accuracy': 32.0}
lukaemon_mmlu_college_mathematics: {'accuracy': 27.0}
lukaemon_mmlu_college_physics: {'accuracy': 20.588235294117645}
lukaemon_mmlu_electrical_engineering: {'accuracy': 24.82758620689655}
lukaemon_mmlu_astronomy: {'accuracy': 30.92105263157895}
lukaemon_mmlu_anatomy: {'accuracy': 20.74074074074074}
lukaemon_mmlu_abstract_algebra: {'accuracy': 28.999999999999996}
lukaemon_mmlu_machine_learning: {'accuracy': 31.25}
lukaemon_mmlu_clinical_knowledge: {'accuracy': 21.88679245283019}
lukaemon_mmlu_global_facts: {'accuracy': 31.0}
lukaemon_mmlu_management: {'accuracy': 37.86407766990291}
lukaemon_mmlu_nutrition: {'accuracy': 24.18300653594771}
lukaemon_mmlu_marketing: {'accuracy': 29.059829059829063}
lukaemon_mmlu_professional_accounting: {'accuracy': 23.404255319148938}
lukaemon_mmlu_high_school_geography: {'accuracy': 18.68686868686869}
lukaemon_mmlu_international_law: {'accuracy': 23.96694214876033}
lukaemon_mmlu_moral_scenarios: {'accuracy': 26.145251396648046}
lukaemon_mmlu_computer_security: {'accuracy': 28.000000000000004}
lukaemon_mmlu_high_school_microeconomics: {'accuracy': 20.588235294117645}
lukaemon_mmlu_professional_law: {'accuracy': 25.03259452411995}
lukaemon_mmlu_medical_genetics: {'accuracy': 26.0}
lukaemon_mmlu_professional_psychology: {'accuracy': 22.712418300653596}
lukaemon_mmlu_jurisprudence: {'accuracy': 26.851851851851855}
lukaemon_mmlu_world_religions: {'accuracy': 21.637426900584796}
lukaemon_mmlu_philosophy: {'accuracy': 31.189710610932476}
lukaemon_mmlu_virology: {'accuracy': 24.69879518072289}
lukaemon_mmlu_high_school_chemistry: {'accuracy': 26.108374384236456}
lukaemon_mmlu_public_relations: {'accuracy': 20.909090909090907}
lukaemon_mmlu_high_school_macroeconomics: {'accuracy': 20.76923076923077}
lukaemon_mmlu_human_sexuality: {'accuracy': 26.717557251908396}
lukaemon_mmlu_elementary_mathematics: {'accuracy': 21.164021164021165}
lukaemon_mmlu_high_school_physics: {'accuracy': 30.4635761589404}
lukaemon_mmlu_high_school_computer_science: {'accuracy': 34.0}
lukaemon_mmlu_high_school_european_history: {'accuracy': 41.81818181818181}
lukaemon_mmlu_business_ethics: {'accuracy': 21.0}
lukaemon_mmlu_moral_disputes: {'accuracy': 25.14450867052023}
lukaemon_mmlu_high_school_statistics: {'accuracy': 43.51851851851852}
lukaemon_mmlu_miscellaneous: {'accuracy': 28.735632183908045}
lukaemon_mmlu_formal_logic: {'accuracy': 19.047619047619047}
lukaemon_mmlu_high_school_government_and_politics: {'accuracy': 19.689119170984455}
lukaemon_mmlu_prehistory: {'accuracy': 20.061728395061728}
lukaemon_mmlu_security_studies: {'accuracy': 17.142857142857142}
lukaemon_mmlu_high_school_biology: {'accuracy': 18.387096774193548}
lukaemon_mmlu_logical_fallacies: {'accuracy': 33.12883435582822}
lukaemon_mmlu_high_school_world_history: {'accuracy': 24.894514767932492}
lukaemon_mmlu_professional_medicine: {'accuracy': 37.13235294117647}
lukaemon_mmlu_high_school_mathematics: {'accuracy': 24.814814814814813}
lukaemon_mmlu_college_medicine: {'accuracy': 20.809248554913296}
lukaemon_mmlu_high_school_us_history: {'accuracy': 34.80392156862745}
lukaemon_mmlu_sociology: {'accuracy': 27.363184079601986}
lukaemon_mmlu_econometrics: {'accuracy': 23.684210526315788}
lukaemon_mmlu_high_school_psychology: {'accuracy': 35.22935779816514}
lukaemon_mmlu_human_aging: {'accuracy': 20.62780269058296}
lukaemon_mmlu_us_foreign_policy: {'accuracy': 28.000000000000004}
lukaemon_mmlu_conceptual_physics: {'accuracy': 32.76595744680851}
mmlu-humanities: {'naive_average': 27.209468158205258}
mmlu-stem: {'naive_average': 28.047951855051497}
mmlu-social-science: {'naive_average': 23.45767749414954}
mmlu-other: {'naive_average': 26.646291737612497}
mmlu: {'naive_average': 26.57066831265621}
mmlu-weighted: {'weighted_average': 26.157242558040164}
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
Other information
No response