Evaluation for full-parameter tuning models

Note: change conv-mode to minicpm, phi3, or llama when MODEL_TYPE is minicpm, phi-3, or llama3-8b, respectively.
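
In practice this means each MODEL_TYPE pairs with exactly one conv-mode value (a sketch only; where conv-mode is set depends on the individual evaluation script):

    # MODEL_TYPE    conv-mode
    # minicpm    -> minicpm
    # phi-3      -> phi3
    # llama3-8b  -> llama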

MME

  1. Refer to MME GitHub to download the benchmark dataset and put MME_Benchmark_release_version under eval/mme.
  2. Update MODEL_TYPE and TARGET_DIR accordingly.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/mme.sh

The responses and scores can be found in eval/mme/answers_upload.
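
If step 1 put the data in the right place, a quick sanity check before running the script should show the benchmark directory (a sketch, assuming the layout described above):

    ls eval/mme/MME_Benchmark_release_version   # should list the MME benchmark files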

MMBench & MMBench-Chinese

  1. Refer to MMBench GitHub to download the benchmark dataset. We support MMBench-Dev, MMBench-Test, MMBench-Dev (cn) and MMBench-Test (cn). Please note that only the files downloaded via the legacy links are supported. Put MMBench_DEV_EN_legacy.tsv, MMBench_TEST_EN_legacy.tsv, MMBench_DEV_CN_legacy.tsv or MMBench_TEST_CN_legacy.tsv under eval/mmbench.
  2. Update SPLIT, LANG (en/cn), MODEL_TYPE and TARGET_DIR accordingly.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/mmbench.sh

The response file can be found in eval/mmbench/answers_upload. You can submit the Excel file via the submission link to obtain the evaluation scores.
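
Depending on which splits you downloaded in step 1, eval/mmbench should contain the corresponding legacy TSV files (a sanity-check sketch):

    ls eval/mmbench
    # MMBench_DEV_EN_legacy.tsv   MMBench_TEST_EN_legacy.tsv
    # MMBench_DEV_CN_legacy.tsv   MMBench_TEST_CN_legacy.tsv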

SEED-Bench-1

  1. Refer to SEED-Bench Instruction to download the images and videos, and put the images under eval/seed-bench/SEED-Bench-image and the videos under eval/seed-bench/SEED-Bench-video. Then, extract the middle frame of each downloaded video by running:

    pip install av decord
    python eval/seed-bench/extract_video_frames.py
  2. Update MODEL_TYPE and TARGET_DIR accordingly.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash script/eval/full/seedbench.sh

The response file can be found in eval/seed-bench/answers_upload and the scores can be found in eval/seed-bench/scores.
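
Before the frame-extraction step in step 1, the downloaded data should sit in the two directories named above (a sanity-check sketch):

    ls eval/seed-bench
    # SEED-Bench-image/  SEED-Bench-video/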

MMMU

  1. Refer to MMMU HuggingFace to download the benchmark dataset and put MMMU under eval/mmmu.
  2. Update SPLIT, MODEL_TYPE and TARGET_DIR accordingly. You may add --small-gpu-usage to avoid CUDA out-of-memory errors.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/mmmu.sh

The response file can be found in eval/mmmu/answers_upload.

For the validation set, you can use eval_mmmu.py to obtain the scores.

python eval/mmmu/eval_mmmu.py \
	--output-path ./eval/mmmu/answers_upload/$SPLIT/$TARGET_DIR.json
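
For example, a concrete invocation might look like this (the SPLIT and TARGET_DIR values below are hypothetical; substitute your own):

    SPLIT=validation            # hypothetical split name
    TARGET_DIR=bunny-phi-3-full # hypothetical checkpoint directory name
    python eval/mmmu/eval_mmmu.py \
        --output-path ./eval/mmmu/answers_upload/$SPLIT/$TARGET_DIR.json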

For the test set, you can submit the JSON response file via the submission link to obtain the evaluation scores.

CMMMU

  1. Refer to CMMMU HuggingFace to download the benchmark dataset and put CMMMU under eval/cmmmu.
  2. Update SPLIT, MODEL_TYPE and TARGET_DIR accordingly. You may add --small-gpu-usage to avoid CUDA out-of-memory errors.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/cmmmu.sh

The response file can be found in eval/cmmmu/answers_upload.

For the validation set, you can use eval_script.py to obtain the scores.

python eval/cmmmu/eval_script.py \
	--output_path ./eval/cmmmu/answers_upload/$SPLIT/$TARGET_DIR.jsonl
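
As with MMMU, a concrete invocation might look like this (hypothetical values; substitute your own):

    SPLIT=val                   # hypothetical split name
    TARGET_DIR=bunny-phi-3-full # hypothetical checkpoint directory name
    python eval/cmmmu/eval_script.py \
        --output_path ./eval/cmmmu/answers_upload/$SPLIT/$TARGET_DIR.jsonl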

For the test set, you can submit the JSONL response file via the submission link to obtain the evaluation scores.

VQAv2

  1. Download COCO 2015 Test images and put test2015 under eval/vqav2. Then:

    tar -zxvf eval/vqav2/bunny_vqav2_mscoco_test2015.tar.gz -C eval/vqav2 && \
    rm eval/vqav2/bunny_vqav2_mscoco_test2015.tar.gz && \
    tar -zxvf eval/vqav2/bunny_vqav2_mscoco_test-dev2015.tar.gz -C eval/vqav2 && \
    rm eval/vqav2/bunny_vqav2_mscoco_test-dev2015.tar.gz
  2. Update MODEL_TYPE and TARGET_DIR accordingly.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash script/eval/full/vqav2.sh

The response file can be found in eval/vqav2/answers_upload. You can submit the JSON response file via the submission link (Test-Dev Phase) to obtain the evaluation scores.
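
After step 1, eval/vqav2 should contain the COCO test images alongside the extracted question files (a sanity-check sketch; the exact file names come from the two archives):

    ls eval/vqav2
    # test2015/  plus the contents of the two extracted bunny_vqav2_* archives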

GQA

  1. Download the images of GQA, unzip them, and put images under eval/gqa. Then:

    tar -zxvf eval/gqa/testdev_balanced_questions.tar.gz -C eval/gqa && rm eval/gqa/testdev_balanced_questions.tar.gz
  2. Update MODEL_TYPE and TARGET_DIR accordingly.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash script/eval/full/gqa.sh
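
After step 1, eval/gqa should contain the unzipped images directory and the extracted testdev questions (a sanity-check sketch; the exact file names come from the archive):

    ls eval/gqa
    # images/  plus the extracted testdev_balanced_questions file(s)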

ScienceQA-IMG

  1. Refer to ScienceQA Google Drive to download test.zip, problems.json and pid_splits.json, unzip test.zip and put them under eval/scienceqa.
  2. Update MODEL_TYPE and TARGET_DIR accordingly.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/scienceqa.sh

The responses and the scores can be found in eval/scienceqa/results.
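
After step 1, eval/scienceqa should contain the unzipped test images plus the two JSON files (a sanity-check sketch, assuming test.zip unzips into a test/ directory):

    ls eval/scienceqa
    # test/  problems.json  pid_splits.json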

POPE

  1. Download COCO 2014 Val images and put val2014 under eval/pope. Then, refer to POPE GitHub to download the benchmark dataset and put the three json files under eval/pope/coco.
  2. Update MODEL_TYPE and TARGET_DIR accordingly.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/pope.sh

We report the F1-score averaged over the three categories (random, popular and adversarial).
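
After step 1, eval/pope should contain the COCO validation images and the three POPE annotation files (a sanity-check sketch; the json file names follow the POPE GitHub release):

    ls eval/pope
    # val2014/  coco/
    ls eval/pope/coco
    # the three POPE json files (random / popular / adversarial)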

MM-Vet

  1. Refer to MM-Vet GitHub to download the benchmark dataset and put images under eval/mm-vet.
  2. Update MODEL_TYPE and TARGET_DIR accordingly.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/mmvet.sh

The response file can be found in eval/mm-vet/answers_upload. You can submit the JSON response file via the submission link to obtain the evaluation scores.
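
After step 1, the MM-Vet images should sit under eval/mm-vet (a sanity-check sketch):

    ls eval/mm-vet
    # images/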