
The test results of lynx on the MSCOCO ITM task are questionable #1

OPilgrim opened this issue Nov 1, 2023 · 4 comments


OPilgrim commented Nov 1, 2023

First of all, thank you for the great work! I ran into a few issues while following the tutorial to reproduce the results.

I first tried to reproduce the lynx ACC on the MSCOCO ITM task, i.e., Table 18 in the paper. I used the following command:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2  run_eval.py \
    --model lynx  --model_name models/interfaces/lynx/configs/LYNX.yaml \
    --dataset_name MSCOCO --output_dir output/lynx/MSCOCO/test_generation/ \
    --per_gpu_eval_batch_size 4 --formulation SingleChoice \
    --infer_method generation --do_eval --half_evaluation  --dataset_duplication 1 \
    --in_context_sample --option_mark upper \
    --dataset_config build/configs/ImageTextMatching_val.yaml \
    --offline_hf

I used generation as the inference method, but the results I got were rather strange:

2023-11-01 16:00:35,236 ReForm-Eval Evaluation INFO: the evalueted SingleChoice result: 0.0
2023-11-01 16:00:35,236 ReForm-Eval Evaluation INFO: the format hit rate is 0.0

If I use likelihood as the inference method, the results are still different from those in the paper:

2023-11-01 15:39:14,806 ReForm-Eval Evaluation INFO: the evalueted SingleChoice result: 0.5183333333333333
2023-11-01 15:39:14,806 ReForm-Eval Evaluation INFO: the format hit rate is 1.0

I'm at a loss, and I hope you can help point out where the problem might be.

Aweminus (Contributor) commented Nov 1, 2023

Hello, thanks for trying our benchmark!

I ran the command mentioned above and got a reasonable result with generation:

2023-11-01 19:52:21,574 ReForm-Eval Evaluation INFO: the evalueted SingleChoice result: 0.49833333333333335
2023-11-01 19:52:21,575 ReForm-Eval Evaluation INFO: the format hit rate is 0.95

Can you share some samples from the output JSON files?

When you use likelihood as the inference method, you do not need to add --in_context_sample, and you need to change the value of --dataset_duplication.
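
For reference, a likelihood run could look roughly like the one below. This is only a sketch based on your command above: the output directory is arbitrary, and the --dataset_duplication value of 5 is just an illustration, so please set it according to our instructions.

# sketch only: the output dir and --dataset_duplication 5 are placeholders
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2  run_eval.py \
    --model lynx  --model_name models/interfaces/lynx/configs/LYNX.yaml \
    --dataset_name MSCOCO --output_dir output/lynx/MSCOCO/test_likelihood/ \
    --per_gpu_eval_batch_size 4 --formulation SingleChoice \
    --infer_method likelihood --do_eval --half_evaluation --dataset_duplication 5 \
    --option_mark upper \
    --dataset_config build/configs/ImageTextMatching_val.yaml \
    --offline_hf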

OPilgrim (Author) commented:

Sorry for taking so long to reply; my machine broke down recently, so I was not able to run the experiment until now.
I checked the output, and the model's predictions are quite confusing:
The log.txt:

2023-11-13 16:08:57,188 ReForm-Eval Evaluation INFO: Evaluating with -1 GPUs
2023-11-13 16:08:57,189 ReForm-Eval Evaluation INFO: Loading model: lynx with configure: {"device": "cuda", "half": true, "inference_method": "generation", "model_name": "models/interfaces/lynx/configs/LYNX.yaml"}
2023-11-13 16:10:12,067 ReForm-Eval Evaluation INFO: Each GPU consumes memory of 17025
2023-11-13 16:10:12,067 ReForm-Eval Evaluation INFO: Using upper option mark for the single-choice questions
2023-11-13 16:10:12,081 ReForm-Eval Evaluation INFO: Evaluating model: lynx with configure: {"device": "cuda", "half": true, "inference_method": "generation", "model_name": "models/interfaces/lynx/configs/LYNX.yaml"}
2023-11-13 16:10:13,420 ReForm-Eval Evaluation INFO: ***** Runing Evaluation *****
2023-11-13 16:10:13,421 ReForm-Eval Evaluation INFO:   Num examples = 600
2023-11-13 16:10:13,421 ReForm-Eval Evaluation INFO:   Batch size = -4

The MSCOCO_SingleChoice_generation_lynx_LYNX_rank-1.json:

...... {"sample_id": 599, "anno": "Two women standing next to each other with one holding video game controllers.", "answer": "1", "answer_options": ["no", "yes"], "question": "Are the image and caption '{}' representing the same scene? Kindly respond with one following option.", "history": [{"from": "human", "value": "What is the shape of this image? Options: (A) rectangle; (B) circle."}, {"from": "assistant", "value": "The answer is (A) rectangle;"}], "text": "User: What is the shape of this image? Options: (A) rectangle; (B) circle.\nBot: The answer is (A) rectangle;\nUser: Are the image and caption '{}' representing the same scene? Kindly respond with one following option. Options: (A) no; (B) yes.\nBot: The answer is", "question_with_option": "Are the image and caption '{}' representing the same scene? Kindly respond with one following option. Options: (A) no; (B) yes.", "prediction": "js\u0447\u0430\u00e3aftableever\u0446\u0438\u043djs \u0447\u0435 -js\u5206ils Arcjsarcathcienttresjs\u00e3ostildarcangular \u0447\u0435~SelectorFIX compared"}]

Maybe the pre-trained lynx model is not loaded correctly. Are your LYNX settings the same as mine?
My LYNX.yaml:

## Data
image_rdir: "./images/"
# put your test file in jsonl format
test_files: [ "./data/Open_VQA_images.jsonl" ]
# change this prompt for different task
prompt: "User: {question}\nBot:"
# the key must match the vision key in test_files
# if you test Open_VQA_videos.jsonl, need to change to "video"
vision_prompt_dict: "image"
output_prompt_dict: "answer"
data: {
  num_frames: 5,
}
## Model
vision_encoder: 'eva_vit_1b'
video_encoding: 'concate'
add_frame_pos: True
LLM: 'vicuna-7b'
use_flash_attn: False
use_adapter: True
adapter_freq: 2
bridge: 'resampler'
bridge_depth: 3
num_bridge_tokens: 32
## General
use_left_pad: True
lower_text: True
freeze_vit: True
freeze_llm: True
image_res: 420
image_mean: [ 0.48145466, 0.4578275, 0.40821073 ]
image_std: [ 0.26862954, 0.26130258, 0.27577711 ]
## Testing
checkpoint: "./data/finetune_lynx.pt"
## infer params
max_input_tokens: 40
batch_size_test: 16
max_new_tokens: 64
min_length: 2
num_beams: 5
length_penalty: -2.0
top_p: 0.9
top_k: 3
no_repeat_ngram_size: 2
apply_lemmatizer: False
use_nucleus_sampling: True

Also, when I run it, the log shows that the adapter parameters are newly initialized. Is this normal?

### Building LLM (Freeze: True)
### LLM label_smoothing:  0.0
### Use Flash Attn False
### Add adapters to:  [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.43s/it]
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at models/interfaces/lynx/data/vicuna-7b and are newly initialized: ['model.layers.28.output_adapter.adapter_up.bias', 'model.layers.0.output_adapter.adapter_up.weight', 'model.layers.4.output_adapter.adapter_norm_before.bias', 'model.layers.30.output_adapter.adapter_up.bias', 'model.layers.14.output_adapter.adapter_down.bias', 'model.layers.28.output_adapter.adapter_norm_before.weight', ......

Aweminus (Contributor) commented:

Our LYNX.yaml is shown below:

## Data
image_rdir: "./images/"

# put your test file in jsonl format
test_files: [ "./data/Open_VQA_images.jsonl" ]

# change this prompt for different task
prompt: "User: {question}\nBot:"
# the key must match the vision key in test_files
# if you test Open_VQA_videos.jsonl, need to change to "video"
vision_prompt_dict: "image"
output_prompt_dict: "answer"

data: {
  num_frames: 5,
}


## Model
vision_encoder: 'eva_vit_1b'
video_encoding: 'concate'
add_frame_pos: True


LLM: 'vicuna-7b'
LLM_base: '/remote-home/share/LLM_CKPT/vicuna-7B-v1.1/'
use_flash_attn: False
use_adapter: True
adapter_freq: 2


bridge: 'resampler'
bridge_depth: 3
num_bridge_tokens: 32


## General
use_left_pad: True
lower_text: True
freeze_vit: True
freeze_llm: True
image_res: 224
image_mean: [ 0.48145466, 0.4578275, 0.40821073 ]
image_std: [ 0.26862954, 0.26130258, 0.27577711 ]



## Testing
checkpoint: "/remote-home/share/multimodal-models/lynx/finetune_lynx.pt"

## infer params
max_input_tokens: 40
batch_size_test: 16
max_new_tokens: 64
min_length: 2
num_beams: 5
length_penalty: -2.0
top_p: 0.9
top_k: 3
no_repeat_ngram_size: 2
apply_lemmatizer: False
use_nucleus_sampling: True

Have you put the lynx repository in /path/to/ReForm-Eval/models/interfaces/lynx?
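
If not, a rough sketch of the setup would be the following (assuming the target directory does not exist yet; otherwise, clone elsewhere and copy the repository contents into it):

# sketch, assuming models/interfaces/lynx does not already exist
cd /path/to/ReForm-Eval
git clone https://github.com/bytedance/lynx-llm.git models/interfaces/lynx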

OPilgrim (Author) commented:

Yes, I cloned it from https://github.com/bytedance/lynx-llm.git
