Incorrect model inference (what went wrong in my setup) #145

Open
jennyziyi-xu opened this issue Mar 5, 2024 · 0 comments

Hello,

1. I have set up Video-LLaMA from this repo and downloaded all the checkpoints needed for inference (a quick path sanity check is sketched after this list).

The Gradio demo demo_video.py runs perfectly fine; there are no errors when loading the vision encoder, BLIP-2, or LLaMA-2 weights.

2. However, inference gives completely wrong answers, and it seems the model is not using the vision encoder. For example, asking the model about the attached photo gives irrelevant answers (see the second sketch below for a way to test this).

3. I wonder what may have gone wrong in my setup.
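
On step 1: a demo that starts without errors does not always prove every checkpoint was actually consumed; depending on how the loading code handles an empty or wrong path in the eval config, a checkpoint can be skipped silently rather than raising an error. A minimal sanity check is to confirm each configured path exists and has a plausible size on disk. The file names below are illustrative only; substitute the paths from your own eval config:

```python
from pathlib import Path

# Illustrative paths only; replace with the checkpoint locations
# from your own eval config. These are not necessarily the repo's
# default names.
checkpoints = [
    Path("ckpt/VL_LLaMA_2_7B_Finetuned.pth"),  # Video-LLaMA vision branch
    Path("ckpt/llama-2-7b-chat-hf"),           # LLaMA-2 weights directory
    Path("ckpt/imagebind_huge.pth"),           # audio branch, if used
]

for p in checkpoints:
    if not p.exists():
        print(f"MISSING: {p}")
        continue
    nbytes = (sum(f.stat().st_size for f in p.rglob("*") if f.is_file())
              if p.is_dir() else p.stat().st_size)
    print(f"{p}: {nbytes / 1e9:.2f} GB")
```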
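
On step 2: a direct way to test whether the vision branch influences generation at all is to compare the answer conditioned on the real visual features with one conditioned on zeroed features. This is only a sketch under an assumed interface: encode_image and generate are hypothetical placeholders, not the actual Video-LLaMA API (the real entry points are in the chat/conversation code that demo_video.py drives):

```python
import torch

@torch.no_grad()
def vision_pathway_check(model, image_tensor, question):
    """Return True if the visual features change the model's answer.

    `model.encode_image` and `model.generate` are hypothetical
    placeholders; wire this up to the repo's actual chat interface.
    """
    feats = model.encode_image(image_tensor)  # visual embeddings
    answer_with_image = model.generate(question, visual=feats)
    answer_blanked = model.generate(question, visual=torch.zeros_like(feats))
    return answer_with_image != answer_blanked
```

If the two answers are identical, generation is effectively ignoring the image; one plausible cause is the finetuned-checkpoint entry in the eval config being empty or wrong, which can leave the vision-to-language projection layer at its initial weights even though everything else loads cleanly.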

Thank you very much!!
