
OPT2.7B underperforming & weird behavior compared to flant5xl on image captioning? #676

Thomas2419 opened this issue Mar 25, 2024 · 5 comments


@Thomas2419

Thomas2419 commented Mar 25, 2024

Hello! I was finetuning from the pretrained_flant5xl and pretrained_opt2.7b models, and much to my surprise the flant5xl model excels at producing correct labels (my captions are actually a string of labels). My objective was to determine the feasibility of training these models on complicated, interconnected labels, some of which have subcategories and some of which do not. Flan produces all of the words correctly at very high accuracy, while opt starts to randomly use characters like ~ that aren't present anywhere in my dataset, and replaces a couple of sets of labels with "urchin". So anywhere it would presumably predict labels 1, 2, and 3, for example, it says ~ urchin ~; in my dataset those labels are actually fantasy races. This suggests the model understands that a correct label belongs in that spot, since it follows a consistent pattern and only replaces certain labels.

This is on a custom dataset of image-text pairs that I implemented. It is a bit small, around 2104 images. The labels come from categories that may have sub-options: roughly 20 choices (averaging 4 subcategories each), 3 choices, 13 choices (averaging 7 subcategories each), 8 choices, 8 choices, and 4 choices.

A couple of notes: I am editing the dataset loader to provide text_input for opt, as is normal in the lavis repository, and for flan to provide the prompt as text_input and the caption as text_output. Both models are trained with the ViT cast to bf16, which doesn't seem to have diminished quality. Another oddity is that the opt2.7b model's loss is substantially lower than the flant5 model's for some reason. Please let me know if anyone has ideas on how to fix this, or suggestions of things to try. Thanks for the help!
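Concretely, the two sample formats differ roughly like this; this is just a sketch with placeholder values, not the exact loader code:

import torch

image_tensor = torch.zeros(3, 224, 224)   # stands in for self.vis_processor(image)
prompt = "a photo of"                     # hypothetical prompt/prefix string
caption = "label1, label2, label3"        # the label string from the paired .txt file

# OPT-based BLIP2 models read a single text_input field:
opt_sample = {"image": image_tensor, "text_input": caption}

# Flan-T5-based BLIP2 models read text_input (prompt) and text_output (target):
flant5_sample = {"image": image_tensor, "text_input": prompt, "text_output": caption}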

@shams2023
Copy link


May I ask how you went about fine-tuning BLIP2? I have collected some pedestrian images and hope to use BLIP2's image captioning to obtain text descriptions of them. I would appreciate your clarification.
Thank you!

@Thomas2419
Author

Yes, my apologies, I had seen your other comment and meant to respond to you. I will preface all of this with: I am not a machine learning engineer; I have been a machine learning hobbyist for the past 6 years, so sometimes I know how to get things to work but not why they work or what I may be messing up when I get them to work. I've learned largely by asking questions of people who know how things work, so this feels like quite the role reversal for me.

There were a lot of edits I had to make, so I'll give an overview of what I assume is most applicable to you; without knowing whether you can run distributed training and such, there are a few preliminary steps I'm unsure if you need to take. On my rtx3090, and with certain oddities in the blip2 implementation, I am unable to fully utilize my gpu vram. The two fundamental changes needed to make this work were:

1. Setting the config to bf16, and editing eva_vit.py to cast the vit to bf16 instead of fp16. When I set vit_precision to fp16 I got nan instead of a normal loss.
2. Editing coco_caption to point towards my local dataset. This was admittedly a lazy fix; you could just as easily make a new dataset builder and such.

Some other miscellaneous fixes: I set model_type to use the pretrained weights, not the coco-finetuned ones, and used a 224 image size rather than 364. I'm unsure why that matters, since I can fit a batch size of 6 at 224 but not even one batch at 364, which doesn't make sense pixel-wise: (364^2)/(224^2) is about 2.6, not 6, so a batch size of 2 should in theory be achievable.

There may well be things I forgot; many of the fixes came from reported issues, so if this doesn't work please let me know what the problem is. This also assumes your dataset is a folder containing matching image/text file pairs; edit the dataset loader if that is not the case. The files below were all converted to .txt so they could be attached here. This assumes you specifically want to use blip2; if any image captioning model works for you, the open_clip CoCa model also works very well and was much easier to test.
Here's my edited config:
caption_coco_ft.txt

Here's my edited coco_caption:
caption_datasets.txt

Here's my edited eva_vit.py:
eva_vit.txt
(The only real edit to this file is adding a convert_weights_to_bf16 helper.)
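In case the attachment doesn't open, the bf16 edit amounts to roughly the following helper, modeled on the existing convert_weights_to_fp16 in eva_vit.py (a sketch, not the exact file contents):

import torch
import torch.nn as nn

def convert_weights_to_bf16(model: nn.Module):
    """Cast conv/linear weights (and biases) to bfloat16 in place."""
    def _convert(layer):
        if isinstance(layer, (nn.Conv1d, nn.Conv2d, nn.Linear)):
            layer.weight.data = layer.weight.data.to(torch.bfloat16)
            if layer.bias is not None:
                layer.bias.data = layer.bias.data.to(torch.bfloat16)
    model.apply(_convert)

It can then be called from create_eva_vit_g when vit_precision is set to bf16, mirroring the existing fp16 branch.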

@shams2023

shams2023 commented Mar 28, 2024


Thank you for your reply; it is very timely and important to me. Thank you again!
I have my own dataset (in the form of image-text pairs). The reason I want to use BLIP2 is that I have collected some evening pedestrian images myself that have no text descriptions. I hope the BLIP2 model can generate good text descriptions for these images, thereby expanding my original dataset.
I have tried the BLIP model and made some adjustments to it, but the results did not satisfy me, so I would like to try a larger model such as BLIP2, hoping it can help me achieve satisfactory results.
Thank you again for your reply!
My card is also a 3090 (24G)! I'm lucky it is the same as yours.
My computer only has one 3090, so unfortunately it does not support distributed training!
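For reference, the standard LAVIS captioning path for BLIP2 looks roughly like the sketch below; the image path is a placeholder, and pretrain_flant5xl can be swapped for another model type:

import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load BLIP2 with the Flan-T5 XL language model and its matching preprocessors.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)

raw_image = Image.open("pedestrian.jpg").convert("RGB")   # placeholder image path
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Generate a caption for the image.
print(model.generate({"image": image}))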

@Thomas2419
Author

Thomas2419 commented Mar 28, 2024

@shams2023 I apologize, I realize I gave you an incorrect caption_datasets file for use with flant5xl. The following is what is required, since flant5xl needs both text_input and text_output; it goes at the end of class CaptionDataset(BaseDataset, __DisplMixin): in the caption_datasets.py file:
return {
"image": self.vis_processor(image),
"text_input": (prefix),
"text_output": (caption),
}
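For context, a minimal __getitem__ along those lines might look like the sketch below; the prompt string is illustrative, and the annotation fields follow the stock CaptionDataset:

import os
from PIL import Image

# Inside class CaptionDataset(BaseDataset, __DisplMixin):
def __getitem__(self, index):
    ann = self.annotation[index]

    # Load and preprocess the image as the stock CaptionDataset does.
    image = Image.open(os.path.join(self.vis_root, ann["image"])).convert("RGB")
    image = self.vis_processor(image)

    caption = self.text_processor(ann["caption"])
    prefix = "a photo of"   # hypothetical prompt; use whatever prompt you train with

    # Flan-T5 models expect the prompt in text_input and the target in text_output.
    return {
        "image": image,
        "text_input": prefix,
        "text_output": caption,
    }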

@shams2023


You're right, bro! Thank you for your help!
