
what should be samples["text_output"] during finetuning #695

abhidipbhattacharyya opened this issue Apr 26, 2024 · 3 comments
abhidipbhattacharyya commented Apr 26, 2024

Hi,
I am trying to fine-tune BLIP-2 with COCO. In the caption_dataset there is a sample["text_input"]. However, the forward method of blip2_t5 expects sample["text_output"]. The prompt is attached to the GT caption by the processor, and the whole string is stored as sample["text_input"]. I was expecting the prompt "a photo of" to be the input text and the GT caption to be the output text during fine-tuning. This is implemented in the coco_caption_instruct dataset, but the example yml file for fine-tuning uses coco_caption. Should I change this?
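To illustrate what I mean (just my understanding of the field names, not copied verbatim from the repo), the two datasets seem to return samples shaped roughly like this:

# Sketch only -- my reading of the two caption datasets, not code from LAVIS.

# coco_caption (CaptionDataset): prompt + GT caption end up in one field.
sample_caption = {
    "image": "<image tensor>",
    "text_input": "a photo of a dog running on the beach",
    "image_id": 42,
}

# coco_caption_instruct: prompt and caption are split, which is what
# blip2_t5.forward expects (it reads samples["text_output"]).
sample_instruct = {
    "image": "<image tensor>",
    "text_input": "a photo of",
    "text_output": "a dog running on the beach",
    "image_id": 42,
}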

This brings up another question. If I need to pretrain (stage 2) with prefix language modeling, which dataset (Python file) should I use? Neither coco_caption nor coco_caption_instruct splits the caption into a prefix and a suffix.

Please advise.

Thanks in advance,
Abhidip


Thomas2419 commented Apr 29, 2024

Hello, from what I've gathered using this repo so far, the FlanT5 models expect both a text_input and a text_output, unlike the OPT BLIP-2 models, which just expect an image input and a text input for training. text_input can even be empty, but it is generally best used as a grounded or task-specific question or prompt for the text generation. So text_output would be the caption/answer, and text_input could be left empty or set to a generic question for every image, such as text_input = "What caption best describes this image?". Also, when using FlanT5, the paper suggests the "Question: {} Short answer:" template for the text input, so in this case it could look something like this:

import os

from PIL import Image

# BaseDataset and __DisplMixin come from the LAVIS caption dataset module.


class CaptionDataset(BaseDataset, __DisplMixin):
    def __init__(self, vis_processor, text_processor, vis_root, ann_paths):
        """
        vis_root (string): Root directory of images (e.g. coco/images/)
        ann_paths (list): Paths to the annotation files
        """
        super().__init__(vis_processor, text_processor, vis_root, ann_paths)

        # Map each unique image_id to a running integer index.
        self.img_ids = {}
        n = 0
        for ann in self.annotation:
            img_id = ann["image_id"]
            if img_id not in self.img_ids:
                self.img_ids[img_id] = n
                n += 1

    def __getitem__(self, index):
        # TODO this assumes image input, not general enough
        ann = self.annotation[index]

        image_path = os.path.join(self.vis_root, ann["image"])
        try:
            image = Image.open(image_path).convert("RGB")
        except OSError:
            return None  # image does not exist or cannot be read

        image = self.vis_processor(image)
        caption = self.text_processor(ann["caption"])

        # Prompt the model with a question and train it to produce the caption.
        query = "Question: What caption best describes this image? Short answer:"

        return {
            "image": image,
            "text_input": query,
            "text_output": caption,
            "image_id": ann["image_id"],
        }

Note that the "Question: ... Short answer:" format is, according to the paper, specifically for VQA tasks; for captioning I'm sure you can use "a photo of" as the text_input, as you stated.
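For captioning, a minimal variation of the __getitem__ above (my own sketch, not code from the repo) might be:

    def __getitem__(self, index):
        ann = self.annotation[index]

        image_path = os.path.join(self.vis_root, ann["image"])
        try:
            image = Image.open(image_path).convert("RGB")
        except OSError:
            return None

        image = self.vis_processor(image)
        caption = self.text_processor(ann["caption"])

        return {
            "image": image,
            "text_input": "a photo of",   # fixed captioning prompt
            "text_output": caption,       # GT caption the model learns to generate
            "image_id": ann["image_id"],
        }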


abhidipbhattacharyya commented Apr 30, 2024

Thank you @Thomas2419. That is exactly what I am doing. I am fine-tuning for caption generation, so my input text is 'a photo of'. Thank you for the detailed code.

One thing I am still confused about: during pre-training with prefix language modeling, how do we prepare the input as suggested in Figure 3 (bottom)? Is there a ratio for how much of the text goes to the encoder and how much goes to the decoder? I could not find any dataset Python file in the repo that implements the behavior shown in the bottom half of Figure 3 of the paper. Any hints/directions would be helpful. Thank you.

@Thomas2419

@abhidipbhattacharyya From what I am seeing, I'm not sure exactly how the authors of the paper or the maintainers of this repository would implement this. I checked some other papers on how they did it, and it seems very situation-dependent. If you are trying to mimic the results of the BLIP-2 paper, I'd suggest either deciding on a percentage of the text to use as the prefix or adding a flat prefix to all text inputs, such as "a photo of" (see the sketch below). Good luck.
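For the ratio-based option, a rough sketch (nothing from the repo or the paper, just how I would imagine preparing one sample) could look like:

import random

def split_caption_for_prefix_lm(caption, prefix_ratio=0.5):
    """Split a caption into a prefix (encoder side / text_input) and a suffix
    (decoder side / text_output) at a word boundary.
    prefix_ratio is my guess, not a value taken from the paper or the repo."""
    words = caption.split()
    cut = max(1, int(len(words) * prefix_ratio))
    prefix = " ".join(words[:cut])
    suffix = " ".join(words[cut:])
    return prefix, suffix

# Usage inside a __getitem__, for example with a randomly sampled ratio:
# prefix, suffix = split_caption_for_prefix_lm(caption, prefix_ratio=random.uniform(0.3, 0.7))
# return {"image": image, "text_input": prefix, "text_output": suffix, "image_id": ann["image_id"]}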
