
what should be samples["text_output"] during finetuning #695

abhidipbhattacharyya opened this issue Apr 26, 2024 · 3 comments
abhidipbhattacharyya commented Apr 26, 2024

Hi,
I am trying to fine-tune BLIP-2 with COCO. In the caption_dataset there is a sample["text_input"]. However, the forward method of blip2_t5 expects sample["text_output"]. The prompt is attached to the GT caption by the processor, and the whole string is stored as sample["text_input"]. I was expecting the prompt "a photo of" to be the input text and the GT caption to be the output text during fine-tuning. This is implemented in the coco_caption_instruct dataset, but the example yml file for fine-tuning uses coco_caption. Should I change this?
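To illustrate what I mean (just my understanding of the field names, not copied verbatim from the repo), the two datasets seem to return samples shaped roughly like this:

# Sketch only -- my reading of the two caption datasets, not code from LAVIS.

# coco_caption (CaptionDataset): prompt + GT caption end up in one field.
sample_caption = {
    "image": "<image tensor>",
    "text_input": "a photo of a dog running on the beach",
    "image_id": 42,
}

# coco_caption_instruct: prompt and caption are split, which is what
# blip2_t5.forward expects (it reads samples["text_output"]).
sample_instruct = {
    "image": "<image tensor>",
    "text_input": "a photo of",
    "text_output": "a dog running on the beach",
    "image_id": 42,
}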

This brings up another question. If I need to pretrain (stage 2) with prefix language modeling, which dataset (Python file) should I use? Neither coco_caption nor coco_caption_instruct splits the caption into a prefix and a suffix.

Please advise.

Thanks in advance,
Abhidip


Thomas2419 commented Apr 29, 2024

Hello, from what I've gathered using this repo so far, the FlanT5 models expect both a text_input and a text_output, unlike the OPT BLIP-2 models, which just expect an image input and a text input for training. text_input can even be empty, but it is generally best used as a grounded or task-specific question or prompt for the text generation. So text_output would be the caption/answer, and text_input could be left empty or set to a generic question for every image, such as text_input = "What caption best describes this image?". Also, when using FlanT5, the paper suggests the "Question: {} Short answer:" template for the text input, so in this case it could look something like this:

import os

from PIL import Image

# BaseDataset and __DisplMixin come from the LAVIS caption dataset module.


class CaptionDataset(BaseDataset, __DisplMixin):
    def __init__(self, vis_processor, text_processor, vis_root, ann_paths):
        """
        vis_root (string): Root directory of images (e.g. coco/images/)
        ann_paths (list): Paths to the annotation files
        """
        super().__init__(vis_processor, text_processor, vis_root, ann_paths)

        # Map each unique image_id to a running integer index.
        self.img_ids = {}
        n = 0
        for ann in self.annotation:
            img_id = ann["image_id"]
            if img_id not in self.img_ids:
                self.img_ids[img_id] = n
                n += 1

    def __getitem__(self, index):
        # TODO this assumes image input, not general enough
        ann = self.annotation[index]

        image_path = os.path.join(self.vis_root, ann["image"])
        try:
            image = Image.open(image_path).convert("RGB")
        except OSError:
            return None  # image does not exist or cannot be read

        image = self.vis_processor(image)
        caption = self.text_processor(ann["caption"])

        # Prompt the model with a question and train it to produce the caption.
        query = "Question: What caption best describes this image? Short answer:"

        return {
            "image": image,
            "text_input": query,
            "text_output": caption,
            "image_id": ann["image_id"],
        }

Note that the "Question: ... Short answer:" format is, according to the paper, specifically for VQA tasks; for captioning I'm sure you can use "a photo of" as the text_input, as you stated.
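For captioning, a minimal variation of the __getitem__ above (my own sketch, not code from the repo) might be:

    def __getitem__(self, index):
        ann = self.annotation[index]

        image_path = os.path.join(self.vis_root, ann["image"])
        try:
            image = Image.open(image_path).convert("RGB")
        except OSError:
            return None

        image = self.vis_processor(image)
        caption = self.text_processor(ann["caption"])

        return {
            "image": image,
            "text_input": "a photo of",   # fixed captioning prompt
            "text_output": caption,       # GT caption the model learns to generate
            "image_id": ann["image_id"],
        }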


abhidipbhattacharyya commented Apr 30, 2024

Thank you @Thomas2419. That is exactly what I am doing. I am fine-tuning for caption generation, so my input text is 'a photo of'. Thank you for the detailed code.

One thing I am still confused about: during pre-training with prefix language modeling, how do we prepare the input as suggested in Figure 3 (bottom)? Is there a ratio for how much of the text goes to the encoder and how much goes to the decoder? I could not find any dataset Python file in the repo that implements the behavior shown in the bottom half of Figure 3 of the paper. Any hints/directions would be helpful. Thank you.

@Thomas2419

@abhidipbhattacharyya From what I am seeing, I'm not sure exactly how the authors of the paper or the maintainers of this repository would implement this. I checked some other papers on how they did it, and it seems very situation-dependent. If you are trying to mimic the results of the BLIP-2 paper, I'd suggest either deciding on a percentage of the text to use as the prefix or adding a flat prefix to all text inputs, such as "a photo of" (see the sketch below). Good luck.
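For the ratio-based option, a rough sketch (nothing from the repo or the paper, just how I would imagine preparing one sample) could look like:

import random

def split_caption_for_prefix_lm(caption, prefix_ratio=0.5):
    """Split a caption into a prefix (encoder side / text_input) and a suffix
    (decoder side / text_output) at a word boundary.
    prefix_ratio is my guess, not a value taken from the paper or the repo."""
    words = caption.split()
    cut = max(1, int(len(words) * prefix_ratio))
    prefix = " ".join(words[:cut])
    suffix = " ".join(words[cut:])
    return prefix, suffix

# Usage inside a __getitem__, for example with a randomly sampled ratio:
# prefix, suffix = split_caption_for_prefix_lm(caption, prefix_ratio=random.uniform(0.3, 0.7))
# return {"image": image, "text_input": prefix, "text_output": suffix, "image_id": ann["image_id"]}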
