VinVL+L: Enriching Visual Representation with Location Context in VQA

Examples

Introduction

We describe a novel method, VinVL+L, that enriches the visual representations (i.e., object tags and region features) of the state-of-the-art Vision and Language (VL) method VinVL with Location information. To verify the importance of such metadata for VL models, we:

  1. trained a Swin-B model on the Places365 dataset and obtained additional sets of visual and tag features; both were made public to allow reproducibility and further experiments,

  2. updated the architecture of the existing VinVL method to include the new feature sets,

  3. provided a qualitative and quantitative evaluation.

By including just binary location metadata, VinVL+L provides an incremental improvement over the state-of-the-art VinVL in Visual Question Answering (VQA). VinVL+L achieves an accuracy of 64.85% on the GQA dataset, increasing performance by +0.32% over VinVL; the statistical significance of the new representations is verified via Approximate Randomization.

Download

Location features

| Swin-Base | Accuracy (C) | Accuracy (IO) | Scene tags | Other links |
| --- | --- | --- | --- | --- |
| Best accuracy | 56.0% | 96.1% | C / IO | model / scene features |
| Best loss / Final epoch | 53.3% | 95.5% | C / IO | model / scene features |

Notes:
#1: C refers to the 365 location categories, and IO to their indoors/outdoors supercategories.
#2: The listed results are on the Places365 validation dataset; the C → IO mapping can be found here.

VinVL+L

| Scene tags | Accuracy | Binary | Open | Links |
| --- | --- | --- | --- | --- |
| C | 64.85% | 82.59% | 49.19% | model / server results |
| C+IO | 64.71% | 82.38% | 49.12% | model / server results |
| IO | 64.65% | 82.44% | 48.94% | model / server results |
| (reproduced VinVL) | 64.53% | 82.36% | 48.79% | model / server results |

Notes:
#1: The listed models do not use scene features. Links to all features are in the previous section.
#2: The listed results are on the GQA test2019 dataset; the official leaderboard can be found here.

Usage

Follow Oscar's instructions for installation and download their data for the GQA dataset (Oscar/VinVL). Additionally, you will need the timm library if you use our Swin-B model, and h5py if you want to use the scene features.

To run the VinVL+L script, follow the original usage (see their GQA section); run python run_gqa_places.py with the same arguments as the original script, and consider adding the following ones (an example invocation is shown after this list):

  • --wandb_entity "name" to log your run to WandB, where "name" is your profile name. Note that you will need to log in via the console during the first run.
  • --places_io_json "path_to_io_json" to use indoors/outdoors (IO) scene tags.
  • --places_c_json "path_to_c_json" to use 365 location categories (C) scene tags.
  • --places_feats_hdf5 "path_to_hdf5" to use scene features.
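
For example, a training run that adds the 365-category (C) scene tags and the scene features might look as follows; the paths and the WandB profile name are placeholders, and the remaining arguments follow the original Oscar/VinVL GQA usage:

python run_gqa_places.py <original Oscar/VinVL GQA arguments> \
    --places_c_json "./places_c_tags.json" \
    --places_feats_hdf5 "./places_scene_features.hdf5" \
    --wandb_entity "my_profile"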

If you want to use the Swin-B model we provide, you can load it with the following code:

# required packages
import timm
import torch
import torch.nn as nn

# predefined constants/arguments
ckpt_path = "./swin_base_model.pth"  # path to the downloaded checkpoint
target_size = 365  # number of location categories

...

# get the timm model
model = timm.create_model("swin_base_patch4_window7_224_in22k", pretrained=True)

# get the name of the classifier layer and its input feature size
last_layer = model.default_cfg['classifier']
num_ftrs = getattr(model, last_layer).in_features

# set the target size of the output layer
setattr(model, last_layer, nn.Linear(num_ftrs, target_size))
    
# load the fine-tuned checkpoint
model.load_state_dict(torch.load(ckpt_path))
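
As a usage sketch, the loaded model can then classify an image into one of the location categories. The image path below is a placeholder, the preprocessing simply follows timm's default configuration for the model, and mapping the predicted index to a category name requires your own copy of the Places365 category list:

# additional packages for inference
from PIL import Image
from timm.data import resolve_data_config, create_transform

# build the preprocessing pipeline from the model's default config
config = resolve_data_config({}, model=model)
transform = create_transform(**config)

# load an image and run a forward pass (placeholder path)
image = Image.open("./example.jpg").convert("RGB")
batch = transform(image).unsqueeze(0)  # shape: (1, 3, 224, 224)

model.eval()
with torch.no_grad():
    probs = model(batch).softmax(dim=-1)

# index of the most probable location category (0-364);
# map it to a tag such as "Living room" with the Places365 category list
category_idx = probs.argmax(dim=-1).item()
print(category_idx, probs[0, category_idx].item())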

To generate scene tags for the Oscar/VinVL models, you need to save them in a JSON file with the following structure:

{
  "<img_id>": "<scene_tag>",
  ...
}

where "<img_id>" is the id of the image (this is the key that even the original Oscar/VinVL scripts follow); for the GQA dataset, the id is an image name without extension. The "<scene_tag>" is the location category - the first letter always begins with a capital letter, e.g., "Living room". You can combine both 365 location categories (C) and indoors/outdoors (IO) by typing them one after the other, e.g., "Hospital room Indoor" (C must be followed by the IO). The same applies to scene features stored in hdf5, only the value will be a 2054-long vector.

Possible issues

  • StopIteration: Caught StopIteration in replica 0 on device 0.

    Fix

    In ./Oscar/oscar/modeling/modeling_bert.py, rewrite line 225 from:

    extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility

    to:

    extended_attention_mask = extended_attention_mask.to(dtype=torch.float32) # fp16 compatibility
    Full exception
    Traceback (most recent call last):
      File "run_gqa_places.py", line 1236, in <module>
        main()
      File "run_gqa_places.py", line 1154, in main
        global_step, tr_loss = train(args, train_dataset, eval_dataset, model, tokenizer)
      File "run_gqa_places.py", line 538, in train
        outputs = model(**inputs)
      File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
        outputs = self.parallel_apply(replicas, inputs, kwargs)
      File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
        return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
      File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
        output.reraise()
      File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/_utils.py", line 457, in reraise
        raise exception
    StopIteration: Caught StopIteration in replica 0 on device 0.
    Original Traceback (most recent call last):
      File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
        output = module(*input, **kwargs)
      File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "./Oscar/oscar/modeling/modeling_bert.py", line 328, in forward
        attention_mask=attention_mask, head_mask=head_mask, img_feats=img_feats)
      File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "./Oscar/oscar/modeling/modeling_bert.py", line 225, in forward
        extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
    StopIteration
    

Citation

Please consider citing the following paper if you use this code or the provided data:

@inproceedings{vyskocil2023VinVL+L,
  title     = {VinVL+L: Enriching Visual Representation with Location Context in VQA},
  author    = {Vysko{\v{c}}il, Ji{\v{r}}{\'\i} and Picek, Luk{\'a}{\v{s}}},
  year      = {2023},
  booktitle = {Computer Vision Winter Workshop},
  series    = {{CEUR} Workshop Proceedings},
  month     = {February 15-17},
  address   = {Krems an der Donau, Austria},
}

License

VinVL+L is released under the MIT license. See LICENSE for details.
