We describe a novel method - VinVL+L - that enriches the visual representations (i.e., object tags and region features) of the State-of-the-Art Vision and Language (VL) method VinVL with Location information. To verify the importance of such metadata for VL models, we:
- trained a Swin-B model on the Places365 dataset and obtained additional sets of visual and tag features; both were made public to allow reproducibility and further experiments,
- updated the architecture of the existing VinVL method to include the new feature sets,
- provide a qualitative and quantitative evaluation.
By including just binary location metadata, the VinVL+L method provides an incremental improvement over the State-of-the-Art VinVL in Visual Question Answering (VQA). VinVL+L achieved an accuracy of 64.85% on the GQA dataset, improving over VinVL by +0.32% in terms of accuracy; the statistical significance of the new representations is verified via Approximate Randomization.
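For intuition, the significance test can be sketched as a paired approximate randomization over per-question correctness. This is a minimal illustration only, not the authors' implementation; the function name and parameters are our own:

```python
import random

def approximate_randomization(correct_a, correct_b, rounds=10000, seed=0):
    """Paired approximate randomization test on per-question correctness.

    correct_a / correct_b: lists of 0/1 outcomes for two systems on the
    same questions. Returns an estimated p-value for the observed
    difference in the number of correct answers.
    """
    rng = random.Random(seed)
    observed = abs(sum(correct_a) - sum(correct_b))
    at_least_as_extreme = 0
    for _ in range(rounds):
        diff = 0
        for a, b in zip(correct_a, correct_b):
            if rng.random() < 0.5:  # randomly swap the paired outcomes
                a, b = b, a
            diff += a - b
        if abs(diff) >= observed:
            at_least_as_extreme += 1
    # add-one smoothing keeps the p-value strictly positive
    return (at_least_as_extreme + 1) / (rounds + 1)
```

If the two systems give identical outcomes, the observed difference is zero and the test returns a p-value of 1.0, as expected.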
| Swin-Base | Accuracy (C) | Accuracy (IO) | Scene tags | Other links |
|---|---|---|---|---|
| Best accuracy | 56.0% | 96.1% | C \| IO | model \| scene features |
| Best loss / Final epoch | 53.3% | 95.5% | C \| IO | model \| scene features |
Notes:
#1: C refers to the 365 location categories, and IO to their indoors/outdoors supercategories.
#2: The listed results are on the Places365 validation dataset; the C → IO mapping can be found here.
| Scene tags | Accuracy | Binary | Open | Links |
|---|---|---|---|---|
| C | 64.85% | 82.59% | 49.19% | model \| server results |
| C+IO | 64.71% | 82.38% | 49.12% | model \| server results |
| IO | 64.65% | 82.44% | 48.94% | model \| server results |
| — (reproduced VinVL) | 64.53% | 82.36% | 48.79% | model \| server results |
Notes:
#1: The listed models do not use scene features. Links to all features are in the previous section.
#2: The listed results are on the GQA test2019 dataset; the official leaderboard can be found here.
Follow Oscar's instructions for installation and download their data for the GQA dataset (Oscar/VinVL). Additionally, you will need the Timm library if you use our Swin-B model, and h5py if you want to use the scene features.
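Assuming a pip-based environment, the two extra dependencies can be installed as:

```shell
pip install timm h5py
```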
To run the VinVL+L script, you can follow the original usage (see section GQA); type `python run_gqa_places.py`, then continue with the same arguments and consider using the following additional ones:
- `--wandb_entity "name"` in case you want to log your run in WandB, where `"name"` refers to your profile name. Do not forget that you will need to log in via the console during the first run.
- `--places_io_json "path_to_io_json"` to use indoors/outdoors (IO) scene tags.
- `--places_c_json "path_to_c_json"` to use the 365 location categories (C) scene tags.
- `--places_feats_hdf5 "path_to_hdf5"` to use scene features.
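An illustrative invocation combining these flags might look as follows; the paths are placeholders, and the remaining (elided) arguments follow the original Oscar/VinVL GQA usage:

```shell
python run_gqa_places.py \
    --places_c_json "path_to_c_json" \
    --places_feats_hdf5 "path_to_hdf5" \
    --wandb_entity "name"
```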
If you want to use the Swin-B model provided by us, you can use the following code:

```python
# required packages
import timm
import torch
import torch.nn as nn

# predefined constants/arguments
ckpt_path = "./swin_base_model.pth"  # path to the checkpoint
target_size = 365                    # number of location categories
...

# get the timm model
model = timm.create_model("swin_base_patch4_window7_224_in22k", pretrained=True)

# get the name of the output layer and the size of its input
last_layer = model.default_cfg["classifier"]
num_ftrs = getattr(model, last_layer).in_features

# replace the output layer so it matches the target size
setattr(model, last_layer, nn.Linear(num_ftrs, target_size))

# load the checkpoint
model.load_state_dict(torch.load(ckpt_path))
```
To generate scene tags for Oscar/VinVL models, you need to save them in a JSON of the following form:

```
{
    "<img_id>": "<scene_tag>",
    ...
}
```

where `"<img_id>"` is the id of the image (this is the key that the original Oscar/VinVL scripts also follow); for the GQA dataset, the id is the image name without extension. The `"<scene_tag>"` is the location category; it always begins with a capital letter, e.g., "Living room". You can combine the 365 location categories (C) and indoors/outdoors (IO) by typing them one after the other, e.g., "Hospital room Indoor" (the C tag must be followed by the IO tag). The same applies to scene features stored in HDF5, only the value is a 2054-long vector.
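Building such a JSON can be sketched as follows; the image ids and predicted categories below are illustrative stand-ins (in practice the predictions come from the Swin-B model above):

```python
import json

# illustrative predictions: image id -> (C category, IO supercategory)
predictions = {
    "2370799": ("Hospital room", "Indoor"),
    "2368326": ("Street", "Outdoor"),
}

# C-only scene tags
c_tags = {img_id: c for img_id, (c, io) in predictions.items()}

# combined C+IO scene tags: the C category is followed by the IO supercategory
c_io_tags = {img_id: f"{c} {io}" for img_id, (c, io) in predictions.items()}

with open("scene_tags_c.json", "w") as f:
    json.dump(c_tags, f)
with open("scene_tags_c_io.json", "w") as f:
    json.dump(c_io_tags, f)
```

The scene-feature HDF5 file follows the same keying scheme, with each value being the 2054-long feature vector instead of a tag string.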
- `StopIteration: Caught StopIteration in replica 0 on device 0`

  In recent PyTorch versions, the model replicas created by `nn.DataParallel` no longer expose their parameters, so `next(self.parameters())` raises `StopIteration`. Fix: in `./Oscar/oscar/modeling/modeling_bert.py`, rewrite line 225 from:

  ```python
  extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
  ```

  to:

  ```python
  extended_attention_mask = extended_attention_mask.to(dtype=torch.float32) # fp16 compatibility
  ```

  Full exception:

  ```
  Traceback (most recent call last):
    File "run_gqa_places.py", line 1236, in <module>
      main()
    File "run_gqa_places.py", line 1154, in main
      global_step, tr_loss = train(args, train_dataset, eval_dataset, model, tokenizer)
    File "run_gqa_places.py", line 538, in train
      outputs = model(**inputs)
    File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
      return forward_call(*input, **kwargs)
    File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
      outputs = self.parallel_apply(replicas, inputs, kwargs)
    File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
      return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
    File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
      output.reraise()
    File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/_utils.py", line 457, in reraise
      raise exception
  StopIteration: Caught StopIteration in replica 0 on device 0.

  Original Traceback (most recent call last):
    File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
      output = module(*input, **kwargs)
    File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
      return forward_call(*input, **kwargs)
    File "./Oscar/oscar/modeling/modeling_bert.py", line 328, in forward
      attention_mask=attention_mask, head_mask=head_mask, img_feats=img_feats)
    File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
      return forward_call(*input, **kwargs)
    File "./Oscar/oscar/modeling/modeling_bert.py", line 225, in forward
      extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
  StopIteration
  ```
Please consider citing the following paper if you use this code or the provided data:

```bibtex
@inproceedings{vyskocil2023VinVL+L,
    title     = {VinVL+L: Enriching Visual Representation with Location Context in VQA},
    author    = {Vysko{\v{c}}il, Ji{\v{r}}{\'\i} and Picek, Luk{\'a}{\v{s}}},
    year      = {2023},
    booktitle = {Computer Vision Winter Workshop},
    series    = {{CEUR} Workshop Proceedings},
    month     = {February 15-17},
    address   = {Krems an der Donau, Austria},
}
```
VinVL+L is released under the MIT license. See LICENSE for details.