We describe a novel method - VinVL+L - that enriches the visual representations (i.e., object tags and region features) of the State-of-the-Art Vision and Language (VL) method VinVL with Location information. To verify the importance of such metadata for VL models, we:
- trained a Swin-B model on the Places365 dataset and obtained additional sets of visual and tag features; both were made public to allow reproducibility and further experiments,
- updated the architecture of the existing VinVL method to include the new feature sets,
- provide a qualitative and quantitative evaluation.
By including just binary location metadata, the VinVL+L method provides an incremental improvement over the State-of-the-Art VinVL in Visual Question Answering (VQA). VinVL+L achieved an accuracy of 64.85% on the GQA dataset, improving over VinVL by +0.32% in terms of accuracy; the statistical significance of the new representations is verified via Approximate Randomization.
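For intuition, the significance test can be sketched as a paired approximate randomization over per-question correctness. This is a minimal illustration only, not the authors' implementation; the function name and parameters are our own:

```python
import random

def approximate_randomization(correct_a, correct_b, rounds=10000, seed=0):
    """Paired approximate randomization test on per-question correctness.

    correct_a / correct_b: lists of 0/1 outcomes for two systems on the
    same questions. Returns an estimated p-value for the observed
    difference in the number of correct answers.
    """
    rng = random.Random(seed)
    observed = abs(sum(correct_a) - sum(correct_b))
    at_least_as_extreme = 0
    for _ in range(rounds):
        diff = 0
        for a, b in zip(correct_a, correct_b):
            if rng.random() < 0.5:  # randomly swap the paired outcomes
                a, b = b, a
            diff += a - b
        if abs(diff) >= observed:
            at_least_as_extreme += 1
    # add-one smoothing keeps the p-value strictly positive
    return (at_least_as_extreme + 1) / (rounds + 1)
```

If the two systems give identical outcomes, the observed difference is zero and the test returns a p-value of 1.0, as expected.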
| Swin-Base | Accuracy (C) | Accuracy (IO) | Scene tags | Other links |
|---|---|---|---|---|
| Best accuracy | 56.0% | 96.1% | C \| IO | model \| scene features |
| Best loss / Final epoch | 53.3% | 95.5% | C \| IO | model \| scene features |
Notes:
#1: C refers to the 365 location categories, and IO to their indoors/outdoors supercategories.
#2: The listed results are on the Places365 validation dataset; the C → IO mapping can be found here.
| Scene tags | Accuracy | Binary | Open | Links |
|---|---|---|---|---|
| C | 64.85% | 82.59% | 49.19% | model \| server results |
| C+IO | 64.71% | 82.38% | 49.12% | model \| server results |
| IO | 64.65% | 82.44% | 48.94% | model \| server results |
| — (reproduced VinVL) | 64.53% | 82.36% | 48.79% | model \| server results |
Notes:
#1: The listed models do not use scene features. Links to all features are in the previous section.
#2: The listed results are on the GQA test2019 dataset; the official leaderboard can be found here.
Follow Oscar's instructions for installation and download their data for the GQA dataset (Oscar/VinVL). Additionally, you will need the Timm library if you use our Swin-B model, and h5py if you want to use the scene features.
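Assuming a pip-based environment, the two extra dependencies can be installed as:

```shell
pip install timm h5py
```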
To run the VinVL+L script, you can follow the original usage (see section GQA); type `python run_gqa_places.py`, then continue with the same arguments and consider using the following additional ones:
- `--wandb_entity "name"` in case you want to log your run in WandB, where `"name"` refers to your profile name. Do not forget that you will need to log in via the console during the first run.
- `--places_io_json "path_to_io_json"` to use indoors/outdoors (IO) scene tags.
- `--places_c_json "path_to_c_json"` to use the 365 location categories (C) scene tags.
- `--places_feats_hdf5 "path_to_hdf5"` to use scene features.
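An illustrative invocation combining these flags might look as follows; the paths are placeholders, and the remaining (elided) arguments follow the original Oscar/VinVL GQA usage:

```shell
python run_gqa_places.py \
    --places_c_json "path_to_c_json" \
    --places_feats_hdf5 "path_to_hdf5" \
    --wandb_entity "name"
```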
If you want to use the Swin-B model provided by us, you can use the following code:

```python
# required packages
import timm
import torch
import torch.nn as nn

# predefined constants/arguments
ckpt_path = "./swin_base_model.pth"  # path to the checkpoint
target_size = 365                    # number of location categories
...

# get the timm model
model = timm.create_model("swin_base_patch4_window7_224_in22k", pretrained=True)

# get the name of the output layer and the size of its input
last_layer = model.default_cfg["classifier"]
num_ftrs = getattr(model, last_layer).in_features

# replace the output layer so it matches the target size
setattr(model, last_layer, nn.Linear(num_ftrs, target_size))

# load the checkpoint
model.load_state_dict(torch.load(ckpt_path))
```
To generate scene tags for Oscar/VinVL models, you need to save them in a JSON of the following form:

```
{
    "<img_id>": "<scene_tag>",
    ...
}
```

where `"<img_id>"` is the id of the image (this is the key that the original Oscar/VinVL scripts also follow); for the GQA dataset, the id is the image name without extension. The `"<scene_tag>"` is the location category; it always begins with a capital letter, e.g., "Living room". You can combine the 365 location categories (C) and indoors/outdoors (IO) by typing them one after the other, e.g., "Hospital room Indoor" (the C tag must be followed by the IO tag). The same applies to scene features stored in HDF5, only the value is a 2054-long vector.
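Building such a JSON can be sketched as follows; the image ids and predicted categories below are illustrative stand-ins (in practice the predictions come from the Swin-B model above):

```python
import json

# illustrative predictions: image id -> (C category, IO supercategory)
predictions = {
    "2370799": ("Hospital room", "Indoor"),
    "2368326": ("Street", "Outdoor"),
}

# C-only scene tags
c_tags = {img_id: c for img_id, (c, io) in predictions.items()}

# combined C+IO scene tags: the C category is followed by the IO supercategory
c_io_tags = {img_id: f"{c} {io}" for img_id, (c, io) in predictions.items()}

with open("scene_tags_c.json", "w") as f:
    json.dump(c_tags, f)
with open("scene_tags_c_io.json", "w") as f:
    json.dump(c_io_tags, f)
```

The scene-feature HDF5 file follows the same keying scheme, with each value being the 2054-long feature vector instead of a tag string.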
- `StopIteration: Caught StopIteration in replica 0 on device 0`

  In recent PyTorch versions, the model replicas created by `nn.DataParallel` no longer expose their parameters, so `next(self.parameters())` raises `StopIteration`. Fix: in `./Oscar/oscar/modeling/modeling_bert.py`, rewrite line 225 from:

  ```python
  extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
  ```

  to:

  ```python
  extended_attention_mask = extended_attention_mask.to(dtype=torch.float32) # fp16 compatibility
  ```

  Full exception:

  ```
  Traceback (most recent call last):
    File "run_gqa_places.py", line 1236, in <module>
      main()
    File "run_gqa_places.py", line 1154, in main
      global_step, tr_loss = train(args, train_dataset, eval_dataset, model, tokenizer)
    File "run_gqa_places.py", line 538, in train
      outputs = model(**inputs)
    File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
      return forward_call(*input, **kwargs)
    File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
      outputs = self.parallel_apply(replicas, inputs, kwargs)
    File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
      return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
    File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
      output.reraise()
    File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/_utils.py", line 457, in reraise
      raise exception
  StopIteration: Caught StopIteration in replica 0 on device 0.

  Original Traceback (most recent call last):
    File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
      output = module(*input, **kwargs)
    File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
      return forward_call(*input, **kwargs)
    File "./Oscar/oscar/modeling/modeling_bert.py", line 328, in forward
      attention_mask=attention_mask, head_mask=head_mask, img_feats=img_feats)
    File "/storage/brno2/home/vyskocj/.conda/envs/VinVL-g/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
      return forward_call(*input, **kwargs)
    File "./Oscar/oscar/modeling/modeling_bert.py", line 225, in forward
      extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
  StopIteration
  ```
Please consider citing the following paper if you use this code or the provided data:

```bibtex
@inproceedings{vyskocil2023VinVL+L,
    title     = {VinVL+L: Enriching Visual Representation with Location Context in VQA},
    author    = {Vysko{\v{c}}il, Ji{\v{r}}{\'\i} and Picek, Luk{\'a}{\v{s}}},
    year      = {2023},
    booktitle = {Computer Vision Winter Workshop},
    series    = {{CEUR} Workshop Proceedings},
    month     = {February 15-17},
    address   = {Krems an der Donau, Austria},
}
```
VinVL+L is released under the MIT license. See LICENSE for details.