
Grounded Image Text Matching with Mismatched Relation Reasoning

This repository contains the official Python implementation for the ICCV 2023 paper Grounded Image Text Matching with Mismatched Relation Reasoning.

[project page] [paper] [supp] [preprint] [video]

Abstract

This paper introduces Grounded Image Text Matching with Mismatched Relation (GITM-MR), a novel visual-linguistic joint task that evaluates the relation understanding capabilities of transformer-based pre-trained models. GITM-MR requires a model to first determine if an expression describes an image, then localize referred objects or ground the mismatched parts of the text. We provide a benchmark for evaluating vision-language (VL) models on this task, with a focus on the challenging settings of limited training data and out-of-distribution sentence lengths. Our evaluation demonstrates that pre-trained VL models often lack data efficiency and length generalization ability. To address this, we propose the Relation-sensitive Correspondence Reasoning Network (RCRN), which incorporates relation-aware reasoning via bi-directional message propagation guided by language structure. Our RCRN can be interpreted as a modular program and delivers strong performance in terms of both length generalization and data efficiency.

GITM-MR Benchmark

We appreciate the contribution of the Ref-Reasoning [1] dataset, on which our benchmark is constructed. You can explore our benchmark from the link. The structure and contents of the data directory are as follows:

└─data
    ├─counter        # The correspondence from the original expressions to the mismatched ones.
    ├─expression     # Referring expression annotation files.
    ├─parse          # Parsed language scene graphs.
    ├─small          # Training subset annotations.
    ├─uniter         # UNITER checkpoints and BERT tokenizer.
    ├─vinvl_objects  # Detected boxes and features in h5 format.
    ├─word2token     # Word to UNITER token indices used in representation extraction.

The annotated images are GQA [2] images and can be downloaded from the official website, but our model does not require the original images as input. Feel free to explore them based on your needs.
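
To inspect the detected boxes and features, the h5 files can be opened with h5py. Below is a minimal sketch; the file name features.h5 and the commented key names are hypothetical, so list the actual keys in your download first:

    import h5py

    # Hypothetical file name: substitute the real file from data/vinvl_objects.
    with h5py.File("data/vinvl_objects/features.h5", "r") as f:
        print(list(f.keys()))        # discover the actual dataset names
        # Example reads, assuming per-image box and feature datasets exist:
        # boxes = f["boxes"][0]      # (num_boxes, 4) detected regions
        # feats = f["features"][0]   # (num_boxes, feat_dim) VinVL features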

Prerequisites and Installation

Our implementation is based on the Detectron2 framework. You need to install the required packages and build the local copy of Detectron2 from this repository. The Common Installation Issues section of the Detectron2 installation manual may be helpful for debugging the process.

  1. Prerequisites

    conda create -n gitm python=3.7
    conda activate gitm
    pip install -r requirements.txt
  2. Installation

    python setup.py build develop
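
After the build finishes, a quick import check can confirm that the locally built Detectron2 is the one on your path (a minimal sketch, nothing repository-specific):

    # Verify that Detectron2 imports and resolves to the local build.
    import detectron2

    print(detectron2.__version__)  # version string of the built package
    print(detectron2.__file__)     # should point inside this repository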

Reproduce the RCRN Results

  1. Download the model checkpoints from the link and put them into the ckpt directory.

  2. Download the complete dataset into the data directory. The expected directory structure should be similar to:

    └─GITM-MR
        ├─data
        ├─ckpt
        ├─configs
        ├─detectron2
        ├─scripts
        ├─tools
  3. Run the evaluation process by:

    python tools/train_refdet.py --num-gpus $num_gpu --config-file configs/{RCRN_len16.yaml, RCRN_len11.yaml} --config configs/train-ng-base-1gpu.json --eval-only --resume OUTPUT_DIR $output_dir

    Specify the number of parallel GPUs in $num_gpu and the output directory in $output_dir, and choose either RCRN_len16.yaml or RCRN_len11.yaml as the config file. Refer to the scripts directory for example commands.

  4. If necessary, refer to the detectron2/modeling/refdet_heads/RCRN.py file to explore our model implementation; a conceptual sketch of its message-passing idea follows below.
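
For intuition only, the bi-directional message propagation described in the abstract can be pictured as repeated exchanges of information along the edges of a parsed language graph. The sketch below is a self-contained toy with made-up names and a plain averaging update; it is not the RCRN implementation, which lives in RCRN.py:

    import torch

    def propagate(node_feats, edges, steps=2):
        """Toy bi-directional message passing on a parsed language graph.

        node_feats: (num_nodes, dim) tensor of node representations.
        edges: list of (src, dst) index pairs from the language parse.
        Messages flow along each edge in both directions; this simple
        averaging update stands in for RCRN's learned propagation.
        """
        for _ in range(steps):
            messages = torch.zeros_like(node_feats)
            counts = torch.zeros(node_feats.size(0), 1)
            for src, dst in edges:
                messages[dst] += node_feats[src]  # forward: src -> dst
                counts[dst] += 1
                messages[src] += node_feats[dst]  # backward: dst -> src
                counts[src] += 1
            # average each node's state with its incoming messages
            node_feats = (node_feats + messages) / (counts + 1)
        return node_feats

    # Tiny usage example: a 3-node chain, e.g. "man - holding - cup"
    feats = torch.randn(3, 8)
    updated = propagate(feats, edges=[(0, 1), (1, 2)])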

References

[1] Sibei Yang, Guanbin Li, and Yizhou Yu. Graph-structured referring expression reasoning in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9952–9961, 2020.

[2] Drew A. Hudson and Christopher D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.

Citing GITM-MR

If you find our work useful for your research, please consider citing us:

@InProceedings{Wu_2023_ICCV,
    author    = {Wu, Yu and Wei, Yana and Wang, Haozhe and Liu, Yongfei and Yang, Sibei and He, Xuming},
    title     = {Grounded Image Text Matching with Mismatched Relation Reasoning},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {2976-2987}
}

Contact

Please feel free to contact us at [email protected] or [email protected] if you have further questions or comments.