[CVPR2024] Do you remember? Dense Video Captioning with Cross-Modal Memory Retrieval

This repository is the official codebase for our paper, accepted to CVPR 2024. Its aim is to help other researchers.

Paper


Meet us at CVPR 2024 and talk with the author, Minkuk Kim!


Introduction

Dense video captioning, which aims to automatically localize and caption all events within an untrimmed video, has received significant research attention. Several studies formulate it as a multitask problem of event localization and event captioning in order to model inter-task relations. However, addressing both tasks using only visual input is challenging due to the lack of semantic content. In this study, we address this by proposing a novel framework called CM2, inspired by human cognitive information processing. Our model uses an external memory to incorporate prior knowledge, and retrieves from that memory with cross-modal video-to-text matching. To effectively incorporate the retrieved text features, we design a versatile encoder and a decoder with visual and textual cross-attention modules. Comparative experiments on the ActivityNet Captions and YouCook2 datasets show the effectiveness of the proposed method. The results show promising performance without extensive pretraining on a large video dataset.
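To make the idea of cross-modal video-to-text retrieval concrete, here is a minimal sketch in plain Python. It is not the code in this repository: the cosine-similarity scoring, the top-k selection, the mean pooling, and every name (`cosine`, `retrieve`, `pool`, the toy vectors) are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def retrieve(query: Sequence[float], bank: List[Sequence[float]], k: int = 2) -> List[int]:
    """Return the indices of the k bank entries most similar to the query."""
    scores = [cosine(query, item) for item in bank]
    ranked = sorted(range(len(bank)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def pool(bank: List[Sequence[float]], indices: List[int]) -> List[float]:
    """Average the retrieved vectors into a single feature (illustrative choice)."""
    dim = len(bank[indices[0]])
    return [sum(bank[i][d] for i in indices) / len(indices) for d in range(dim)]


# Toy data: one 3-d "video segment" feature and four "sentence" features
# that are assumed to live in the same embedding space.
segment = [1.0, 0.0, 0.0]
text_bank = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.0, 0.2],
    [0.0, 0.0, 1.0],
]

top = retrieve(segment, text_bank, k=2)  # indices of the two closest sentences
retrieved = pool(text_bank, top)         # one pooled text feature for the segment
```

In this toy setup the pooled text feature would then be passed to the model alongside the visual feature. How the real model segments videos, scores matches, and aggregates the retrieved features is defined by the paper and the repository code, not by this sketch.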

Preparation

Environment: Linux, Python>=3.8, PyTorch>=1.7.1

Create a virtual environment with conda

conda create -n cm2 python=3.8
source activate cm2
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch
conda install ffmpeg
pip install -r requirement.txt
pip install git+https://github.com/openai/CLIP.git

Compile the deformable attention layer (requires GCC >= 5.4).

cd CM2/ops
sh make.sh
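Once the environment is set up, a quick check like the following can confirm that the Python version and the main packages are visible from the active environment. This is a hypothetical helper, not a script shipped with the repository; the function name `check_environment` and the default module list are assumptions based on the install commands above.

```python
import importlib.util
import sys
from typing import Dict, Iterable, Tuple


def check_environment(
    modules: Iterable[str] = ("torch", "torchvision", "clip"),
    min_python: Tuple[int, int] = (3, 8),
) -> Dict[str, bool]:
    """Report whether the interpreter and each module meet the requirements.

    Modules are only located, not imported, so the check stays fast and
    does not trigger CUDA initialisation.
    """
    report = {"python": sys.version_info[:2] >= min_python}
    for name in modules:
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

The check only tells you that a package can be found. It does not verify package versions or that the compiled deformable attention ops load correctly.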

Prepare resources to run our code.

Data

Download the ActivityNet Captions (anet) CLIP features (GoogleDrive), then place the file at 'CM2/data/anet/features/clipvitl14.pth'.

Download the YouCook2 (yc2) CLIP features (GoogleDrive), then place the file at 'CM2/data/yc2/features/clipvitl14.pth'.

Pre-trained model

Download the pre-trained model for anet (GoogleDrive), then place it at 'CM2/save/anet_clip_cm2_best/model-best.pth'.

Download the pre-trained model for yc2 (GoogleDrive), then place it at 'CM2/save/yc2_clip_cm2_best/model-best.pth'.

Memory Bank

Download the 3 memory files for anet (GoogleDrive), then place them in the 'CM2/bank/anet/clip/' folder.

Download the 3 memory files for yc2 (GoogleDrive), then place them in the 'CM2/bank/yc2/clip/' folder.
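Because the downloads land in several different folders, it can help to verify the layout before launching a run. The sketch below checks the paths listed above relative to the repository root. The helper name `missing_resources` is hypothetical, and since the memory file names are not given here, it only counts the files in each bank folder.

```python
from pathlib import Path
from typing import List, Union

# Paths taken from the Data and Pre-trained model instructions above,
# relative to the repository root (CM2).
REQUIRED_FILES = [
    "data/anet/features/clipvitl14.pth",
    "data/yc2/features/clipvitl14.pth",
    "save/anet_clip_cm2_best/model-best.pth",
    "save/yc2_clip_cm2_best/model-best.pth",
]
# Memory bank folders; each is expected to hold 3 files (names not assumed).
BANK_DIRS = ["bank/anet/clip", "bank/yc2/clip"]
FILES_PER_BANK = 3


def missing_resources(root: Union[str, Path]) -> List[str]:
    """Return a description of every expected resource not found under root."""
    root = Path(root)
    problems = [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
    for rel in BANK_DIRS:
        folder = root / rel
        count = sum(1 for p in folder.iterdir() if p.is_file()) if folder.is_dir() else 0
        if count < FILES_PER_BANK:
            problems.append(f"{rel} (found {count} of {FILES_PER_BANK} files)")
    return problems
```

Calling `missing_resources("CM2")` from the parent directory returns an empty list when everything is in place. If you only work with one dataset, entries for the other dataset can be ignored.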

Training CM2

Training on ActivityNet Captions

cd CM2
sh train_anet.sh

Training on YouCook2

cd CM2
sh train_yc2.sh

Evaluating CM2

Evaluating on ActivityNet Captions

cd CM2
sh eval_anet.sh

Evaluating on YouCook2

cd CM2
sh eval_yc2.sh
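The four scripts above follow a `<stage>_<dataset>.sh` naming pattern, so a small launcher can drive them from Python, for example to run training and evaluation back to back. This is a hypothetical convenience wrapper, not part of the repository. It assumes the scripts are started from the `CM2` directory, as in the commands above.

```python
import subprocess
from typing import List

STAGES = ("train", "eval")
DATASETS = ("anet", "yc2")


def script_command(stage: str, dataset: str) -> List[str]:
    """Build the shell command for one stage and dataset, e.g. sh train_anet.sh."""
    if stage not in STAGES:
        raise ValueError(f"stage must be one of {STAGES}, got {stage!r}")
    if dataset not in DATASETS:
        raise ValueError(f"dataset must be one of {DATASETS}, got {dataset!r}")
    return ["sh", f"{stage}_{dataset}.sh"]


def run(stage: str, dataset: str, repo_dir: str = "CM2") -> subprocess.CompletedProcess:
    """Run the script inside repo_dir and raise if it exits with an error."""
    return subprocess.run(script_command(stage, dataset), cwd=repo_dir, check=True)
```

For example, `run("train", "yc2")` followed by `run("eval", "yc2")` trains and then evaluates on YouCook2, stopping at the first failure.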

Citation

Acknowledgement

The implementation of the Deformable Transformer is mainly based on Deformable DETR. The implementation of the captioning head is based on ImageCaptioning.pytorch. The implementation of the pipeline is mainly based on PDVC. We thank the authors for their efforts.

