tsujuifu/pytorch_violet


[2023/03/09 Update] VIOLETv2

We have released our empirical study of masked visual modeling for VidL learning as VIOLETv2.

VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling

A PyTorch implementation of VIOLET

Overview

VIOLET is an implementation of
"VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling"
by Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu

VIOLET contains 3 components: the Video Swin Transformer (VT) computes video features, the Language Embedder (LE) extracts word embeddings, and the Cross-modal Transformer (CT) performs cross-modal fusion. To benefit from large-scale data, we incorporate 3 pretraining tasks: Masked Language Modeling (MLM) predicts the masked word tokens, Masked Visual-token Modeling (MVM) recovers the masked video patches, and Visual-Text Matching (VTM) learns the alignment between the video and text modalities.
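
For intuition, a minimal sketch of how the three components compose is shown below; the module names and signatures are illustrative placeholders, not the repository's actual classes.

import torch
import torch.nn as nn

class VioletSketch(nn.Module):
    # video_encoder, word_embedder, and cross_encoder stand in for the repo's
    # VT / LE / CT modules; their exact interfaces here are assumptions.
    def __init__(self, video_encoder: nn.Module, word_embedder: nn.Module, cross_encoder: nn.Module):
        super().__init__()
        self.vt = video_encoder   # VT: sparse-sampled frames -> video patch features (B, Nv, D)
        self.le = word_embedder   # LE: token ids -> word embeddings (B, L, D)
        self.ct = cross_encoder   # CT: joint transformer over the concatenated sequence

    def forward(self, frames: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        v = self.vt(frames)                      # video features
        w = self.le(token_ids)                   # word embeddings
        x = self.ct(torch.cat([v, w], dim=1))    # cross-modal fusion
        return x                                 # MLM / MVM / VTM heads operate on x during pretraining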

Requirements

This code is implemented under Python 3.8, PyTorch 1.7, and Torchvision 0.8.

Usage

Data preprocessing

Since the datasets are external and cannot be redistributed by us, we provide preprocessing tools that extract sparse-sampled video frames into our compressed format.

cd _tools

# We use 4 frames during pretraining and 5 frames for downstream tasks
python extract_video-frame.py --path=msrvtt --sample=5 # output: msrvtt.pkl

# We use DALL-E to extract VQ tokens for MVM pretraining (a rough extraction sketch appears later in this section)
wget https://cdn.openai.com/dall-e/encoder.pkl # download trained dall-e encoder
python extract_vq.py --path=msrvtt --frame=224 # output: msrvtt_vq.pkl

# We adopt file.seek() instead of loading the entire file to reduce memory cost during distributed pretraining (see the reading sketch below)
python extract_tsv.py --path=msrvtt # output: msrvtt.tsv, msrvtt.lineidx
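
For reference, reading one sample back through the .lineidx offsets might look like the sketch below; the per-line layout (assumed here to be tab-separated fields) is an assumption, so check extract_tsv.py for the exact schema.

class TSVReader:
    # Each entry in the .lineidx file is the byte offset of the corresponding
    # line in the .tsv file, so a single sample can be fetched with seek()
    # instead of loading the whole file into memory.
    def __init__(self, tsv_path, lineidx_path):
        with open(lineidx_path) as f:
            self.offsets = [int(line.strip()) for line in f]
        self.tsv = open(tsv_path, 'rb')

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        self.tsv.seek(self.offsets[idx])                     # jump directly to the idx-th line
        return self.tsv.readline().decode('utf-8').rstrip('\n').split('\t')

reader = TSVReader('msrvtt.tsv', 'msrvtt.lineidx')
print(len(reader), reader[0][0])                             # number of samples, first field of sample 0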

We provide partial examples (WebVid2.5M, CC3M, TGIF-Action, MSVD-QA, and MSRVTT-Retrieval) to illustrate how to formulate the input data.
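
As for the extract_vq.py step above, the core of VQ-token extraction is running each resized frame through the downloaded DALL-E encoder; a rough sketch is below (the exact resizing and normalization in the script may differ, and 'frame.jpg' is a placeholder path).

import torch
import torchvision.transforms.functional as TF
from PIL import Image
from dall_e import load_model, map_pixels

dev = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
enc = load_model('encoder.pkl', dev)            # the encoder downloaded above
enc.eval()

img = Image.open('frame.jpg').convert('RGB')    # one sparse-sampled frame
x = TF.to_tensor(TF.resize(img, [224, 224])).unsqueeze(0).to(dev)
with torch.no_grad():
    z_logits = enc(map_pixels(x))               # (1, 8192, 28, 28) code logits
    vq_tokens = torch.argmax(z_logits, dim=1)   # (1, 28, 28) discrete visual tokens used as MVM targets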

Pretraining

Put the pretrained VT weights in ./_snapshot. The following script pretrains on both video (WebVid2.5M) and image (CC3M) data via single-node multi-GPU distributed training.

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=7122 main_pretrain.py
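
Under the hood, torch.distributed.launch spawns one process per GPU and passes --local_rank to each; a hedged sketch of the corresponding setup (the actual initialization in main_pretrain.py may differ) is:

import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Must be run under the launch command above; the launcher sets the
# MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE environment variables.
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)    # filled in by torch.distributed.launch
args = parser.parse_args()

dist.init_process_group(backend='nccl')                     # env:// initialization from the launcher
torch.cuda.set_device(args.local_rank)

model = torch.nn.Linear(8, 8).cuda()                        # stand-in for the VIOLET model
model = DDP(model, device_ids=[args.local_rank])            # gradients are synchronized across the 4 GPUs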

We provide the datasets used for pretraining and the best pretrained checkpoint (YT180M+WebVid2.5M+CC3M).

Downstream

CUDA_VISIBLE_DEVICES='0,1,2,3' python main_qamc.py _data/args_tgif-action.json
CUDA_VISIBLE_DEVICES='0,1,2,3' python main_qaoe.py _data/args_msvd-qa.json
CUDA_VISIBLE_DEVICES='0,1,2,3' python main_retrieval.py _data/args_msrvtt-retrieval.json
CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval.py _data/args_msrvtt-retrieval.json
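
Retrieval evaluation reports text-to-video Recall@K; a hedged sketch of the standard metric computation (eval_retrieval.py's exact implementation may differ) is:

import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    # sim[i, j] = similarity between text query i and video j; ground truth is the diagonal.
    ranks = np.argsort(-sim, axis=1)                                     # videos sorted by score per query
    gt_rank = np.array([np.where(ranks[i] == i)[0][0] for i in range(sim.shape[0])])
    return {f'R@{k}': float(np.mean(gt_rank < k)) for k in ks}

print(recall_at_k(np.random.rand(5, 5)))                                 # toy 5x5 similarity matrix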

We also provide all downstream datasets and trained checkpoints.

Citation

@inproceedings{fu2023empirical-mvm, 
  author = {Tsu-Jui Fu* and Linjie Li* and Zhe Gan and Kevin Lin and William Yang Wang and Lijuan Wang and Zicheng Liu}, 
  title = {{An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling}}, 
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)}, 
  year = {2023} 
}
@inproceedings{fu2021violet, 
  author = {Tsu-Jui Fu and Linjie Li and Zhe Gan and Kevin Lin and William Yang Wang and Lijuan Wang and Zicheng Liu}, 
  title = {{VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling}}, 
  booktitle = {arXiv:2111.12681}, 
  year = {2021} 
}