State-aware Video Procedural Captioning

PyTorch code and dataset for our ACM MM 2021 paper "State-aware Video Procedural Captioning" by Taichi Nishimura, Atsushi Hashimoto, Yoshitaka Ushiku, Hirotaka Kameko, and Shinsuke Mori.

Video procedural captioning (VPC), which generates procedural text from instructional videos, is an essential task for scene understanding and real-world applications. The main challenge of VPC is to describe how to manipulate materials accurately. This paper focuses on this challenge by designing a new VPC task, generating a procedural text from the clip sequence of an instructional video and material list. In this task, the state of materials is sequentially changed by manipulations, yielding their state-aware visual representations (e.g., eggs are transformed into cracked, stirred, then fried forms). The essential difficulty is to convert such visual representations into textual representations; that is, a model should track the material states after manipulations to better associate the cross-modal relations. To achieve this, we propose a novel VPC method, which modifies an existing textual simulator for tracking material states as a visual simulator and incorporates it into a video captioning model. Our experimental results show the effectiveness of the proposed method, which outperforms state-of-the-art video captioning models. We further analyze the learned embedding of materials to demonstrate that the simulators capture their state transition.

Getting started

Prerequisites

Clone this repository

git clone https://github.com/misogil0116/svpc
cd svpc

Prepare feature files

Download features.tar.gz from Google Drive and extract it. The resulting features/ directory stores ResNet + BN-Inception features for each video.

features
├── testing
├── training
├── validation
└── yc2
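After extracting the archive, it can help to confirm that the four subdirectories listed above are present before starting a long training run. The following is a minimal sketch, not part of the repository; the helper name `missing_feature_dirs` is hypothetical, and only the subdirectory names come from the tree above.

```python
from pathlib import Path

# Subdirectory names taken from the directory tree in this README.
EXPECTED_SUBDIRS = ("testing", "training", "validation", "yc2")


def missing_feature_dirs(feature_dir):
    """Return the expected subdirectories that are absent under feature_dir.

    Hypothetical helper for illustration; it checks names only and does not
    inspect the feature files themselves.
    """
    root = Path(feature_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

Calling `missing_feature_dirs("features")` returns an empty list when the layout matches the tree above, and otherwise the names that still need to be extracted.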

Training and Inference

The examples below show how to run training and inference.

Training

The general training command is:

bash scripts/train.sh MODEL_TYPE TEMP_PARAM LAMBDA_PARAM CHECKPOINT_DIR FEATURE_DIR DURATION_PATH

MODEL_TYPE can be one of [vivt, viv, vi, v]; see details below. TEMP_PARAM and LAMBDA_PARAM are the Gumbel-softmax temperature and the lambda parameter, respectively (TEMP_PARAM=0.5 and LAMBDA_PARAM=0.5 work well in our experiments). CHECKPOINT_DIR, FEATURE_DIR, and DURATION_PATH are the checkpoint directory, the feature directory, and the path to the duration CSV file, respectively.

MODEL_TYPE	Description
vivt	+Visual simulator+Textual re-simulator
viv	+Visual simulator
vi	Video+Ingredient
v	Video
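To give some intuition for what a Gumbel-softmax temperature such as TEMP_PARAM controls, here is a generic, standard-library sketch of Gumbel-softmax sampling. It is not the repository's implementation (which this README does not show); the function name `gumbel_softmax_sample` and its arguments are assumptions made for illustration.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng=random):
    """Draw one relaxed one-hot sample from unnormalised log-probabilities.

    Generic illustration only; not the code used in this repository.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    noisy = []
    for logit in logits:
        # Clamp u away from 0 so that log(u) is defined.
        u = max(rng.random(), 1e-12)
        # Gumbel(0, 1) noise: -log(-log(u)) with u ~ Uniform(0, 1).
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]
```

For the same noise draw, a lower temperature pushes the sample closer to a one-hot vector, while a higher temperature spreads the mass more evenly across entries.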

To train the VIVT model:

scripts/train.sh vivt 0.5 0.5 /path/to/model/checkpoint/ /path/to/features/ /path/to/duration_frame.csv
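If you launch several runs from Python, for example one per MODEL_TYPE, the argument order above can be assembled programmatically. This is a minimal sketch under the assumption that the script takes exactly the six positional arguments described in this README; `build_train_command` is a hypothetical helper, not part of the repository.

```python
import shlex

# Model types listed in the MODEL_TYPE table of this README.
MODEL_TYPES = ("vivt", "viv", "vi", "v")


def build_train_command(model_type, temp, lam, checkpoint_dir, feature_dir, duration_path):
    """Return the training command as an argument list (hypothetical helper)."""
    if model_type not in MODEL_TYPES:
        raise ValueError(f"MODEL_TYPE must be one of {MODEL_TYPES}, got {model_type!r}")
    return [
        "bash", "scripts/train.sh",
        model_type, str(temp), str(lam),
        str(checkpoint_dir), str(feature_dir), str(duration_path),
    ]


def format_command(args):
    """Render an argument list as a shell-quoted string for logging."""
    return shlex.join(args)
```

The returned list can be passed to `subprocess.run(args, check=True)` from the repository root; `format_command` is only for printing the command before it runs.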

Evaluate a trained model with word-overlap metrics (BLEU, METEOR, CIDEr-D, and ROUGE-L):

scripts/eval_caption.sh MODEL_TYPE CHECKPOINT_PATH FEATURE_DIR DURATION_PATH

Note that CHECKPOINT_PATH must point to a checkpoint file (.chkpt), not a directory. The generated captions are saved to /path/to/model/checkpoint/MODEL_TYPE_test_greedy_pred_test.json. This file is also the input for the ingredient prediction evaluation.
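The caption file name follows the pattern stated above, so its location can be derived from the model type. The sketch below assumes the file is written into the directory that holds the checkpoint and makes no assumption about the JSON schema; both function names are hypothetical.

```python
import json
from pathlib import Path


def caption_file_path(checkpoint_dir, model_type):
    """Build the expected path of the generated caption file.

    Assumes the file sits in the checkpoint directory, as the example
    path in this README suggests.
    """
    return Path(checkpoint_dir) / f"{model_type}_test_greedy_pred_test.json"


def load_captions(checkpoint_dir, model_type):
    """Load the generated captions without interpreting their structure."""
    path = caption_file_path(checkpoint_dir, model_type)
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

The path returned by `caption_file_path` is what you would pass as CAPTION_PATH to the ingredient evaluation script.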

Evaluate ingredient prediction

scripts/eval_ingredient_f1.sh MODEL_TYPE CAPTION_PATH

The results should be comparable to those reported in Table 4 of the paper.

Dump the learned embedding of ingredients

scripts/dump_embeddings.sh MODEL_TYPE CHECKPOINT_PATH FEATURE_DIR DURATION_PATH

This script generates ./MODEL_TYPE_step_embedding_dict.pkl, which contains the material embeddings at each step.
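To take a first look at the dumped file, you can load it with `pickle` and print a short summary. This README does not document the internal layout of the pickle, so the sketch below only reports the top-level type and, if it is a dict, a few keys; the function names are hypothetical. Unpickling may require the libraries used when the file was written (for example PyTorch or NumPy), and pickle files should only be loaded from sources you trust.

```python
import pickle
from pathlib import Path


def load_step_embeddings(model_type, directory="."):
    """Load MODEL_TYPE_step_embedding_dict.pkl from the given directory."""
    path = Path(directory) / f"{model_type}_step_embedding_dict.pkl"
    with path.open("rb") as handle:
        return pickle.load(handle)


def summarize(obj, max_keys=5):
    """Describe the top-level object without assuming its inner structure."""
    summary = {"type": type(obj).__name__}
    if isinstance(obj, dict):
        summary["num_keys"] = len(obj)
        summary["sample_keys"] = list(obj)[:max_keys]
    return summary
```

For example, `summarize(load_step_embeddings("vivt"))` gives a starting point for exploring the embeddings further.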

Pretrained weights

You can download the pretrained weights from here.

Questions

How do I run the retrieval evaluation?

You can run it by converting the generated caption file (CHECKPOINT_PATH) into the CSV format that MIL-NCE requires. See here for additional information.

How can I access the annotated ingredients?

You can access them here. The annotated ingredients are stored in the JSON files (see the ingredients keys).

Citation

If you use this code for your research, please cite our paper:

@inproceedings{taichi2021acmmm,
  title={State-aware Video Procedural Captioning},
  author={Taichi Nishimura and Atsushi Hashimoto and Yoshitaka Ushiku and Hirotaka Kameko and Shinsuke Mori},
  booktitle={ACMMM},
  pages={1766--1774},
  year={2021}
}

Code base

This code is based on MART.

Contact

taichitary [at] gmail.com.
