Inductive and Transductive Few-Shot Video Classification via Appearance and Temporal Alignments

by Khoi D. Nguyen, Quoc-Huy Tran, Khoi Nguyen, Binh-Son Hua, and Rang Nguyen

We present a novel method for few-shot video classification that performs appearance and temporal alignments. In particular, given a pair of query and support videos, we conduct appearance alignment via frame-level feature matching to obtain the appearance similarity score between the videos, and we use temporal order-preserving priors to obtain the temporal similarity score between the videos. Moreover, we leverage these appearance and temporal similarity scores in prototype refinement for both inductive and transductive settings. To the best of our knowledge, our work is the first to explore transductive few-shot video classification.

Details of our evaluation framework and benchmark results can be found in our paper:

@inproceedings{khoi2022ata,
    title={Inductive and Transductive Few-Shot Video Classification via Appearance and Temporal Alignments},
    author={Khoi D. Nguyen and Quoc-Huy Tran and Khoi Nguyen and Binh-Son Hua and Rang Nguyen},
    booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
    year={2022}
}

Please CITE our paper when this repository is used to help produce published results or is incorporated into other software.

Prerequisites

The code is built with the following libraries:

python 3.6 or higher
PyTorch 1.0 or higher
torchvision 0.2 or higher
TensorboardX
tqdm
scikit-learn
opencv-python 4.1 or higher
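
A quick way to see which of the libraries above are not yet installed is sketched below. The helper name `missing_modules` is hypothetical and not part of this repository; the import names are the usual ones for these packages (scikit-learn is imported as `sklearn`, opencv-python as `cv2`). The sketch only checks that a module can be found, not its version.

```python
import importlib.util
from typing import Iterable, List

# Import names for the libraries listed above. Note that scikit-learn is
# imported as "sklearn" and opencv-python as "cv2".
REQUIRED = ["torch", "torchvision", "tensorboardX", "tqdm", "sklearn", "cv2"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing:", missing_modules(REQUIRED))
```

An empty list means every library was found.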

Data Preparation

Check this for details on downloading Something-Something V2.

For data preprocessing, we use vidtools, as in TAM, to extract video frames.

The processing of video data can be summarized as follows:

Extract frames from videos.

First, clone vidtools:

git clone https://github.com/liu-zhy/vidtools.git && cd vidtools

Extract frames by running:
```
python extract_frames.py VIDEO_PATH/ \
-o DATASET_PATH/frames/ \
-j 16 --out_ext png
```
If disk storage is limited, we suggest using --out_ext jpg instead.

Generate the annotation by running ops/gen_label_sthv2.py.

The annotation consists of train.txt, val.txt and test.txt. Each line of a *.txt file has the following format:

frames/video_1 num_frames label_1
frames/video_2 num_frames label_2
frames/video_3 num_frames label_3
...
frames/video_N num_frames label_N
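
The line format above can be read with a few lines of Python. This is a minimal sketch, not the loader used in this repository: the names `VideoRecord` and `parse_annotation` are hypothetical, and it assumes the three fields are separated by whitespace and that the label contains no spaces.

```python
from typing import List, NamedTuple


class VideoRecord(NamedTuple):
    path: str        # frame directory, e.g. "frames/video_1"
    num_frames: int  # number of extracted frames
    label: str       # class label, kept as text (it may be an integer id)


def parse_annotation(text: str) -> List[VideoRecord]:
    """Parse annotation text with one 'path num_frames label' entry per line."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split from the right so a path containing spaces stays intact.
        path, num_frames, label = line.rsplit(maxsplit=2)
        records.append(VideoRecord(path, int(num_frames), label))
    return records


sample = "frames/video_1 32 7\nframes/video_2 48 3\n"
records = parse_annotation(sample)
print(records[0])
```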

The pre-processed dataset is organized with the following structure:

datasets
  |_ smsm
    |_ frames
    |  |_ [video_0]
    |  |  |_ img_00001.png
    |  |  |_ img_00002.png
    |  |  |_ ...
    |  |_ [video_1]
    |     |_ img_00001.png
    |     |_ img_00002.png
    |     |_ ...
    |_ annotations
       |_ train.txt
       |_ val.txt
       |_ test.txt
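
Before training, it can help to confirm that the frame counts in an annotation file match what is on disk. The sketch below is one way to do that under stated assumptions: annotation paths are taken to be relative to the dataset root (datasets/smsm in the tree above), frames are assumed to be named img_*.png or img_*.jpg, and the helper names `count_frames` and `find_mismatches` are hypothetical. The demo runs on a temporary directory, not on real data.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def count_frames(video_dir: Path) -> int:
    """Count files named img_*.png or img_*.jpg in one video's frame folder."""
    if not video_dir.is_dir():
        return 0
    return sum(1 for p in video_dir.glob("img_*") if p.suffix in {".png", ".jpg"})


def find_mismatches(root: Path, annotation: Path) -> List[Tuple[str, int, int]]:
    """Return (path, expected, found) for entries whose frame count differs."""
    mismatches = []
    for line in annotation.read_text().splitlines():
        if not line.strip():
            continue
        rel_path, num_frames, _label = line.rsplit(maxsplit=2)
        found = count_frames(root / rel_path)
        if found != int(num_frames):
            mismatches.append((rel_path, int(num_frames), found))
    return mismatches


# Demo on a throwaway directory that mimics datasets/smsm.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "smsm"
    video = root / "frames" / "video_0"
    video.mkdir(parents=True)
    for i in (1, 2):
        (video / f"img_{i:05d}.png").touch()
    ann = root / "annotations" / "train.txt"
    ann.parent.mkdir()
    # video_0 matches its two frames; video_1 has no frames on disk.
    ann.write_text("frames/video_0 2 5\nframes/video_1 3 1\n")
    demo_mismatches = find_mismatches(root, ann)

print(demo_mismatches)
```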

Configure the dataset in ops/dataset_config.py.

Training

To train on Something-Something V2 from ImageNet pretrained models, users can run scripts/train_somethingv2_rgb_8f.sh, which contains:

# train on Something-Something V2
python -u main.py somethingv2 RGB --arch resnet50 \
--num_segments 8 --lr 0.001 --lr_steps 10 20 --epochs 25  \
--batch-size 32 --workers 2 --dropout 0.5 \
--root_log ./checkpoints/path --root_model ./checkpoints/path \
--wd 0.0005 --gpus 0 --episodes 600

Training Arguments

num_segments: Number of frames sampled per video, defaults to 8
lr: Initial learning rate, defaults to 0.001
lr_steps: Epochs at which the learning rate is divided by 10, defaults to [10, 20]
batch-size: Mini-batch size, defaults to 128
workers: Number of data loading workers, defaults to 8
epochs: Number of training epochs, defaults to 25
wd: Weight decay, defaults to 5e-4
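
To make the effect of lr and lr_steps concrete, here is a small sketch of a step schedule. It assumes epochs are counted from 0 and that the decay takes effect from the listed epoch onward; the function name `learning_rate` is hypothetical and the actual scheduling code in this repository may differ in those details.

```python
from typing import Sequence


def learning_rate(epoch: int, base_lr: float = 0.001,
                  lr_steps: Sequence[int] = (10, 20)) -> float:
    """Step schedule: divide base_lr by 10 for every step epoch already reached."""
    num_decays = sum(1 for step in lr_steps if epoch >= step)
    return base_lr * (0.1 ** num_decays)


# With the defaults, the rate drops at epochs 10 and 20.
for epoch in (0, 10, 20):
    print(epoch, learning_rate(epoch))
```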

Testing

The pretrained models are available here

To test the downloaded pretrained models on Something-Something V2, users can modify and run scripts/test_somethingv2_rgb_8f.sh. For example, to test the 5-way 1-shot inductive setting on 10,000 episodes:

# test on Something-Something V2
python -u main.py somethingv2 RGB --arch resnet50 --num_segments 8 --workers 2 \
--root_log ./checkpoints/path --root_model ./checkpoints/path \
--resume ./checkpoints/path/ckpt.best.pth.tar --evaluate --gpus 0 --way 5 --shot 1 --episodes 10000

Few-shot Arguments

episodes: Number of test episodes, defaults to 10,000
way: Number of novel classes per episode, defaults to 5
shot: Number of support samples of each novel class, defaults to 1
n_query: Number of query samples of each novel class, defaults to 1
iter: Number of prototype refinement steps for each episode, defaults to 50
transductive: Whether to perform transductive (True) or inductive (False) inference, defaults to False
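
The way, shot and n_query arguments together define how one test episode is drawn. The sketch below illustrates generic N-way K-shot episode sampling over a list of class labels; it is not the sampler used in this repository, and the function name `sample_episode` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


def sample_episode(labels: Sequence[int], way: int = 5, shot: int = 1,
                   n_query: int = 1, rng: Optional[random.Random] = None
                   ) -> Tuple[List[int], List[int]]:
    """Return (support_indices, query_indices) for one few-shot episode."""
    rng = rng or random.Random()
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    # Only classes with enough videos for support and query can be drawn.
    eligible = [c for c, idx in by_class.items() if len(idx) >= shot + n_query]
    if len(eligible) < way:
        raise ValueError("not enough classes with shot + n_query videos")
    support, query = [], []
    for cls in rng.sample(eligible, way):
        picked = rng.sample(by_class[cls], shot + n_query)
        support.extend(picked[:shot])  # first `shot` videos form the support set
        query.extend(picked[shot:])    # the remaining videos are queries
    return support, query


labels = [i // 4 for i in range(40)]  # toy data: 10 classes, 4 videos each
support, query = sample_episode(labels, way=5, shot=1, n_query=1,
                                rng=random.Random(0))
print(len(support), len(query))
```

Each episode contains way * shot support videos and way * n_query query videos, and no video appears in both sets.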

Acknowledgment

We thank the following repos for providing helpful components and functions used in our work.

FEAT
TAM
