CSTA: CNN-based Spatiotemporal Attention for Video Summarization (CVPR 2024 paper)

The official code of "CSTA: CNN-based Spatiotemporal Attention for Video Summarization"

Model overview

(Figure: overview of the CSTA model architecture)

Requirements

| Ubuntu | GPU | CUDA | cuDNN | conda | python |
|--------|-----|------|-------|-------|--------|
| 20.04.6 LTS | NVIDIA GeForce RTX 4090 | 12.1 | 8902 | 4.9.2 | 3.8.5 |

| h5py | numpy | scipy | torch | torchvision | tqdm |
|------|-------|-------|-------|-------------|------|
| 3.1.0 | 1.19.5 | 1.5.2 | 2.2.1 | 0.17.1 | 4.61.0 |

conda create -n CSTA python=3.8.5
conda activate CSTA
git clone https://github.com/thswodnjs3/CSTA.git
cd CSTA
pip install -r requirements.txt
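
The pinned versions in requirements.txt should correspond to the package table above; if needed, the same packages can also be installed directly (an equivalent command based on that table, assuming these are the only pins):

pip install h5py==3.1.0 numpy==1.19.5 scipy==1.5.2 torch==2.2.1 torchvision==0.17.1 tqdm==4.61.0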

Data

Link: Dataset
The two benchmark video summarization datasets (SumMe, TVSum), preprocessed and stored in HDF5 format.
Download the datasets and put them in the data/ directory.
The directory structure must be as below.

 ├── data
     ├── eccv16_dataset_summe_google_pool5.h5
     └── eccv16_dataset_tvsum_google_pool5.h5
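
To check that the downloaded files are readable, you can list the stored videos and their fields; a minimal sketch (the key names such as 'features' and 'gtscore' follow the common preprocessed-dataset convention and are assumptions, not guaranteed by this repo):

import h5py

# Open the SumMe file and print each video's name and stored fields.
with h5py.File('data/eccv16_dataset_summe_google_pool5.h5', 'r') as f:
    for video_name in f.keys():
        print(video_name, list(f[video_name].keys()))
        # Typical fields include per-frame features and ground-truth
        # importance scores, e.g. f[video_name]['features'] and
        # f[video_name]['gtscore'] (names assumed).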

You can see the details of both datasets below.

SumMe
TVSum

Pre-trained models

Link: Weights
You can download the pre-trained weights of CSTA.
There are 5 weights for the SumMe dataset and another 5 for the TVSum dataset (one weight per split).
As described in the paper, we ran every experiment 10 times (without fixing the seed), but for your convenience we upload only a single representative model per split.
The uploaded weights were obtained with seed 123456, and their results are almost identical to those in the paper.
Put the 5 SumMe weights in weights/SumMe and the 5 TVSum weights in weights/TVSum.
The directory structure must be as below.

 ├── weights
     ├── SumMe
         ├── split1.pt
         ├── split2.pt
         ├── split3.pt
         ├── split4.pt
         └── split5.pt
     └── TVSum
         ├── split1.pt
         ├── split2.pt
         ├── split3.pt
         ├── split4.pt
         └── split5.pt
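
To sanity-check a downloaded checkpoint, a minimal sketch (this assumes the .pt files are standard PyTorch checkpoints loadable with torch.load; their exact contents depend on how they were saved):

import torch

# Load one of the provided checkpoints on the CPU.
ckpt = torch.load('weights/SumMe/split1.pt', map_location='cpu')

# A .pt file usually stores either a state_dict (a dict of tensors)
# or a pickled model object; printing the type/keys shows which.
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])
else:
    print(type(ckpt))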

Training

You can train the final version of our models with the command below.

python train.py

Detailed explanations of all configurations will be added later.

You may not be able to reproduce our results perfectly.

As described in the paper, we ran every experiment 10 times without fixing the seed, so we cannot say which seeds yield the reported results.
Even if you set the seed to 123456, the same seed used for our pre-trained models, the results may still differ because of the non-deterministic behavior of the Adaptive Average Pooling layer.
To our knowledge, non-deterministic operations can produce different results even with the same seed. You can see details here.
However, you should get results similar to the pre-trained models when you set the seed to 123456, and we hope this is helpful.
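
For reference, seeding is typically done as below (an illustrative sketch, not an excerpt from train.py; the set_seed helper name is ours):

import random
import numpy as np
import torch

def set_seed(seed: int = 123456) -> None:
    # Seed Python, NumPy, and PyTorch (CPU and all GPUs).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels. Some operations, e.g. the CUDA
    # backward pass of adaptive average pooling, remain non-deterministic
    # regardless of the seed; torch.use_deterministic_algorithms(True)
    # would raise an error on them instead of silently varying.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(123456)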

Inference

You can check the final performance of the models with the command below.

python inference.py

All weight files must be located in the directory structure described above.

Citation

If you find our code or our paper useful, please give this repo a [★star] and [cite] the following paper:

@article{son2024csta,
  title={CSTA: CNN-based Spatiotemporal Attention for Video Summarization},
  author={Son, Jaewon and Park, Jaehun and Kim, Kwangsu},
  journal={arXiv preprint arXiv:2405.11905},
  year={2024}
}

Acknowledgement

We especially and sincerely appreciate the authors of PosENet and RR-STG, who responded to our requests very kindly.
Below are the papers we referenced for the code.

A2Summ - paper, code
CA-SUM - paper, code
DSNet - paper, code
iPTNet - paper
MSVA - paper, code
PGL-SUM - paper, code
PosENet - paper, code
RR-STG - paper
SSPVS - paper, code
STVT - paper, code
VASNet - paper, code
VJMHT - paper, code

@inproceedings{he2023a2summ,
  title = {Align and Attend: Multimodal Summarization with Dual Contrastive Losses},
  author={He, Bo and Wang, Jun and Qiu, Jielin and Bui, Trung and Shrivastava, Abhinav and Wang, Zhaowen},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2023}
}
@inproceedings{10.1145/3512527.3531404,
  author = {Apostolidis, Evlampios and Balaouras, Georgios and Mezaris, Vasileios and Patras, Ioannis},
  title = {Summarizing Videos Using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames},
  year = {2022},
  isbn = {9781450392389},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3512527.3531404},
  doi = {10.1145/3512527.3531404},
  pages = {407-415},
  numpages = {9},
  keywords = {frame diversity, frame uniqueness, concentrated attention, unsupervised learning, video summarization},
  location = {Newark, NJ, USA},
  series = {ICMR '22}
}
@article{zhu2020dsnet,
  title={DSNet: A Flexible Detect-to-Summarize Network for Video Summarization},
  author={Zhu, Wencheng and Lu, Jiwen and Li, Jiahao and Zhou, Jie},
  journal={IEEE Transactions on Image Processing},
  volume={30},
  pages={948--962},
  year={2020}
}
@inproceedings{jiang2022joint,
  title={Joint video summarization and moment localization by cross-task sample transfer},
  author={Jiang, Hao and Mu, Yadong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={16388--16398},
  year={2022}
}
@inproceedings{ghauri2021MSVA,
  title={Supervised Video Summarization via Multiple Feature Sets with Parallel Attention},
  author={Ghauri, Junaid Ahmed and Hakimov, Sherzod and Ewerth, Ralph},
  booktitle={IEEE International Conference on Multimedia and Expo (ICME)},
  year={2021}
}
@INPROCEEDINGS{9666088,
    author    = {Apostolidis, Evlampios and Balaouras, Georgios and Mezaris, Vasileios and Patras, Ioannis},
    title     = {Combining Global and Local Attention with Positional Encoding for Video Summarization},
    booktitle = {2021 IEEE International Symposium on Multimedia (ISM)},
    month     = {December},
    year      = {2021},
    pages     = {226-234}
}
@InProceedings{islam2020position,
   title={How much Position Information Do Convolutional Neural Networks Encode?},
   author={Islam, Md Amirul and Jia, Sen and Bruce, Neil},
   booktitle={International Conference on Learning Representations},
   year={2020}
 }
@article{zhu2022relational,
  title={Relational reasoning over spatial-temporal graphs for video summarization},
  author={Zhu, Wencheng and Han, Yucheng and Lu, Jiwen and Zhou, Jie},
  journal={IEEE Transactions on Image Processing},
  volume={31},
  pages={3017--3031},
  year={2022},
  publisher={IEEE}
}
@inproceedings{li2023progressive,
  title={Progressive Video Summarization via Multimodal Self-supervised Learning},
  author={Li, Haopeng and Ke, Qiuhong and Gong, Mingming and Drummond, Tom},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={5584--5593},
  year={2023}
}
@article{hsu2023video,
  title={Video summarization with spatiotemporal vision transformer},
  author={Hsu, Tzu-Chun and Liao, Yi-Sheng and Huang, Chun-Rong},
  journal={IEEE Transactions on Image Processing},
  year={2023},
  publisher={IEEE}
}
@misc{fajtl2018summarizing,
    title={Summarizing Videos with Attention},
    author={Jiri Fajtl and Hajar Sadeghi Sokeh and Vasileios Argyriou and Dorothy Monekosso and Paolo Remagnino},
    year={2018},
    eprint={1812.01969},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
@article{li2022video,
  title={Video Joint Modelling Based on Hierarchical Transformer for Co-summarization},
  author={Li, Haopeng and Ke, Qiuhong and Gong, Mingming and Zhang, Rui},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2022},
  publisher={IEEE}
}