🔍 Shotluck Holmes: A family of small-scale LLVMs for shot-level video understanding (Python, updated May 25, 2024)
A new multi-shot video understanding benchmark, Shot2Story, with comprehensive video summaries and detailed shot-level captions.
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models (CVPR 2023)
[NeurIPS 2022 Spotlight] Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
[ICCV 2023] Accurate and Fast Compressed Video Captioning
A curated list of video-text datasets in a variety of languages. These datasets can be used for video captioning (video description) or video retrieval.
[AAAI 2023 Oral] VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning
Convert SRT-formatted subtitles to WebVTT on the fly in an HTML5/browser environment
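The core of an SRT-to-WebVTT conversion is small: WebVTT requires a `WEBVTT` header and uses `.` rather than `,` as the decimal separator in cue timestamps. The sketch below is a minimal illustration of that transformation, not the linked project's actual implementation (which runs in the browser).

```python
import re

def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT subtitle text to WebVTT.

    Illustrative sketch: swaps the comma decimal separator in
    timestamp lines for a period and prepends the WEBVTT header.
    """
    # Only touch timestamps (HH:MM:SS,mmm), not commas in caption text.
    vtt_body = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", srt_text)
    return "WEBVTT\n\n" + vtt_body
```

A fuller converter would also strip SRT cue numbers and handle styling tags, but the timestamp rewrite above covers the essential format difference.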
(TIP) Concept-Aware Video Captioning: Describing Videos with Effective Prior Information
Winner solution to Generic Event Boundary Captioning task in LOVEU Challenge (CVPR 2023 workshop)
Visio Text is a real-time video captioning project that uses AI to generate dynamic text captions for videos.
A PyTorch implementation of EmpiricalMVM
[NeurIPS 2023 D&B] VidChapters-7M: Video Chapters at Scale
Summary about Video-to-Text datasets. This repository is part of the review paper *Bridging Vision and Language from the Video-to-Text Perspective: A Comprehensive Review*
(PRCV'2022) CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter
Data collection and automatic labeling for dense video captioning models
[ECCV 2020] PyTorch code of MMT (a multimodal transformer captioning model) on TVCaption dataset
An encoder-decoder deep learning model (with or without an attention mechanism) that takes an Arabic sign-language video as input and outputs its translation as text.
MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian (Bahasa Indonesia).
Codes and Models for COSA: Concatenated Sample Pretrained Vision-Language Foundation Model