CLIFS-Contrastive-Language-Image-Forensic-Search

Overview

CLIFS (Contrastive Language-Image Forensic Search) is a Python function that takes a video file and a text prompt as input and uses OpenAI's CLIP (Contrastive Language-Image Pre-training) model to find the frame in the video most similar to the prompt. It extracts frames from the video, encodes each frame and the text prompt with CLIP, and computes the similarity between each frame embedding and the prompt embedding. The frame with the highest similarity score is then saved as an image file.
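A minimal sketch of that pipeline, assuming OpenCV, PyTorch, Pillow, and the openai/CLIP package are installed (the function name find_best_frame and the frame_step sampling parameter are illustrative, not the repository's actual code):

```python
import cv2
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

def find_best_frame(video_path, prompt, out_path="best_frame.jpg", frame_step=10):
    """Save the video frame most similar to the prompt and return its score.
    Hypothetical helper, not the repository's code."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)
    model.eval()

    # Encode the prompt once; normalize so the dot product is cosine similarity.
    with torch.no_grad():
        text_emb = model.encode_text(clip.tokenize([prompt]).to(device))
        text_emb /= text_emb.norm(dim=-1, keepdim=True)

    best_score, best_frame = -1.0, None
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_step == 0:  # sample every Nth frame to keep runtime manageable
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            with torch.no_grad():
                img_emb = model.encode_image(preprocess(image).unsqueeze(0).to(device))
                img_emb /= img_emb.norm(dim=-1, keepdim=True)
            score = (img_emb @ text_emb.T).item()
            if score > best_score:
                best_score, best_frame = score, frame
        idx += 1
    cap.release()

    if best_frame is not None:
        cv2.imwrite(out_path, best_frame)
    return best_score
```

Calling, for example, find_best_frame("input.mp4", "a tree with the text 'ZWK'") would reproduce the kind of query shown in the Examples section below.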

The CLIP model is a neural network trained on a large dataset of images and their corresponding captions. It learns to associate textual and visual information through a contrastive loss that pulls matching image-caption pairs close together in a shared embedding space while pushing mismatched pairs apart. CLIP is built on a transformer architecture whose self-attention lets it attend to different parts of the input text or image, capturing complex relationships between words and pixels.
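As a rough illustration of that objective (a sketch, not CLIP's actual training code; logit_scale stands in for the model's learned temperature parameter), the loss over a batch of N matching image-text pairs looks like:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, logit_scale):
    # Normalize so similarities are cosine similarities in the shared space.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * image_emb @ text_emb.t()  # (N, N) pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching pairs lie on the diagonal: cross-entropy in both directions
    # pulls them together and pushes mismatched pairs apart.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```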

In the CLIFS code, the ViT-B/32 variant of the model is used: a vision transformer with 12 layers and a patch size of 32x32 pixels. The model is loaded onto the GPU when one is available (falling back to the CPU otherwise), and its eval method is called to put it into evaluation mode.
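Loading that variant with the openai/CLIP package looks roughly like this (a sketch under the same assumptions as above):

```python
import torch
import clip

print(clip.available_models())  # lists pretrained variants, including 'ViT-B/32'

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()  # disables training-time behavior such as dropout
```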

Use-cases

CLIFS can be useful in a variety of video-related use cases where you need to find specific frames in a video that match a given text prompt. Some potential applications of CLIFS include:

1. Video surveillance: Investigators may need to search through hours of surveillance footage to find specific events or people. CLIFS can be used to quickly identify frames in the video that match a given text prompt, making it easier to find relevant footage.

2. Video editing: Video editors may need to find specific frames in a large library of footage to create a specific sequence or highlight reel. CLIFS can be used to quickly identify frames that match a given text prompt, making the editing process faster and more efficient.

3. Content moderation: Social media platforms may need to monitor user-generated content for inappropriate or harmful content. CLIFS can be used to quickly identify frames in a video that contain such content, allowing moderators to take appropriate action.

4. Video recommendation: Recommendation systems can use CLIFS to suggest videos that match a user's interests, scoring frames across a video library against text descriptions of those interests and surfacing the highest-scoring videos.

Examples

Prompt: A tree with the text 'ZWK'

[Output frame matching the prompt]

A few other outputs:

[Additional output frames]
