Embed-VTT ✨

This repo uses OpenAI embeddings and the Pinecone vector database to generate and query embeddings from a VTT file.

The purpose of this repo was to implement semantic search as an extra resource for understanding Andrej Karpathy's latest video, Let's build GPT; however, it is general enough to use with any transcript.

Shoutout to miguel's yt-whisper library for helping with the YouTube transcription. The data/ in this repo was generated using the small model.

Setup

Install

pip install -r requirements.txt

Environment

cp .env.sample .env

You'll need API keys from OpenAI & Pinecone:
OPENAI_KEY=***
PINECONE_KEY=***
PINECONE_ENVIRONMENT=*** (Optional)
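
A minimal sketch of how these variables can be read in Python, assuming the python-dotenv package (whether the repo itself loads them this way is an assumption):

# Minimal sketch: load the keys from .env into the process environment.
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads .env from the current working directory

openai_key = os.environ["OPENAI_KEY"]
pinecone_key = os.environ["PINECONE_KEY"]
pinecone_env = os.getenv("PINECONE_ENVIRONMENT")  # optional, per the sample above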

Pinecone

Head over to Pinecone and create an index with dimension 1536.
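
You can do this in the Pinecone console or from code. Below is a minimal sketch using the pinecone-client v2 style API; the index name embed-vtt and the cosine metric are assumptions, and 1536 matches the dimensionality of OpenAI's text-embedding-ada-002 vectors.

# Sketch: create a 1536-dimensional index (name and metric are assumptions).
import os
import pinecone

pinecone.init(api_key=os.environ["PINECONE_KEY"],
              environment=os.getenv("PINECONE_ENVIRONMENT", "us-east1-gcp"))

if "embed-vtt" not in pinecone.list_indexes():
    pinecone.create_index("embed-vtt", dimension=1536, metric="cosine")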

Data

The data in this repo was generated from Let's build GPT using yt-whisper.

  • /data/karpathy.vtt - contains the raw VTT file
  • /data/karpathy_embeddings.csv - contains the dataframe with the embeddings. You can use this file to directly seed your Pinecone index.

Usage

Generate Embeddings from a VTT file

This will save an embeddings CSV file as {file_name}_embeddings.csv.

python embed_vtt.py generate --vtt-file="data/karpathy.vtt"
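
Conceptually, the generate step parses the VTT cues, embeds their text with OpenAI, and writes everything to a CSV. A rough sketch of that flow, using webvtt-py and the legacy openai.Embedding API (the repo's exact chunking, model choice, and column names may differ):

# Rough sketch of VTT -> embeddings -> CSV (not the repo's exact implementation).
import os
import openai        # legacy (<1.0) openai API style
import pandas as pd
import webvtt        # webvtt-py

openai.api_key = os.environ["OPENAI_KEY"]

rows = []
for caption in webvtt.read("data/karpathy.vtt"):
    resp = openai.Embedding.create(input=caption.text, model="text-embedding-ada-002")
    rows.append({
        "text": caption.text,
        "start": caption.start,
        "end": caption.end,
        "embedding": resp["data"][0]["embedding"],  # 1536-dim vector
    })

pd.DataFrame(rows).to_csv("data/karpathy_embeddings.csv", index=False)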

Upload Embeddings from a CSV Embedding file

python embed_vtt.py upload --csv-embedding-file="data/karpathy_embeddings.csv"
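
In effect this reads the CSV back into a dataframe and upserts each row's vector into the Pinecone index, keeping the text and timestamps as metadata. A sketch assuming the pinecone-client v2 API, the index name from the Pinecone section, and the column names from the generate sketch above:

# Sketch of seeding the Pinecone index from the embeddings CSV (column names are assumptions).
import ast
import os
import pandas as pd
import pinecone

pinecone.init(api_key=os.environ["PINECONE_KEY"],
              environment=os.getenv("PINECONE_ENVIRONMENT", "us-east1-gcp"))
index = pinecone.Index("embed-vtt")

df = pd.read_csv("data/karpathy_embeddings.csv")
vectors = [
    (str(i),                              # vector id
     ast.literal_eval(row["embedding"]),  # the CSV stores each vector as a stringified list
     {"text": row["text"], "start": row["start"], "end": row["end"]})
    for i, row in df.iterrows()
]

# Upsert in batches to stay under Pinecone's per-request size limits.
for start in range(0, len(vectors), 100):
    index.upsert(vectors=vectors[start:start + 100])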

Query Embeddings from text

python embed_vtt.py query --text="the usefulness of trill tensors"

Sample output:

0.81: But let me talk through it. It uses softmax. So trill here is this matrix, lower triangular ones. 00:54:52.240-00:55:01.440
0.81: but torches this function called trill, which is short for a triangular, something like that. 00:48:48.960-00:48:55.920
0.80: which is a very thin wrapper around basically a tensor of shape vocab size by vocab size. 00:23:17.920-00:23:23.280
0.79: I'm creating this trill variable. Trill is not a parameter of the module. So in sort of pytorch 01:19:36.880-01:19:42.160
0.79: does that. And I'm going to start to use the PyTorch library, and specifically the Torch.tensor 00:12:54.320-00:12:59.200
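
Each line of the output is a similarity score, the matched transcript text, and its timestamps pulled from the stored metadata. A sketch of the query flow, again assuming the pinecone-client v2 API and the metadata fields from the upload sketch:

# Sketch of the query step: embed the text, then nearest-neighbour search in Pinecone.
import os
import openai
import pinecone

openai.api_key = os.environ["OPENAI_KEY"]
pinecone.init(api_key=os.environ["PINECONE_KEY"],
              environment=os.getenv("PINECONE_ENVIRONMENT", "us-east1-gcp"))
index = pinecone.Index("embed-vtt")

resp = openai.Embedding.create(input="the usefulness of trill tensors",
                               model="text-embedding-ada-002")
query_vector = resp["data"][0]["embedding"]

results = index.query(vector=query_vector, top_k=5, include_metadata=True)
for match in results.matches:
    meta = match.metadata
    print(f"{match.score:.2f}: {meta['text']} {meta['start']}-{meta['end']}")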

License

This script is open-source and licensed under the MIT License.
