Embedding-based similarity search on Paul Graham's essays.
Demo recording: pg_embedding_recording_480.mov
Repo contains: a Python script that scrapes all of PG's essays, splits and tokenizes them into chunks, stores the chunks in a Supabase DB, and creates embeddings via LangChain/OpenAI; pgvector to compare embeddings; a Supabase Edge Function that serves documents similar to a given query; and a Next.js frontend.
That's a mouthful, so here are the steps.
misc/scraper/pg_scrape.py
- Pull all article links from the essays index page
- For each article, scrape the content
- Within each article, split the body into chunks and tokenize them with LangChain's SpacyTextSplitter (it worked best among the splitters tried)
- For each chunk, compute OpenAI embeddings
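The scraper steps above can be sketched roughly as follows. `extract_links` and `split_into_chunks` are simplified stand-ins written for illustration, not the repo's actual functions; the real script uses LangChain's SpacyTextSplitter and the OpenAI embeddings API:

```python
import re

def extract_links(index_html: str) -> list[str]:
    """Pull essay links out of the articles index page (very simplified)."""
    return re.findall(r'href="(\w+\.html)"', index_html)

def split_into_chunks(text: str, max_chars: int = 1000) -> list[str]:
    """Greedy sentence-based splitter standing in for SpacyTextSplitter."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# For each chunk the real script then calls the OpenAI embeddings endpoint,
# e.g. via LangChain's OpenAIEmbeddings().embed_documents(chunks).
```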
- Separately, go to Supabase and run
misc/scraper/db.sql
to create the DB schema
- Store all essay metadata and embeddings in the Supabase DB with pgvector enabled
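A rough sketch of the row shape the scraper writes. The column and table names here are assumptions for illustration (the real schema lives in misc/scraper/db.sql), as is the 1536-dimension size of OpenAI's text-embedding-ada-002 vectors:

```python
from typing import Any

EMBEDDING_DIM = 1536  # assumed: dimension of OpenAI text-embedding-ada-002 vectors

def make_chunk_row(essay_title: str, essay_url: str, chunk_index: int,
                   content: str, embedding: list[float]) -> dict[str, Any]:
    """Build one row for the pgvector-backed chunks table.

    Column names are illustrative; misc/scraper/db.sql defines the real schema.
    """
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(
            f"expected {EMBEDDING_DIM}-dim embedding, got {len(embedding)}")
    return {
        "essay_title": essay_title,
        "essay_url": essay_url,
        "chunk_index": chunk_index,
        "content": content,
        "embedding": embedding,
    }

# Insertion would then go through the Supabase client, roughly:
#   supabase.table("chunks").insert(rows).execute()
```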
misc/supabase/functions/embedsearch/index.ts
- This runs as a Supabase Edge Function (chosen primarily for latency reasons)
- Given a search query, we compute its embedding (via OpenAI) and query our Supabase DB for the most similar chunks
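Under the hood, pgvector ranks rows by distance between embeddings; its `<=>` operator computes cosine distance. A minimal Python sketch of what the similarity query does:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """What pgvector's <=> operator computes: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def most_similar(query_embedding: list[float], rows: list[dict], k: int = 3) -> list[dict]:
    """Return the k rows closest to the query, mirroring
    ORDER BY embedding <=> query LIMIT k in SQL."""
    return sorted(rows, key=lambda r: cosine_distance(query_embedding, r["embedding"]))[:k]
```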
The frontend is Next.js, with Tailwind CSS for styling.