Embedding-based similarity search on Paul Graham's essays.
Demo recording: pg_embedding_recording_480.mov
Repo contains: a Python script that scrapes all of PG's essays, splits and tokenizes them into chunks, stores the chunks in a Supabase DB, and creates embeddings via LangChain/OpenAI; pgvector to compare embeddings; a Supabase Edge Function that serves documents similar to a given query; and a Next.js frontend.
That's a mouthful, so here are the steps.
misc/scraper/pg_scrape.py
- Pull all article links from the essays index page
- For each article, scrape the content
- Within each article, split the body into chunks and tokenize them with LangChain's SpacyTextSplitter (it worked best among the splitters tried)
- For each chunk, compute OpenAI embeddings
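The scraper steps above can be sketched roughly as follows. `extract_links` and `split_into_chunks` are simplified stand-ins written for illustration, not the repo's actual functions; the real script uses LangChain's SpacyTextSplitter and the OpenAI embeddings API:

```python
import re

def extract_links(index_html: str) -> list[str]:
    """Pull essay links out of the articles index page (very simplified)."""
    return re.findall(r'href="(\w+\.html)"', index_html)

def split_into_chunks(text: str, max_chars: int = 1000) -> list[str]:
    """Greedy sentence-based splitter standing in for SpacyTextSplitter."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# For each chunk the real script then calls the OpenAI embeddings endpoint,
# e.g. via LangChain's OpenAIEmbeddings().embed_documents(chunks).
```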
- Separately, go to Supabase and run
misc/scraper/db.sql
to create the DB schema
- Store all essay metadata and embeddings in the Supabase DB with pgvector enabled
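A rough sketch of the row shape the scraper writes. The column and table names here are assumptions for illustration (the real schema lives in misc/scraper/db.sql), as is the 1536-dimension size of OpenAI's text-embedding-ada-002 vectors:

```python
from typing import Any

EMBEDDING_DIM = 1536  # assumed: dimension of OpenAI text-embedding-ada-002 vectors

def make_chunk_row(essay_title: str, essay_url: str, chunk_index: int,
                   content: str, embedding: list[float]) -> dict[str, Any]:
    """Build one row for the pgvector-backed chunks table.

    Column names are illustrative; misc/scraper/db.sql defines the real schema.
    """
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(
            f"expected {EMBEDDING_DIM}-dim embedding, got {len(embedding)}")
    return {
        "essay_title": essay_title,
        "essay_url": essay_url,
        "chunk_index": chunk_index,
        "content": content,
        "embedding": embedding,
    }

# Insertion would then go through the Supabase client, roughly:
#   supabase.table("chunks").insert(rows).execute()
```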
misc/supabase/functions/embedsearch/index.ts
- This runs as a Supabase Edge Function (chosen primarily for latency reasons)
- Given a search query, we compute its embedding (via OpenAI) and query our Supabase DB for the most similar chunks
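Under the hood, pgvector ranks rows by distance between embeddings; its `<=>` operator computes cosine distance. A minimal Python sketch of what the similarity query does:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """What pgvector's <=> operator computes: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def most_similar(query_embedding: list[float], rows: list[dict], k: int = 3) -> list[dict]:
    """Return the k rows closest to the query, mirroring
    ORDER BY embedding <=> query LIMIT k in SQL."""
    return sorted(rows, key=lambda r: cosine_distance(query_embedding, r["embedding"]))[:k]
```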
The frontend is Next.js, with Tailwind CSS for styling.