community[minor]: Add indexing via locality sensitive hashing to the Yellowbrick vector store #20856

Merged
merged 81 commits into from
May 6, 2024
acf13fc
Add files via upload
markcusack Nov 24, 2023
7a77ca1
Add files via upload
markcusack Nov 24, 2023
88e9e15
Update __init__.py
markcusack Nov 24, 2023
7adfcc8
Moved to correct folder
markcusack Nov 24, 2023
1fa4cac
Fixed test
Nov 24, 2023
1ce8384
Fixed formatting
Nov 28, 2023
dbe8ef9
Fixed lint issues
Nov 29, 2023
a5c0a00
Fixed test to include FQ name of fake_embeddings location
Dec 3, 2023
24d8692
Added notebook example for the Yellowbrick vector store integration
markcusack Dec 9, 2023
217daf7
Fixed URLs to document sources
markcusack Dec 10, 2023
62a5c96
Fixed lint issues
markcusack Dec 10, 2023
d62a39f
Fixed import order issues
markcusack Dec 10, 2023
25b84ce
Resolved merge conflict
markcusack Dec 12, 2023
d531b61
Merge branch 'langchain-ai:master' into master
markcusack Jan 2, 2024
8ab1f67
Merge branch 'langchain-ai:master' into master
markcusack Jan 8, 2024
23698f6
Merge branch 'langchain-ai:master' into master
markcusack Jan 13, 2024
9ff6d78
Merge branch 'langchain-ai:master' into master
markcusack Feb 4, 2024
5122861
Merge branch 'langchain-ai:master' into master
markcusack Mar 3, 2024
34ac721
Update check_diff.py
markcusack Mar 3, 2024
d144997
Update check_diff.py
markcusack Mar 3, 2024
33de4e8
Merge branch 'langchain-ai:master' into master
markcusack Mar 4, 2024
1bcd790
Merge branch 'langchain-ai:master' into master
markcusack Mar 5, 2024
0af476d
Merge branch 'langchain-ai:master' into master
markcusack Mar 7, 2024
85722ca
Merge branch 'langchain-ai:master' into master
markcusack Mar 8, 2024
6d190e6
Merge branch 'langchain-ai:master' into master
markcusack Mar 11, 2024
ee45b25
Merge branch 'langchain-ai:master' into master
markcusack Mar 12, 2024
7d4a580
Merge branch 'langchain-ai:master' into master
markcusack Mar 13, 2024
4f54d13
Merge branch 'langchain-ai:master' into master
markcusack Mar 19, 2024
a9942ab
Merge branch 'langchain-ai:master' into master
markcusack Mar 22, 2024
afdaedc
Merge branch 'langchain-ai:master' into master
markcusack Mar 23, 2024
a04b5b5
Merge branch 'langchain-ai:master' into master
markcusack Mar 25, 2024
32c5980
Merge branch 'langchain-ai:master' into master
markcusack Apr 8, 2024
772d4f8
Merge branch 'langchain-ai:master' into master
markcusack Apr 23, 2024
8171347
Merge branch 'langchain-ai:master' into master
markcusack Apr 24, 2024
bcc4b7d
Merge branch 'langchain-ai:master' into master
markcusack Apr 24, 2024
afd984a
Add indexing implemented via locality-sensitive hashing
markcusack Apr 24, 2024
edbdba5
Fix integration test
markcusack Apr 24, 2024
4c03167
Fix integration test format
markcusack Apr 24, 2024
72e49cf
Merge branch 'master' into master
markcusack Apr 24, 2024
761e97e
Merge branch 'master' into master
markcusack Apr 24, 2024
4675c6d
Merge branch 'langchain-ai:master' into master
markcusack Apr 25, 2024
21b848f
Fix formatting for Yellowbrick Jupyter notebook
markcusack Apr 24, 2024
1602495
Generalize indexing configuration
markcusack Apr 25, 2024
532d6ab
Fix static typing errors
markcusack Apr 25, 2024
efd6cd6
Merge branch 'master' into master
markcusack Apr 25, 2024
7707837
Add context manager for get_cursor()
markcusack Apr 25, 2024
66c0b81
Add ability to pass in an existing DB connection
markcusack Apr 25, 2024
b6bc102
Merge branch 'master' into master
markcusack Apr 25, 2024
f04c7a2
Add schema migration support
markcusack Apr 25, 2024
867879d
Merge branch 'master' into master
markcusack Apr 25, 2024
53b4468
Merge branch 'master' into master
markcusack Apr 26, 2024
4e6fde8
Merge branch 'master' into master
markcusack Apr 26, 2024
1fa3c7d
Refactor connection and cursor handling
markcusack Apr 26, 2024
c7a14e6
Merge branch 'langchain-ai:master' into master
markcusack Apr 26, 2024
dbdf0b6
Merge branch 'master' into master
markcusack Apr 27, 2024
26a641b
Improve transaction handling, add schema support add delete support
markcusack Apr 27, 2024
7be468f
Add Yellowbrick as an index-compatible vector store
markcusack Apr 27, 2024
7c081f0
Add Yellowbrick as an index-compatible vector store
markcusack Apr 27, 2024
bf90b78
Fix notebook and add type hints
markcusack Apr 27, 2024
981f206
Fix formatting
markcusack Apr 27, 2024
82471ab
Add new tests and refactored delete function
markcusack Apr 27, 2024
025c7c6
Merge branch 'langchain-ai:master' into master
markcusack Apr 27, 2024
aa0e892
Merge branch 'master' into master
markcusack Apr 27, 2024
23ae39b
Fix integration test
markcusack Apr 27, 2024
846435c
Merge branch 'master' into master
markcusack Apr 29, 2024
5f78583
Merge upstream and fix conflict
markcusack Apr 30, 2024
6692acf
Merge branch 'master' into master
markcusack Apr 30, 2024
68a4416
Resolve conflict and refactor delete
markcusack May 2, 2024
67e6e46
Merge branch 'master' into master
markcusack May 2, 2024
f0be720
Merge branch 'langchain-ai:master' into master
markcusack May 2, 2024
64b585d
Remove breaking API change
markcusack May 2, 2024
9a271c9
Merge branch 'master' into master
markcusack May 2, 2024
9c50e4e
Merge branch 'langchain-ai:master' into master
markcusack May 3, 2024
a0f5d88
Modify delete behavior to accept delete_all as an option, and impleme…
markcusack May 3, 2024
b9c6973
Merge branch 'master' into master
markcusack May 3, 2024
cc3da18
Remove seed argument. Currently a noop on Yellowbrick
markcusack May 3, 2024
83a598c
Merge branch 'langchain-ai:master' into master
markcusack May 3, 2024
4edb24c
Simplified temp table handling, removing explicit commit and drops an…
markcusack May 3, 2024
d302164
Merge branch 'master' into master
markcusack May 3, 2024
fbe318a
Merge branch 'master' into master
eyurtsev May 6, 2024
763d0ba
Merge branch 'master' into master
markcusack May 6, 2024
104 changes: 93 additions & 11 deletions docs/docs/integrations/vectorstores/yellowbrick.ipynb
@@ -98,7 +98,7 @@
"import psycopg2\n",
"from IPython.display import Markdown, display\n",
"from langchain.chains import LLMChain, RetrievalQAWithSourcesChain\n",
"from langchain_community.docstore.document import Document\n",
"from langchain.schema import Document\n",
"from langchain_community.vectorstores import Yellowbrick\n",
"from langchain_openai import ChatOpenAI, OpenAIEmbeddings\n",
"from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
@@ -209,14 +209,12 @@
"\n",
"# Define the SQL statement to create a table\n",
"create_table_query = f\"\"\"\n",
"CREATE TABLE if not exists {embedding_table} (\n",
" id uuid,\n",
" embedding_id integer,\n",
" text character varying(60000),\n",
" metadata character varying(1024),\n",
" embedding double precision\n",
"CREATE TABLE IF NOT EXISTS {embedding_table} (\n",
" doc_id uuid NOT NULL,\n",
" embedding_id smallint NOT NULL,\n",
" embedding double precision NOT NULL\n",
")\n",
"DISTRIBUTE ON (id);\n",
"DISTRIBUTE ON (doc_id);\n",
"truncate table {embedding_table};\n",
"\"\"\"\n",
"\n",
@@ -257,6 +255,8 @@
" f\"postgres://{urlparse.quote(YBUSER)}:{YBPASSWORD}@{YBHOST}:5432/{YB_DOC_DATABASE}\"\n",
")\n",
"\n",
"print(yellowbrick_doc_connection_string)\n",
"\n",
"# Establish a connection to the Yellowbrick database\n",
"conn = psycopg2.connect(yellowbrick_doc_connection_string)\n",
"\n",
@@ -324,7 +324,7 @@
"vector_store = Yellowbrick.from_documents(\n",
" documents=split_docs,\n",
" embedding=embeddings,\n",
" connection_string=yellowbrick_connection_string,\n",
" connection_info=yellowbrick_connection_string,\n",
" table=embedding_table,\n",
")\n",
"\n",
@@ -403,6 +403,88 @@
"print_result_sources(\"Whats an easy way to add users in bulk to Yellowbrick?\")"
]
},
{
"cell_type": "markdown",
"id": "1f39fd30",
"metadata": {},
"source": [
"## Part 6: Introducing an Index to Increase Performance\n",
"\n",
"Yellowbrick also supports indexing using Locality-Sensitive Hashing (LSH). This is an approximate nearest-neighbor search technique that lets you trade accuracy for faster similarity search. The index introduces two new tunable parameters:\n",
"\n",
"- The number of hyperplanes, set with the `num_hyperplanes` index parameter passed to `create_index`. The more documents, the more hyperplanes are needed. LSH is a form of dimensionality reduction: the original embeddings are transformed into lower-dimensional vectors whose number of components equals the number of hyperplanes.\n",
"- The Hamming distance, an integer representing the breadth of the search. Smaller Hamming distances result in faster retrieval but lower accuracy.\n",
"\n",
"Here's how you can create an index on the embeddings we loaded into Yellowbrick. We'll also re-run the previous chat session, but this time the retrieval will use the index. Note that for such a small number of documents, you won't see the benefit of indexing in terms of performance."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "02ba61c4",
"metadata": {},
"outputs": [],
"source": [
"system_template = \"\"\"Use the following pieces of context to answer the users question.\n",
"Take note of the sources and include them in the answer in the format: \"SOURCES: source1 source2\", use \"SOURCES\" in capital letters regardless of the number of sources.\n",
"If you don't know the answer, just say that \"I don't know\", don't try to make up an answer.\n",
"----------------\n",
"{summaries}\"\"\"\n",
"messages = [\n",
" SystemMessagePromptTemplate.from_template(system_template),\n",
" HumanMessagePromptTemplate.from_template(\"{question}\"),\n",
"]\n",
"prompt = ChatPromptTemplate.from_messages(messages)\n",
"\n",
"vector_store = Yellowbrick(\n",
" OpenAIEmbeddings(),\n",
" yellowbrick_connection_string,\n",
" embedding_table, # Change the table name to reflect your embeddings\n",
")\n",
"\n",
"lsh_params = Yellowbrick.IndexParams(\n",
" Yellowbrick.IndexType.LSH, {\"num_hyperplanes\": 8, \"hamming_distance\": 2}\n",
")\n",
"vector_store.create_index(lsh_params)\n",
"\n",
"chain_type_kwargs = {\"prompt\": prompt}\n",
"llm = ChatOpenAI(\n",
" model_name=\"gpt-3.5-turbo\", # Modify model_name if you have access to GPT-4\n",
" temperature=0,\n",
" max_tokens=256,\n",
")\n",
"chain = RetrievalQAWithSourcesChain.from_chain_type(\n",
" llm=llm,\n",
" chain_type=\"stuff\",\n",
" retriever=vector_store.as_retriever(\n",
" k=5, search_kwargs={\"index_params\": lsh_params}\n",
" ),\n",
" return_source_documents=True,\n",
" chain_type_kwargs=chain_type_kwargs,\n",
")\n",
"\n",
"\n",
"def print_result_sources(query):\n",
" result = chain(query)\n",
" output_text = f\"\"\"### Question: \n",
" {query}\n",
" ### Answer: \n",
" {result['answer']}\n",
" ### Sources: \n",
" {result['sources']}\n",
" ### All relevant sources:\n",
" {', '.join(list(set([doc.metadata['source'] for doc in result['source_documents']])))}\n",
" \"\"\"\n",
" display(Markdown(output_text))\n",
"\n",
"\n",
"# Use the chain to query\n",
"\n",
"print_result_sources(\"How many databases can be in a Yellowbrick Instance?\")\n",
"\n",
"print_result_sources(\"Whats an easy way to add users in bulk to Yellowbrick?\")"
]
},
{
"cell_type": "markdown",
"id": "697c8a38",
@@ -418,9 +500,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "langchain_venv",
"display_name": "Python 3",
"language": "python",
"name": "langchain_venv"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
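The two LSH parameters described in the notebook's Part 6 (`num_hyperplanes` and `hamming_distance`) can be illustrated with a small pure-Python sketch of random-hyperplane hashing. This is an illustration of the general technique only and makes no claim about how the Yellowbrick integration implements it. All function names (`make_hyperplanes`, `lsh_signature`, `hamming`, `lsh_search`) are hypothetical and do not exist in `langchain_community`.

```python
import math
import random


def make_hyperplanes(num_hyperplanes, dim, seed=0):
    """Draw random Gaussian normal vectors, one per hyperplane (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hyperplanes)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(sig_a, sig_b):
    """Number of bit positions in which two signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


def cosine(a, b):
    """Exact cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def lsh_search(query, docs, hyperplanes, hamming_distance, k):
    """Keep docs whose signature is within hamming_distance bits of the
    query's signature, then rank only those candidates by exact cosine."""
    query_sig = lsh_signature(query, hyperplanes)
    candidates = [
        (doc_id, vec)
        for doc_id, vec in docs.items()
        if hamming(query_sig, lsh_signature(vec, hyperplanes)) <= hamming_distance
    ]
    candidates.sort(key=lambda item: cosine(query, item[1]), reverse=True)
    return [doc_id for doc_id, _ in candidates[:k]]


# Toy data: 8 hyperplanes over 4-dimensional vectors.
planes = make_hyperplanes(num_hyperplanes=8, dim=4, seed=1)
docs = {
    "a": [0.3, -1.2, 0.5, 2.0],
    "b": [1.0, 0.0, 0.0, 0.0],
    "c": [-0.3, 1.2, -0.5, -2.0],  # opposite direction to "a"
}
nearest = lsh_search(docs["a"], docs, planes, hamming_distance=2, k=5)
```

In this sketch a smaller `hamming_distance` admits fewer candidates, so less exact ranking work is done but true neighbors can be missed. Setting it equal to the number of hyperplanes admits every document and reduces the search to brute force.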
2 changes: 1 addition & 1 deletion docs/docs/modules/data_connection/indexing.ipynb
@@ -60,7 +60,7 @@
" * document addition by id (`add_documents` method with `ids` argument)\n",
" * delete by id (`delete` method with `ids` argument)\n",
"\n",
"Compatible Vectorstores: `AnalyticDB`, `AstraDB`, `AzureCosmosDBVectorSearch`, `AzureSearch`, `AwaDB`, `Bagel`, `Cassandra`, `Chroma`, `CouchbaseVectorStore`, `DashVector`, `DatabricksVectorSearch`, `DeepLake`, `Dingo`, `ElasticVectorSearch`, `ElasticsearchStore`, `FAISS`, `HanaDB`, `LanceDB`, `Milvus`, `MyScale`, `OpenSearchVectorSearch`, `PGVector`, `Pinecone`, `Qdrant`, `Redis`, `Rockset`, `ScaNN`, `SupabaseVectorStore`, `SurrealDBStore`, `TimescaleVector`, `UpstashVectorStore`, `Vald`, `VDMS`, `Vearch`, `VespaStore`, `Weaviate`, `ZepVectorStore`, `TencentVectorDB`, `OpenSearchVectorSearch`.\n",
"Compatible Vectorstores: `AnalyticDB`, `AstraDB`, `AzureCosmosDBVectorSearch`, `AzureSearch`, `AwaDB`, `Bagel`, `Cassandra`, `Chroma`, `CouchbaseVectorStore`, `DashVector`, `DatabricksVectorSearch`, `DeepLake`, `Dingo`, `ElasticVectorSearch`, `ElasticsearchStore`, `FAISS`, `HanaDB`, `LanceDB`, `Milvus`, `MyScale`, `OpenSearchVectorSearch`, `PGVector`, `Pinecone`, `Qdrant`, `Redis`, `Rockset`, `ScaNN`, `SupabaseVectorStore`, `SurrealDBStore`, `TimescaleVector`, `UpstashVectorStore`, `Vald`, `VDMS`, `Vearch`, `VespaStore`, `Weaviate`, `ZepVectorStore`, `TencentVectorDB`, `OpenSearchVectorSearch`, `Yellowbrick`.\n",
" \n",
"## Caution\n",
"\n",