community[minor]: Add indexing via locality sensitive hashing to the Yellowbrick vector store (langchain-ai#20856)

- **Description:** Add LSH-based indexing to the Yellowbrick vector
store module
- **Twitter handle:** @markcusack

---------

Co-authored-by: markcusack <[email protected]>
Co-authored-by: markcusack <[email protected]>
Co-authored-by: Eugene Yurtsev <[email protected]>
4 people authored and dglogo committed May 8, 2024
1 parent 2e39a10 commit cb9bf4b
Showing 5 changed files with 1,095 additions and 170 deletions.
104 changes: 93 additions & 11 deletions docs/docs/integrations/vectorstores/yellowbrick.ipynb
@@ -98,7 +98,7 @@
"import psycopg2\n",
"from IPython.display import Markdown, display\n",
"from langchain.chains import LLMChain, RetrievalQAWithSourcesChain\n",
"from langchain_community.docstore.document import Document\n",
"from langchain.schema import Document\n",
"from langchain_community.vectorstores import Yellowbrick\n",
"from langchain_openai import ChatOpenAI, OpenAIEmbeddings\n",
"from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
@@ -209,14 +209,12 @@
"\n",
"# Define the SQL statement to create a table\n",
"create_table_query = f\"\"\"\n",
"CREATE TABLE if not exists {embedding_table} (\n",
" id uuid,\n",
" embedding_id integer,\n",
" text character varying(60000),\n",
" metadata character varying(1024),\n",
" embedding double precision\n",
"CREATE TABLE IF NOT EXISTS {embedding_table} (\n",
" doc_id uuid NOT NULL,\n",
" embedding_id smallint NOT NULL,\n",
" embedding double precision NOT NULL\n",
")\n",
"DISTRIBUTE ON (id);\n",
"DISTRIBUTE ON (doc_id);\n",
"truncate table {embedding_table};\n",
"\"\"\"\n",
"\n",
@@ -257,6 +255,8 @@
" f\"postgres://{urlparse.quote(YBUSER)}:{YBPASSWORD}@{YBHOST}:5432/{YB_DOC_DATABASE}\"\n",
")\n",
"\n",
"print(yellowbrick_doc_connection_string)\n",
"\n",
"# Establish a connection to the Yellowbrick database\n",
"conn = psycopg2.connect(yellowbrick_doc_connection_string)\n",
"\n",
@@ -324,7 +324,7 @@
"vector_store = Yellowbrick.from_documents(\n",
" documents=split_docs,\n",
" embedding=embeddings,\n",
" connection_string=yellowbrick_connection_string,\n",
" connection_info=yellowbrick_connection_string,\n",
" table=embedding_table,\n",
")\n",
"\n",
@@ -403,6 +403,88 @@
"print_result_sources(\"Whats an easy way to add users in bulk to Yellowbrick?\")"
]
},
{
"cell_type": "markdown",
"id": "1f39fd30",
"metadata": {},
"source": [
"## Part 6: Introducing an Index to Increase Performance\n",
"\n",
"Yellowbrick also supports indexing using the Locality-Sensitive Hashing approach. This is an approximate nearest-neighbor search technique, and allows one to trade off similarity search time at the expense of accuracy. The index introduces two new tunable parameters:\n",
"\n",
"- The number of hyperplanes, which is provided as an argument to `create_lsh_index(num_hyperplanes)`. The more documents, the more hyperplanes are needed. LSH is a form of dimensionality reduction. The original embeddings are transformed into lower dimensional vectors where the number of components is the same as the number of hyperplanes.\n",
"- The Hamming distance, an integer representing the breadth of the search. Smaller Hamming distances result in faster retreival but lower accuracy.\n",
"\n",
"Here's how you can create an index on the embeddings we loaded into Yellowbrick. We'll also re-run the previous chat session, but this time the retrieval will use the index. Note that for such a small number of documents, you won't see the benefit of indexing in terms of performance."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "02ba61c4",
"metadata": {},
"outputs": [],
"source": [
"system_template = \"\"\"Use the following pieces of context to answer the users question.\n",
"Take note of the sources and include them in the answer in the format: \"SOURCES: source1 source2\", use \"SOURCES\" in capital letters regardless of the number of sources.\n",
"If you don't know the answer, just say that \"I don't know\", don't try to make up an answer.\n",
"----------------\n",
"{summaries}\"\"\"\n",
"messages = [\n",
" SystemMessagePromptTemplate.from_template(system_template),\n",
" HumanMessagePromptTemplate.from_template(\"{question}\"),\n",
"]\n",
"prompt = ChatPromptTemplate.from_messages(messages)\n",
"\n",
"vector_store = Yellowbrick(\n",
" OpenAIEmbeddings(),\n",
" yellowbrick_connection_string,\n",
" embedding_table, # Change the table name to reflect your embeddings\n",
")\n",
"\n",
"lsh_params = Yellowbrick.IndexParams(\n",
" Yellowbrick.IndexType.LSH, {\"num_hyperplanes\": 8, \"hamming_distance\": 2}\n",
")\n",
"vector_store.create_index(lsh_params)\n",
"\n",
"chain_type_kwargs = {\"prompt\": prompt}\n",
"llm = ChatOpenAI(\n",
" model_name=\"gpt-3.5-turbo\", # Modify model_name if you have access to GPT-4\n",
" temperature=0,\n",
" max_tokens=256,\n",
")\n",
"chain = RetrievalQAWithSourcesChain.from_chain_type(\n",
" llm=llm,\n",
" chain_type=\"stuff\",\n",
" retriever=vector_store.as_retriever(\n",
" k=5, search_kwargs={\"index_params\": lsh_params}\n",
" ),\n",
" return_source_documents=True,\n",
" chain_type_kwargs=chain_type_kwargs,\n",
")\n",
"\n",
"\n",
"def print_result_sources(query):\n",
" result = chain(query)\n",
" output_text = f\"\"\"### Question: \n",
" {query}\n",
" ### Answer: \n",
" {result['answer']}\n",
" ### Sources: \n",
" {result['sources']}\n",
" ### All relevant sources:\n",
" {', '.join(list(set([doc.metadata['source'] for doc in result['source_documents']])))}\n",
" \"\"\"\n",
" display(Markdown(output_text))\n",
"\n",
"\n",
"# Use the chain to query\n",
"\n",
"print_result_sources(\"How many databases can be in a Yellowbrick Instance?\")\n",
"\n",
"print_result_sources(\"Whats an easy way to add users in bulk to Yellowbrick?\")"
]
},
{
"cell_type": "markdown",
"id": "697c8a38",
@@ -418,9 +500,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "langchain_venv",
"display_name": "Python 3",
"language": "python",
"name": "langchain_venv"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
2 changes: 1 addition & 1 deletion docs/docs/modules/data_connection/indexing.ipynb
@@ -60,7 +60,7 @@
" * document addition by id (`add_documents` method with `ids` argument)\n",
" * delete by id (`delete` method with `ids` argument)\n",
"\n",
"Compatible Vectorstores: `AnalyticDB`, `AstraDB`, `AzureCosmosDBVectorSearch`, `AzureSearch`, `AwaDB`, `Bagel`, `Cassandra`, `Chroma`, `CouchbaseVectorStore`, `DashVector`, `DatabricksVectorSearch`, `DeepLake`, `Dingo`, `ElasticVectorSearch`, `ElasticsearchStore`, `FAISS`, `HanaDB`, `LanceDB`, `Milvus`, `MyScale`, `OpenSearchVectorSearch`, `PGVector`, `Pinecone`, `Qdrant`, `Redis`, `Rockset`, `ScaNN`, `SupabaseVectorStore`, `SurrealDBStore`, `TimescaleVector`, `UpstashVectorStore`, `Vald`, `VDMS`, `Vearch`, `VespaStore`, `Weaviate`, `ZepVectorStore`, `TencentVectorDB`, `OpenSearchVectorSearch`.\n",
"Compatible Vectorstores: `AnalyticDB`, `AstraDB`, `AzureCosmosDBVectorSearch`, `AzureSearch`, `AwaDB`, `Bagel`, `Cassandra`, `Chroma`, `CouchbaseVectorStore`, `DashVector`, `DatabricksVectorSearch`, `DeepLake`, `Dingo`, `ElasticVectorSearch`, `ElasticsearchStore`, `FAISS`, `HanaDB`, `LanceDB`, `Milvus`, `MyScale`, `OpenSearchVectorSearch`, `PGVector`, `Pinecone`, `Qdrant`, `Redis`, `Rockset`, `ScaNN`, `SupabaseVectorStore`, `SurrealDBStore`, `TimescaleVector`, `UpstashVectorStore`, `Vald`, `VDMS`, `Vearch`, `VespaStore`, `Weaviate`, `ZepVectorStore`, `TencentVectorDB`, `OpenSearchVectorSearch`, `Yellowbrick`.\n",
" \n",
"## Caution\n",
"\n",
Expand Down
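
The Part 6 notebook text in this commit describes LSH in terms of a number of hyperplanes and a Hamming distance. The sketch below illustrates that general idea with random-hyperplane hashing in plain Python. It is not the Yellowbrick implementation: the function names (`make_hyperplanes`, `lsh_signature`, `hamming`, `candidates`), the Gaussian sampling, and the one-bit-per-hyperplane signature are all assumptions made for illustration.

```python
import random
from typing import List, Sequence


def make_hyperplanes(num_hyperplanes: int, dim: int, seed: int = 0) -> List[List[float]]:
    """Draw random hyperplane normals from a standard Gaussian (assumed scheme)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hyperplanes)]


def lsh_signature(vector: Sequence[float], hyperplanes: Sequence[Sequence[float]]) -> int:
    """Pack one bit per hyperplane: 1 if the vector is on its non-negative side."""
    bits = 0
    for normal in hyperplanes:
        dot = sum(v * n for v, n in zip(vector, normal))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def candidates(query_sig: int, signatures: Sequence[int], max_hamming: int) -> List[int]:
    """Return indices of stored signatures within max_hamming bits of the query."""
    return [i for i, s in enumerate(signatures) if hamming(query_sig, s) <= max_hamming]


# Hypothetical toy data: three 4-dimensional "embeddings".
planes = make_hyperplanes(num_hyperplanes=8, dim=4)
docs = [[1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0], [-1.0, 0.0, 0.0, 0.0]]
sigs = [lsh_signature(d, planes) for d in docs]
query_sig = lsh_signature(docs[0], planes)
print(candidates(query_sig, sigs, max_hamming=2))
```

In this sketch, a larger `max_hamming` admits more candidates for exact re-ranking, which is slower but misses fewer true neighbors. That is the same trade-off the notebook describes for `hamming_distance`.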
