community[minor]: Add indexing via locality sensitive hashing to the Yellowbrick vector store (langchain-ai#20856)

- **Description:** Add LSH-based indexing to the Yellowbrick vector store module
- **Twitter handle:** @markcusack

---------

Co-authored-by: markcusack <[email protected]>
Co-authored-by: markcusack <[email protected]>
Co-authored-by: Eugene Yurtsev <[email protected]>
dglogo · May 8, 2024 · cb9bf4b
1 parent 2e39a10
commit cb9bf4b

Showing 5 changed files with 1,095 additions and 170 deletions.
diff --git a/docs/docs/integrations/vectorstores/yellowbrick.ipynb b/docs/docs/integrations/vectorstores/yellowbrick.ipynb
@@ -98,7 +98,7 @@
  "import psycopg2\n",
  "from IPython.display import Markdown, display\n",
  "from langchain.chains import LLMChain, RetrievalQAWithSourcesChain\n",
- "from langchain_community.docstore.document import Document\n",
+ "from langchain.schema import Document\n",
  "from langchain_community.vectorstores import Yellowbrick\n",
  "from langchain_openai import ChatOpenAI, OpenAIEmbeddings\n",
  "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
@@ -209,14 +209,12 @@
  "\n",
  "# Define the SQL statement to create a table\n",
  "create_table_query = f\"\"\"\n",
- "CREATE TABLE if not exists {embedding_table} (\n",
- " id uuid,\n",
- " embedding_id integer,\n",
- " text character varying(60000),\n",
- " metadata character varying(1024),\n",
- " embedding double precision\n",
+ "CREATE TABLE IF NOT EXISTS {embedding_table} (\n",
+ " doc_id uuid NOT NULL,\n",
+ " embedding_id smallint NOT NULL,\n",
+ " embedding double precision NOT NULL\n",
  ")\n",
- "DISTRIBUTE ON (id);\n",
+ "DISTRIBUTE ON (doc_id);\n",
  "truncate table {embedding_table};\n",
  "\"\"\"\n",
  "\n",
@@ -257,6 +255,8 @@
  " f\"postgres://{urlparse.quote(YBUSER)}:{YBPASSWORD}@{YBHOST}:5432/{YB_DOC_DATABASE}\"\n",
  ")\n",
  "\n",
+ "print(yellowbrick_doc_connection_string)\n",
+ "\n",
  "# Establish a connection to the Yellowbrick database\n",
  "conn = psycopg2.connect(yellowbrick_doc_connection_string)\n",
  "\n",
@@ -324,7 +324,7 @@
  "vector_store = Yellowbrick.from_documents(\n",
  " documents=split_docs,\n",
  " embedding=embeddings,\n",
- " connection_string=yellowbrick_connection_string,\n",
+ " connection_info=yellowbrick_connection_string,\n",
  " table=embedding_table,\n",
  ")\n",
  "\n",
@@ -403,6 +403,88 @@
  "print_result_sources(\"Whats an easy way to add users in bulk to Yellowbrick?\")"
  ]
  },
+ {
+ "cell_type": "markdown",
+ "id": "1f39fd30",
+ "metadata": {},
+ "source": [
+ "## Part 6: Introducing an Index to Increase Performance\n",
+ "\n",
+ "Yellowbrick also supports indexing using the Locality-Sensitive Hashing (LSH) approach. This is an approximate nearest-neighbor search technique that trades accuracy for faster similarity search. The index introduces two new tunable parameters:\n",
+ "\n",
+ "- The number of hyperplanes, which is provided as the `num_hyperplanes` entry of the index parameters passed to `create_index`. The more documents, the more hyperplanes are needed. LSH is a form of dimensionality reduction: the original embeddings are transformed into lower-dimensional vectors whose number of components equals the number of hyperplanes.\n",
+ "- The Hamming distance, an integer representing the breadth of the search. Smaller Hamming distances result in faster retrieval but lower accuracy.\n",
+ "\n",
+ "Here's how you can create an index on the embeddings we loaded into Yellowbrick. We'll also re-run the previous chat session, but this time the retrieval will use the index. Note that for such a small number of documents, you won't see the benefit of indexing in terms of performance."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "02ba61c4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "system_template = \"\"\"Use the following pieces of context to answer the users question.\n",
+ "Take note of the sources and include them in the answer in the format: \"SOURCES: source1 source2\", use \"SOURCES\" in capital letters regardless of the number of sources.\n",
+ "If you don't know the answer, just say that \"I don't know\", don't try to make up an answer.\n",
+ "----------------\n",
+ "{summaries}\"\"\"\n",
+ "messages = [\n",
+ " SystemMessagePromptTemplate.from_template(system_template),\n",
+ " HumanMessagePromptTemplate.from_template(\"{question}\"),\n",
+ "]\n",
+ "prompt = ChatPromptTemplate.from_messages(messages)\n",
+ "\n",
+ "vector_store = Yellowbrick(\n",
+ " OpenAIEmbeddings(),\n",
+ " yellowbrick_connection_string,\n",
+ " embedding_table, # Change the table name to reflect your embeddings\n",
+ ")\n",
+ "\n",
+ "lsh_params = Yellowbrick.IndexParams(\n",
+ " Yellowbrick.IndexType.LSH, {\"num_hyperplanes\": 8, \"hamming_distance\": 2}\n",
+ ")\n",
+ "vector_store.create_index(lsh_params)\n",
+ "\n",
+ "chain_type_kwargs = {\"prompt\": prompt}\n",
+ "llm = ChatOpenAI(\n",
+ " model_name=\"gpt-3.5-turbo\", # Modify model_name if you have access to GPT-4\n",
+ " temperature=0,\n",
+ " max_tokens=256,\n",
+ ")\n",
+ "chain = RetrievalQAWithSourcesChain.from_chain_type(\n",
+ " llm=llm,\n",
+ " chain_type=\"stuff\",\n",
+ " retriever=vector_store.as_retriever(\n",
+ " search_kwargs={\"k\": 5, \"index_params\": lsh_params}\n",
+ " ),\n",
+ " return_source_documents=True,\n",
+ " chain_type_kwargs=chain_type_kwargs,\n",
+ ")\n",
+ "\n",
+ "\n",
+ "def print_result_sources(query):\n",
+ " result = chain(query)\n",
+ " output_text = f\"\"\"### Question: \n",
+ " {query}\n",
+ " ### Answer: \n",
+ " {result['answer']}\n",
+ " ### Sources: \n",
+ " {result['sources']}\n",
+ " ### All relevant sources:\n",
+ " {', '.join(list(set([doc.metadata['source'] for doc in result['source_documents']])))}\n",
+ " \"\"\"\n",
+ " display(Markdown(output_text))\n",
+ "\n",
+ "\n",
+ "# Use the chain to query\n",
+ "\n",
+ "print_result_sources(\"How many databases can be in a Yellowbrick Instance?\")\n",
+ "\n",
+ "print_result_sources(\"Whats an easy way to add users in bulk to Yellowbrick?\")"
+ ]
+ },
  {
  "cell_type": "markdown",
  "id": "697c8a38",
@@ -418,9 +500,9 @@
  ],
  "metadata": {
  "kernelspec": {
- "display_name": "langchain_venv",
+ "display_name": "Python 3",
  "language": "python",
- "name": "langchain_venv"
+ "name": "python3"
  },
  "language_info": {
  "codemirror_mode": {

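Note on the LSH parameters described in Part 6 of the notebook: the following is a minimal, self-contained sketch of random-hyperplane LSH, included only to illustrate what `num_hyperplanes` and `hamming_distance` control. It is not the Yellowbrick implementation from this commit; the helper names (`make_hyperplanes`, `lsh_signature`, `hamming`, `candidates`) and the toy vectors are hypothetical, and it assumes one signature bit per hyperplane.

```python
import random


def make_hyperplanes(num_hyperplanes, dim, seed=0):
    """Draw random hyperplane normals from a standard Gaussian (illustrative)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hyperplanes)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on the non-negative side."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(a, b):
    """Number of positions at which two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


def candidates(query, signatures, hyperplanes, hamming_distance):
    """Return ids whose signature is within `hamming_distance` bits of the query's."""
    q = lsh_signature(query, hyperplanes)
    return [
        doc_id
        for doc_id, sig in signatures.items()
        if hamming(q, sig) <= hamming_distance
    ]


# Toy data: 4-dimensional "embeddings" hashed with 8 hyperplanes.
hyperplanes = make_hyperplanes(num_hyperplanes=8, dim=4)
documents = {
    "doc_a": [1.0, 2.0, 3.0, 4.0],
    "doc_b": [1.1, 1.9, 3.2, 3.9],
    "doc_c": [-4.0, 0.5, -3.0, 1.0],
}
signatures = {doc_id: lsh_signature(vec, hyperplanes) for doc_id, vec in documents.items()}

# A wider Hamming distance admits more candidates (slower, more accurate);
# a narrower one admits fewer (faster, less accurate).
narrow = candidates(documents["doc_a"], signatures, hyperplanes, hamming_distance=0)
wide = candidates(documents["doc_a"], signatures, hyperplanes, hamming_distance=8)
```

In this sketch the candidate set would then be re-ranked by exact similarity; setting the Hamming distance equal to the number of hyperplanes degenerates to scanning every document.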
diff --git a/docs/docs/modules/data_connection/indexing.ipynb b/docs/docs/modules/data_connection/indexing.ipynb
@@ -60,7 +60,7 @@
  " * document addition by id (`add_documents` method with `ids` argument)\n",
  " * delete by id (`delete` method with `ids` argument)\n",
  "\n",
- "Compatible Vectorstores: `AnalyticDB`, `AstraDB`, `AzureCosmosDBVectorSearch`, `AzureSearch`, `AwaDB`, `Bagel`, `Cassandra`, `Chroma`, `CouchbaseVectorStore`, `DashVector`, `DatabricksVectorSearch`, `DeepLake`, `Dingo`, `ElasticVectorSearch`, `ElasticsearchStore`, `FAISS`, `HanaDB`, `LanceDB`, `Milvus`, `MyScale`, `OpenSearchVectorSearch`, `PGVector`, `Pinecone`, `Qdrant`, `Redis`, `Rockset`, `ScaNN`, `SupabaseVectorStore`, `SurrealDBStore`, `TimescaleVector`, `UpstashVectorStore`, `Vald`, `VDMS`, `Vearch`, `VespaStore`, `Weaviate`, `ZepVectorStore`, `TencentVectorDB`, `OpenSearchVectorSearch`.\n",
+ "Compatible Vectorstores: `AnalyticDB`, `AstraDB`, `AzureCosmosDBVectorSearch`, `AzureSearch`, `AwaDB`, `Bagel`, `Cassandra`, `Chroma`, `CouchbaseVectorStore`, `DashVector`, `DatabricksVectorSearch`, `DeepLake`, `Dingo`, `ElasticVectorSearch`, `ElasticsearchStore`, `FAISS`, `HanaDB`, `LanceDB`, `Milvus`, `MyScale`, `OpenSearchVectorSearch`, `PGVector`, `Pinecone`, `Qdrant`, `Redis`, `Rockset`, `ScaNN`, `SupabaseVectorStore`, `SurrealDBStore`, `TimescaleVector`, `UpstashVectorStore`, `Vald`, `VDMS`, `Vearch`, `VespaStore`, `Weaviate`, `ZepVectorStore`, `TencentVectorDB`, `OpenSearchVectorSearch`, `Yellowbrick`.\n",
  " \n",
  "## Caution\n",
  "\n",