[community] Added SentenceWindowRetriever #21260

Open
wants to merge 21 commits into base: master
Conversation

@rsk2327 (Contributor) commented May 3, 2024

  • PR title: "package: description"
    • Added appropriate title

Description

  • Adds a new type of retriever called Sentence Window Retriever

  • Also adds a modification to TextSplitter to help implement the retriever

  • Add tests and docs: No tests added yet; let me know if any specific tests are required.
    I plan to add documentation on how to run the retriever.

  • Lint and test: Tests ran successfully

Updated TextSplitter to include a new add_chunk_id option that adds a chunk_id variable to document metadata
Updated chunk_id logic to persist chunk_id across different pages of the same source text
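For context, a minimal sketch of the described behaviour, assuming a RecursiveCharacterTextSplitter, sample pages, and a "source"/"chunk_id" metadata scheme (illustrative only, not the actual diff):

from collections import defaultdict
from typing import Dict, List

from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Two pages of the same source text (assumed sample input)
pages = [
    Document(page_content="First page text ...", metadata={"source": "report.pdf", "page": 0}),
    Document(page_content="Second page text ...", metadata={"source": "report.pdf", "page": 1}),
]

splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=0)
counters: Dict[str, int] = defaultdict(int)
chunks: List[Document] = []
for page in pages:
    for text in splitter.split_text(page.page_content):
        meta = dict(page.metadata)
        meta["chunk_id"] = counters[meta["source"]]  # counter persists across pages of a source
        counters[meta["source"]] += 1
        chunks.append(Document(page_content=text, metadata=meta))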
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label May 3, 2024

@dosubot dosubot bot added Ɑ: retriever Related to retriever module Ɑ: text splitters Related to text splitters package 🤖:improvement Medium size change to existing code to handle new use-cases labels May 3, 2024
) -> List[Document]:
    """Sync implementations for retriever."""

    if type(self.store) == Chroma:
Collaborator

MAJOR: Are you able to implement a vectorstore-agnostic solution? If we can't make it agnostic, we will likely not add this technique to the code base.

My understanding was that this could be solved by using two abstractions:

  1. retriever: base retriever
  2. document store: an entity that given a document id can return the document content

Would composing or sub-classing ParentDocumentRetriever work? (a rough sketch of this composition follows below)
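(A rough, hypothetical skeleton of that two-abstraction composition, loosely mirroring how ParentDocumentRetriever/MultiVectorRetriever pair a vectorstore with a docstore. The class name, fields, and defaults are assumptions, not part of this PR.)

from typing import List

from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever
from langchain_core.stores import BaseStore
from langchain_core.vectorstores import VectorStore


class SentenceWindowRetrieverSketch(BaseRetriever):
    """Hypothetical skeleton: a vectorstore for search plus a docstore for id lookups."""

    vectorstore: VectorStore
    docstore: BaseStore[str, Document]
    window_size: int = 2

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        hits = self.vectorstore.similarity_search(query)
        # Neighbor expansion via self.docstore would go here (discussed further below).
        return hits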

Contributor Author

The issue with forcing a vectorstore-agnostic implementation is that each vectorstore has its own way of retrieving specific texts (either by their ID or by the new 'chunk_id' metadata variable).

One way we can make it mostly agnostic is by requiring users to add a specific ID variable to chunks (as the database's ID, not the chunk_id in the metadata). Then, for databases that support filtering by ID, we can have a general solution.

But for databases that don't support such filtering, we might need a specific implementation. If it's crucial not to have custom solutions, we could decide not to support those databases for this retriever.

Collaborator

We generally want the implementations to be generic; otherwise users will not be able to use them, and they become hard to maintain and hard to test and debug.

It sounds like before this can be implemented, we need to improve the vectorstore interface first.

What functionality do you need from vectorstores?

My understanding is that either of two techniques will work:

AND = {
   'source_id': source_id,
   'chunk_id' : {
     'between': (chunk_id-window_size, chunk_id+window_size)
   }
}

OR just a function to get documents by ids: get_documents_by_ids


The second case requires populating the metadata of each document with information about its nearby neighbors (including distance and doc ids).


By the way, you can work around this right now by combining BaseStore and VectorStore abstractions.
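(A rough illustration of that workaround, assuming the chunks were indexed in the vectorstore and also mirrored into a BaseStore under a "{source_id}:{chunk_id}" key scheme; the key scheme, metadata keys, and function name are assumptions, not part of this PR.)

from typing import List

from langchain_core.documents import Document
from langchain_core.stores import BaseStore
from langchain_core.vectorstores import VectorStore


def sentence_window_search(
    query: str,
    vectorstore: VectorStore,
    docstore: BaseStore[str, Document],
    window_size: int = 2,
    k: int = 4,
) -> List[Document]:
    """Expand each hit with its neighboring chunks, fetched from the docstore by key."""
    hits = vectorstore.similarity_search(query, k=k)
    expanded: List[Document] = []
    for hit in hits:
        source_id = hit.metadata["source_id"]  # assumed metadata keys
        chunk_id = hit.metadata["chunk_id"]
        keys = [
            f"{source_id}:{i}"
            for i in range(chunk_id - window_size, chunk_id + window_size + 1)
        ]
        # mget is the BaseStore batch lookup; missing neighbors come back as None
        expanded.extend(doc for doc in docstore.mget(keys) if doc is not None)
    return expanded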

Contributor Author

So, similar to embeddings.embed_query, which is agnostic of the vectorstore implementation, we need a common vectorstore capability that allows filtering on chunk ID (either the database id or the chunk_id in metadata) wherever available.

get_documents_by_ids is something that could ideally achieve that. The first option depends on the filtering syntax of each database, unless we implement our own filtering syntax that converts filter arguments into the database-specific filtering syntax; that would allow us to have an agnostic implementation for this retriever.

I'm not sure how the BaseStore + VectorStore combination would help work around the database-specific syntax requirements. If you could share more details on that, I can try implementing it.
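(A hypothetical illustration of the "our own filtering syntax" idea: the abstract between-filter from the earlier comment rewritten as a Chroma where clause using chromadb's $and/$eq/$gte/$lte operators; other stores would need their own translators. This is a sketch, not proposed code.)

def to_chroma_where(source_id: str, chunk_id: int, window_size: int) -> dict:
    # Translate the abstract filter into Chroma's where-clause syntax
    return {
        "$and": [
            {"source_id": {"$eq": source_id}},
            {"chunk_id": {"$gte": chunk_id - window_size}},
            {"chunk_id": {"$lte": chunk_id + window_size}},
        ]
    }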

Collaborator

BaseStore allows getting content by ID -- so it provides a way of doing get_documents_by_id
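(Concretely, a tiny example of that, assuming the langchain_core.stores import path and an arbitrary "{source_id}:{chunk_id}" key scheme; InMemoryStore stands in for any BaseStore.)

from langchain_core.documents import Document
from langchain_core.stores import InMemoryStore

docstore = InMemoryStore()
docstore.mset(
    [
        ("doc-1:0", Document(page_content="first sentence", metadata={"chunk_id": 0})),
        ("doc-1:1", Document(page_content="second sentence", metadata={"chunk_id": 1})),
    ]
)
# Batch get-by-id; effectively get_documents_by_id built on top of BaseStore
docs = docstore.mget(["doc-1:0", "doc-1:1"])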

Contributor Author

Thanks. Let me take a look at that and see if I can implement get_documents_by_id for vectorstore.

libs/text-splitters/langchain_text_splitters/base.py (outdated review comment, resolved)
Updated SWR implementation and implemented get_documents_by_ids for Milvus, Pinecone, Chroma
@efriis efriis added the partner label May 7, 2024
@efriis efriis self-assigned this May 7, 2024
@rsk2327 (Contributor Author) commented May 7, 2024

@eyurtsev So I have modified the implementation of SWR to be datastore agnostic.

I went with the approach of defining a get_documents_by_ids function on the vectorstores, which provides a common method for querying a vectorstore by IDs.

One of the issues with the implementation of SWR is that the search functionality is not standardized across vectorstores.

Chroma: has similarity_search_by_vector and similarity_search_by_vector_with_score
Pinecone: does not have similarity_search_by_vector, only similarity_search_by_vector_with_score
Milvus: has similarity_search_by_vector but, instead of similarity_search_by_vector_with_score, has 'similarity_search_with_score_by_vector', which is probably a typo

I can work around the different search method names for now (see the dispatch sketch below), but it might be helpful for Pinecone to also have a 'similarity_search_by_vector' implementation and for Milvus to use the same standardized function names as the other vectorstores.
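(One possible shape of that workaround; the method names and which integration exposes them are taken from the list above and are not verified here.)

from typing import Any, List, Tuple

from langchain_core.documents import Document
from langchain_core.vectorstores import VectorStore


def _search_by_vector_with_score(
    store: VectorStore, embedding: List[float], k: int = 4, **kwargs: Any
) -> List[Tuple[Document, float]]:
    # Dispatch to whichever scored by-vector search the store exposes.
    for name in (
        "similarity_search_by_vector_with_score",  # e.g. Chroma/Pinecone, per the list above
        "similarity_search_with_score_by_vector",  # e.g. Milvus, per the list above
    ):
        if hasattr(store, name):
            return getattr(store, name)(embedding, k=k, **kwargs)
    raise NotImplementedError(f"{type(store).__name__} has no scored by-vector search")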

@rsk2327 rsk2327 requested a review from eyurtsev May 8, 2024 23:10
@rsk2327 (Contributor Author) commented May 9, 2024

@eyurtsev @efriis Could I get a review on this?

Let me know if I need to add any additional details to explain the changes made.

I did have a question on how to include langchain_pinecone as a dependency within community. The unit tests throw an error when I import PineconeVectorStore from langchain_pinecone.

@rsk2327 (Contributor Author) commented May 13, 2024

@eyurtsev @efriis Can I get a review on this?

Removing import code for now as it was running into unknown issues. Will resolve later once the core code is verified.

@rsk2327 (Contributor Author) commented May 15, 2024

@eyurtsev @efriis @hwchase17 @baskaryan Can I get a review on this?

@eyurtsev (Collaborator) commented

@rsk2327 You'll need to stand by for ~1 month. We'll be focusing on the vectorstore abstractions after the 0.2 release.

The main things so far:

  • Addition of get_documents_by_ids to the base abstraction
  • Potentially, addition of an id attribute on a document, so the ID does not end up arbitrarily in the metadata (both items are sketched just after this comment)

Text splitters:

  • Determine what kind of metadata, if any, we should propagate in the text splitter for provenance purposes

I'll leave some comments in the PR itself as well
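(Neither change exists at this point in the thread; purely an illustrative sketch of the two items listed above, with all names assumed.)

from typing import List, Optional, Sequence

from langchain_core.documents import Document


class DocumentWithId(Document):
    """Hypothetical: the ID lives on the document itself rather than in metadata."""

    id: Optional[str] = None


class VectorStoreIdLookup:
    """Hypothetical shape of a get_documents_by_ids addition to the base abstraction."""

    def get_documents_by_ids(self, ids: Sequence[str]) -> List[Document]:
        raise NotImplementedError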

@@ -200,6 +200,7 @@ def similarity_search_by_vector_with_score(
k: int = 4,
filter: Optional[dict] = None,
namespace: Optional[str] = None,
include_id: Optional[bool] = False,
Collaborator

We don't want to the search API right now.

Contributor Author

You might have missed a word in there. Are you suggesting not to include the 'include_id' argument?

Is this related to your first comment, which said that we need to add an 'id' attribute to Document? I guess that would make the include_id argument unnecessary.



def _results_to_docs_and_scores(
    results: Any, include_id: Optional[bool] = False
Collaborator

nit: the code is not properly typed; what is results?

@@ -357,13 +370,15 @@ def similarity_search_by_vector(
k: int = DEFAULT_K,
filter: Optional[Dict[str, str]] = None,
where_document: Optional[Dict[str, str]] = None,
include_id: Optional[bool] = False,
Collaborator

We'll need to be careful in terms of how we deal with the ID, so it can be rolled out throughout the various integrations.

):
    metadata = result[1] or {}
    if include_id:
        metadata["id"] = result[3]
Collaborator

search logic should not be modifying metadata.

It's OK if it's present during indexing, but shouldn't be mutated on the search path, as the vectorstore should be returning the document as it was indexed.

Contributor Author

Yeah, this was an iffy change. I didn't want to create an entirely new key to store the ID, so I went with just adding it to the metadata. But if we go with your suggestion from the first comment and set up a new ID attribute for documents, that would resolve such issues.

@@ -1081,3 +1087,40 @@ def upsert(
"Failed to upsert entities: %s error: %s", self.collection_name, exc
)
raise exc

def get_documents_by_ids(self, ids: int | str | List[int | str]) -> List[Document]:
    # Generating filtering expr for passing to query function
Collaborator

nit: map into the same code path, or force users to always think in terms of batches (a good bias, since the code involves round trips between client code and the server)

if not isinstance(ids, (list, tuple)):
    ids = [ids]
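(One way to read that nit, sketched below: keep the public method batch-first so the store-specific query runs once per batch, and let any single-id convenience map into the same code path. Names and shape are assumptions, not the PR code.)

from typing import List, Sequence

from langchain_core.documents import Document


class BatchFirstLookup:
    def get_documents_by_ids(self, ids: Sequence[str]) -> List[Document]:
        # Store-specific query goes here; one round trip per batch.
        raise NotImplementedError

    def get_document_by_id(self, id_: str) -> Document:
        # Single-id convenience maps into the same batch code path.
        (doc,) = self.get_documents_by_ids([id_])
        return doc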
