[community] Added SentenceWindowRetriever #21260
base: master
Conversation
Updated TextSplitter to include a new add_chunk_id to add a chunk_id variable into document metadata
Updated chunk_id logic to persist chunk_id across different pages of the same source text
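The chunk_id bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual TextSplitter code: the function name, the `(source_id, text)` page format, and the dict-based stand-in for Document are all assumptions. The key idea is that the counter is keyed by source, so ids keep incrementing across pages of the same source text.

```python
# Sketch: assign a running chunk_id per source so it persists across pages
# of the same source text. Names here are illustrative, not the PR's code.
from collections import defaultdict

def split_with_chunk_ids(pages, chunk_size=20):
    """pages: list of (source_id, text). Returns dicts mimicking Documents."""
    counters = defaultdict(int)  # next chunk_id per source, shared across pages
    docs = []
    for source_id, text in pages:
        for start in range(0, len(text), chunk_size):
            docs.append({
                "page_content": text[start:start + chunk_size],
                "metadata": {"source_id": source_id,
                             "chunk_id": counters[source_id]},
            })
            counters[source_id] += 1
    return docs

docs = split_with_chunk_ids([("a", "x" * 50), ("a", "y" * 30), ("b", "z" * 10)])
```

With this scheme the second page of source "a" continues from chunk_id 3 rather than restarting at 0, which is what lets a retriever address neighboring chunks across page boundaries.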
libs/community/langchain_community/retrievers/sentence_window_retriever.py
    ) -> List[Document]:
        """Sync implementations for retriever."""

        if type(self.store) == Chroma:
MAJOR: Are you able to implement a vectorstore-agnostic solution? If we can't make it agnostic, we will likely not add this technique to the code base.
My understanding was that this could be solved by using two abstractions:
- retriever: base retriever
- document store: an entity that given a document id can return the document content
Would composing or sub-classing ParentRetriever work?
The issue with forcing a vectorstore-agnostic implementation is that each vectorstore has its own unique way of retrieving specific texts (either by their ID or by the new 'chunk_id' metadata variable).
One way we can make it mostly agnostic is by forcing users to add a specific ID variable to chunks (as the database's ID, not the chunk_id in the metadata). For the databases that support filtering by ID, we can then have a general solution.
But for the databases that don't support such filtering, we might need a specific implementation. If it's crucial not to have custom solutions, we could decide not to support those databases for this retriever.
We generally want the implementations to be generic; otherwise users will not be able to use them, and they become hard to maintain and hard to test / debug.
It sounds like before this can be implemented, we need to improve the vectorstore interface first.
What functionality do you need from vectorstores?
My understanding is that either one of two techniques will work:
AND = {
    'source_id': source_id,
    'chunk_id': {
        'between': (chunk_id - window_size, chunk_id + window_size)
    }
}
OR just a function to get documents by ids: get_documents_by_ids
The second case requires populating the metadata of each document with information about its nearby neighbors (including distance and doc ids).
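The first technique can be sketched against a plain in-memory list standing in for the vectorstore; the predicate below plays the role of the `between` filter above. The stand-in store and helper name are illustrative assumptions, not any library's API:

```python
# Sketch of window expansion via a (source_id, chunk_id range) filter.
# An in-memory list of dicts stands in for a real vectorstore; the
# predicate mimics the 'between' metadata filter described above.
def window_filter(source_id, chunk_id, window_size):
    lo, hi = chunk_id - window_size, chunk_id + window_size
    return lambda md: md["source_id"] == source_id and lo <= md["chunk_id"] <= hi

store = [{"page_content": f"chunk {i}",
          "metadata": {"source_id": "doc1", "chunk_id": i}} for i in range(10)]

# A similarity search hit on chunk 5 expands to chunks 3..7 with window_size=2.
match = window_filter("doc1", 5, 2)
window = [d for d in store if match(d["metadata"])]
```

Note the predicate clamps naturally at document boundaries: for a hit on chunk 0, the range simply matches fewer neighbors.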
By the way, you can work around this right now by combining BaseStore and VectorStore abstractions.
So, similar to embeddings.embed_query, which is agnostic of the vectorstore implementation, we would need a common vectorstore functionality that allows filtering on chunk ID (either the database id or the chunk_id in metadata) wherever available.
The get_documents_by_ids function would be something that could ideally achieve that. The first option depends on the filtering syntax of each database, unless we implement our own filtering syntax that converts filter arguments to the database-specific filtering syntax. That would allow us to have an agnostic implementation for this retriever.
I'm not sure how the BaseStore + VectorStore combination would help work around the database-specific syntax requirements. If you could share more details on that, I can try implementing it.
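A filter-translation layer of the kind mentioned above could look roughly like this. The target syntaxes follow the general shape of Chroma's operator-based filters and Milvus's boolean expressions, but the exact shapes below are illustrative and may not match each backend's current syntax; the function names are assumptions:

```python
# Hedged sketch: one shared (source_id, chunk range) filter intent,
# translated into two backend-specific shapes.

def to_chroma(source_id, lo, hi):
    # Chroma-style operator filter (shape is illustrative).
    return {"$and": [{"source_id": {"$eq": source_id}},
                     {"chunk_id": {"$gte": lo}},
                     {"chunk_id": {"$lte": hi}}]}

def to_milvus_expr(source_id, lo, hi):
    # Milvus-style boolean expression string (shape is illustrative).
    return f'source_id == "{source_id}" and chunk_id >= {lo} and chunk_id <= {hi}'
```

Each retriever call would build the window bounds once and hand them to whichever translator matches the configured vectorstore, keeping the retriever itself backend-agnostic.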
BaseStore allows getting content by ID -- so it provides a way of doing get_documents_by_id
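The BaseStore route can be sketched with a minimal in-memory stand-in. langchain_core's BaseStore exposes batch `mset`/`mget` methods of this general shape (with `mget` returning None for missing keys), but the toy class and helper below are assumptions for illustration, not the real implementation:

```python
# Toy key-value document store mirroring the BaseStore mset/mget shape:
# get-by-id then needs no vectorstore filtering at all.
class InMemoryDocStore:
    def __init__(self):
        self._data = {}

    def mset(self, key_value_pairs):
        self._data.update(key_value_pairs)

    def mget(self, keys):
        # None marks missing keys, as in BaseStore.mget
        return [self._data.get(k) for k in keys]

def get_documents_by_ids(store, ids):
    return [doc for doc in store.mget(ids) if doc is not None]

store = InMemoryDocStore()
store.mset([("doc1-0", "first chunk"), ("doc1-1", "second chunk")])
docs = get_documents_by_ids(store, ["doc1-0", "doc1-1", "doc1-9"])
```

The vectorstore is then used only for similarity search; neighbor lookups go through the key-value store, sidestepping the per-backend filter syntax problem.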
Thanks. Let me take a look at that and see if I can implement get_documents_by_id for vectorstore.
Updated SWR implementation and implemented get_documents_by_ids for Milvus, Pinecone, Chroma
@eyurtsev So I have modified the implementation of SWR to be datastore agnostic, and implemented get_documents_by_ids for Milvus, Pinecone, and Chroma. One remaining issue with the implementation of SWR is that the search functionality is not standardized across vectorstores. I can work around the different search method names for now, but it might be helpful for Pinecone to also have a 'similarity_search_by_vector' implementation and for Milvus to use the same standardized function names as the other vectorstores.
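The store-agnostic flow being described can be illustrated end to end: a similarity search returns one chunk, and its neighbors are then fetched purely by id. Everything here is a toy stand-in; the `"{source}-{chunk_id}"` id scheme, the corpus dict, and the helper names are assumptions, not the PR's code:

```python
# Sketch: expand a search hit to its sentence window using only get-by-id,
# so no vectorstore-specific filter syntax is needed.
def expand_to_window(hit_metadata, get_by_ids, window_size=1):
    src, cid = hit_metadata["source_id"], hit_metadata["chunk_id"]
    # Hypothetical id scheme: "{source}-{chunk_id}" assigned at indexing time.
    ids = [f"{src}-{i}" for i in range(cid - window_size, cid + window_size + 1)]
    return get_by_ids(ids)

# Toy corpus standing in for an indexed document store.
corpus = {f"doc1-{i}": f"sentence {i}" for i in range(5)}

def get_by_ids(ids):
    # Missing ids (e.g. past document boundaries) are silently dropped.
    return [corpus[i] for i in ids if i in corpus]

# Pretend a similarity search returned chunk 2 of doc1.
window = expand_to_window({"source_id": "doc1", "chunk_id": 2}, get_by_ids)
```

The only per-backend requirement this leaves is a get_documents_by_ids implementation, which is exactly what the comment above reports adding for Milvus, Pinecone, and Chroma.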
@eyurtsev @efriis Could I get a review on this? Let me know if I need to add any additional details to explain the changes made. I did have a question on how to include langchain_pinecone as a dependency within community: the unit tests throw an error when I import PineconeVectorStore from langchain_pinecone.
Removing import code for now as it was running into unknown issues. Will resolve them later once the core code is verified.
Deployment failed with the following error:
@eyurtsev @efriis @hwchase17 @baskaryan Can I get a review on this?
@rsk2327 You'll need to stand by for ~1 month. We'll be focusing on the vectorstore abstractions after the 0.2 release. The main things so far:
Text splitters:
I'll leave some comments in the PR itself as well
@@ -200,6 +200,7 @@ def similarity_search_by_vector_with_score(
    k: int = 4,
    filter: Optional[dict] = None,
    namespace: Optional[str] = None,
    include_id: Optional[bool] = False,
We don't want to the search API right now.
You might have missed a word in there. Are you suggesting not to include the 'include_id' argument?
Is this related to your first comment, which said that we need to add an 'id' attribute to Document? I guess that would make the include_id argument unnecessary.
def _results_to_docs_and_scores(
    results: Any, include_id: Optional[bool] = False
nit: code is not properly typed, what is results?
@@ -357,13 +370,15 @@ def similarity_search_by_vector(
    k: int = DEFAULT_K,
    filter: Optional[Dict[str, str]] = None,
    where_document: Optional[Dict[str, str]] = None,
    include_id: Optional[bool] = False,
We'll need to be careful in terms of how we deal with the ID, so it can be rolled out throughout the various integrations.
):
    metadata = result[1] or {}
    if include_id:
        metadata["id"] = result[3]
search logic should not be modifying metadata.
It's OK if it's present during indexing, but shouldn't be mutated on the search path, as the vectorstore should be returning the document as it was indexed.
Yeah, this was an iffy change. I didn't want to create an entirely new key to store the ID, so I went with just adding it to the metadata. But if we go with your suggestion from the first comment and set up a new ID attribute for documents, that would resolve such issues.
@@ -1081,3 +1087,40 @@ def upsert(
        "Failed to upsert entities: %s error: %s", self.collection_name, exc
    )
    raise exc

def get_documents_by_ids(self, ids: int | str | List[int | str]) -> List[Document]:
    # Generating filtering expr for passing to query function
nit: map into the same code path, or force users to always think in terms of batch (it's a good bias, since the code involves round trips between client code and server)
if not isinstance(ids, (list, tuple)):
ids = [ids]
Description
Adds a new type of retriever called Sentence Window Retriever
Also adds a modification to TextSplitter to help implement the retriever
Add tests and docs: If you're adding a new integration, please include
No tests added yet. Let me know if any specific tests are required.
Plan to add documentation on how to run the retriever
Lint and test: Tests ran successfully