[community] Added SentenceWindowRetriever #21260

Open
wants to merge 21 commits into base: master
Conversation

@rsk2327 (Contributor) commented May 3, 2024

  • PR title: "package: description"
    • Added appropriate title

Description

  • Adds a new type of retriever called Sentence Window Retriever

  • Also adds a modification to TextSplitter to help implement the retriever

  • Add tests and docs: No tests added yet; let me know if any specific tests are required.
    I plan to add documentation on how to run the retriever.

  • Lint and test: Tests ran successfully

Updated TextSplitter to include a new add_chunk_id option that adds a chunk_id variable to document metadata
Updated chunk_id logic to persist chunk_id across different pages of the same source text
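For context, a minimal sketch of the described behaviour, assuming a RecursiveCharacterTextSplitter, sample pages, and a "source"/"chunk_id" metadata scheme (illustrative only, not the actual diff):

from collections import defaultdict
from typing import Dict, List

from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Two pages of the same source text (assumed sample input)
pages = [
    Document(page_content="First page text ...", metadata={"source": "report.pdf", "page": 0}),
    Document(page_content="Second page text ...", metadata={"source": "report.pdf", "page": 1}),
]

splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=0)
counters: Dict[str, int] = defaultdict(int)
chunks: List[Document] = []
for page in pages:
    for text in splitter.split_text(page.page_content):
        meta = dict(page.metadata)
        meta["chunk_id"] = counters[meta["source"]]  # counter persists across pages of a source
        counters[meta["source"]] += 1
        chunks.append(Document(page_content=text, metadata=meta))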
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label May 3, 2024

@dosubot dosubot bot added Ɑ: retriever Related to retriever module Ɑ: text splitters Related to text splitters package 🤖:improvement Medium size change to existing code to handle new use-cases labels May 3, 2024
) -> List[Document]:
    """Sync implementations for retriever."""

    if type(self.store) == Chroma:
Collaborator

MAJOR: Are you able to implement a vectorstore-agnostic solution? If we can't make it agnostic, we will likely not add this technique to the code base.

My understanding was that this could be solved by using two abstractions:

  1. retriever: base retriever
  2. document store: an entity that given a document id can return the document content

Would composing or sub-classing ParentDocumentRetriever work? (a rough sketch of this composition follows below)
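(A rough, hypothetical skeleton of that two-abstraction composition, loosely mirroring how ParentDocumentRetriever/MultiVectorRetriever pair a vectorstore with a docstore. The class name, fields, and defaults are assumptions, not part of this PR.)

from typing import List

from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever
from langchain_core.stores import BaseStore
from langchain_core.vectorstores import VectorStore


class SentenceWindowRetrieverSketch(BaseRetriever):
    """Hypothetical skeleton: a vectorstore for search plus a docstore for id lookups."""

    vectorstore: VectorStore
    docstore: BaseStore[str, Document]
    window_size: int = 2

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        hits = self.vectorstore.similarity_search(query)
        # Neighbor expansion via self.docstore would go here (discussed further below).
        return hits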

Contributor Author

The issue with forcing a vectorstore-agnostic implementation is that each vectorstore has its own way of retrieving specific texts (either by their ID or by the new 'chunk_id' metadata variable).

One way we can make it mostly agnostic is by requiring users to add a specific ID variable to chunks (as the database's ID, not the chunk_id in the metadata). Then, for databases that support filtering by ID, we can have a general solution.

But for databases that don't support such filtering, we might need a specific implementation. If it's crucial not to have custom solutions, we could decide not to support those databases for this retriever.

Collaborator

We generally want the implementations to be generic; otherwise users will not be able to use them, and they become hard to maintain and hard to test and debug.

It sounds like before this can be implemented, we need to improve the vectorstore interface first.

What functionality do you need from vectorstores?

My understanding is that either of two techniques will work:

AND = {
   'source_id': source_id,
   'chunk_id' : {
     'between': (chunk_id-window_size, chunk_id+window_size)
   }
}

OR just a function to get documents by ids: get_documents_by_ids


The second case requires populating the metadata of each document with information about its nearby neighbors (including distance and doc ids).


By the way, you can work around this right now by combining BaseStore and VectorStore abstractions.
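(A rough illustration of that workaround, assuming the chunks were indexed in the vectorstore and also mirrored into a BaseStore under a "{source_id}:{chunk_id}" key scheme; the key scheme, metadata keys, and function name are assumptions, not part of this PR.)

from typing import List

from langchain_core.documents import Document
from langchain_core.stores import BaseStore
from langchain_core.vectorstores import VectorStore


def sentence_window_search(
    query: str,
    vectorstore: VectorStore,
    docstore: BaseStore[str, Document],
    window_size: int = 2,
    k: int = 4,
) -> List[Document]:
    """Expand each hit with its neighboring chunks, fetched from the docstore by key."""
    hits = vectorstore.similarity_search(query, k=k)
    expanded: List[Document] = []
    for hit in hits:
        source_id = hit.metadata["source_id"]  # assumed metadata keys
        chunk_id = hit.metadata["chunk_id"]
        keys = [
            f"{source_id}:{i}"
            for i in range(chunk_id - window_size, chunk_id + window_size + 1)
        ]
        # mget is the BaseStore batch lookup; missing neighbors come back as None
        expanded.extend(doc for doc in docstore.mget(keys) if doc is not None)
    return expanded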

Contributor Author

So, similar to embeddings.embed_query, which is agnostic of the vectorstore implementation, we need a common vectorstore capability that allows filtering on chunk ID (either the database id or the chunk_id in metadata) wherever available.

get_documents_by_ids is something that could ideally achieve that. The first option depends on the filtering syntax of each database, unless we implement our own filtering syntax that converts filter arguments into the database-specific filtering syntax; that would allow us to have an agnostic implementation for this retriever.

I'm not sure how the BaseStore + VectorStore combination would help work around the database-specific syntax requirements. If you could share more details on that, I can try implementing it.
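(A hypothetical illustration of the "our own filtering syntax" idea: the abstract between-filter from the earlier comment rewritten as a Chroma where clause using chromadb's $and/$eq/$gte/$lte operators; other stores would need their own translators. This is a sketch, not proposed code.)

def to_chroma_where(source_id: str, chunk_id: int, window_size: int) -> dict:
    # Translate the abstract filter into Chroma's where-clause syntax
    return {
        "$and": [
            {"source_id": {"$eq": source_id}},
            {"chunk_id": {"$gte": chunk_id - window_size}},
            {"chunk_id": {"$lte": chunk_id + window_size}},
        ]
    }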

Collaborator

BaseStore allows getting content by ID -- so it provides a way of doing get_documents_by_id
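(Concretely, a tiny example of that, assuming the langchain_core.stores import path and an arbitrary "{source_id}:{chunk_id}" key scheme; InMemoryStore stands in for any BaseStore.)

from langchain_core.documents import Document
from langchain_core.stores import InMemoryStore

docstore = InMemoryStore()
docstore.mset(
    [
        ("doc-1:0", Document(page_content="first sentence", metadata={"chunk_id": 0})),
        ("doc-1:1", Document(page_content="second sentence", metadata={"chunk_id": 1})),
    ]
)
# Batch get-by-id; effectively get_documents_by_id built on top of BaseStore
docs = docstore.mget(["doc-1:0", "doc-1:1"])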

Contributor Author

Thanks. Let me take a look at that and see if I can implement get_documents_by_id for vectorstore.

libs/text-splitters/langchain_text_splitters/base.py (outdated review comment, resolved)
Updated SWR implementation and implemented get_documents_by_ids for Milvus, Pinecone, Chroma
@efriis efriis added the partner label May 7, 2024
@efriis efriis self-assigned this May 7, 2024
@rsk2327 (Contributor Author) commented May 7, 2024

@eyurtsev So I have modified the implementation of SWR to be datastore agnostic.

I went with the approach of defining a get_documents_by_ids function on the vectorstores, which provides a common method for querying a vectorstore by IDs.

One of the issues with the implementation of SWR is that the search functionality is not standardized across vectorstores.

Chroma: has similarity_search_by_vector and similarity_search_by_vector_with_score
Pinecone: does not have similarity_search_by_vector, only similarity_search_by_vector_with_score
Milvus: has similarity_search_by_vector but, instead of similarity_search_by_vector_with_score, has 'similarity_search_with_score_by_vector', which is probably a typo

I can work around the different search method names for now (see the dispatch sketch below), but it might be helpful for Pinecone to also have a 'similarity_search_by_vector' implementation and for Milvus to use the same standardized function names as the other vectorstores.
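(One possible shape of that workaround; the method names and which integration exposes them are taken from the list above and are not verified here.)

from typing import Any, List, Tuple

from langchain_core.documents import Document
from langchain_core.vectorstores import VectorStore


def _search_by_vector_with_score(
    store: VectorStore, embedding: List[float], k: int = 4, **kwargs: Any
) -> List[Tuple[Document, float]]:
    # Dispatch to whichever scored by-vector search the store exposes.
    for name in (
        "similarity_search_by_vector_with_score",  # e.g. Chroma/Pinecone, per the list above
        "similarity_search_with_score_by_vector",  # e.g. Milvus, per the list above
    ):
        if hasattr(store, name):
            return getattr(store, name)(embedding, k=k, **kwargs)
    raise NotImplementedError(f"{type(store).__name__} has no scored by-vector search")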

@rsk2327 rsk2327 requested a review from eyurtsev May 8, 2024 23:10
@rsk2327 (Contributor Author) commented May 9, 2024

@eyurtsev @efriis Could I get a review on this?

Let me know if I need to add any additional details to explain the changes made.

I did have a question on how to include langchain_pinecone as a dependency within community. The unit tests throw an error when I import PineconeVectorStore from langchain_pinecone.

@rsk2327 (Contributor Author) commented May 13, 2024

@eyurtsev @efriis Can I get a review on this?

Removing import code for now as it was running into unknown issues. Will resolve later once the core code is verified.

@rsk2327 (Contributor Author) commented May 15, 2024

@eyurtsev @efriis @hwchase17 @baskaryan Can I get a review on this?

@eyurtsev (Collaborator) commented

@rsk2327 You'll need to stand by for ~1 month. We'll be focusing on the vectorstore abstractions after the 0.2 release.

The main things so far:

  • Addition of get_documents_by_ids to the base abstraction
  • Potentially, addition of an id attribute on a document, so the ID does not end up arbitrarily in the metadata (both items are sketched just after this comment)

Text splitters:

  • Determine what kind of metadata, if any, we should propagate in the text splitter for provenance purposes

I'll leave some comments in the PR itself as well
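(Neither change exists at this point in the thread; purely an illustrative sketch of the two items listed above, with all names assumed.)

from typing import List, Optional, Sequence

from langchain_core.documents import Document


class DocumentWithId(Document):
    """Hypothetical: the ID lives on the document itself rather than in metadata."""

    id: Optional[str] = None


class VectorStoreIdLookup:
    """Hypothetical shape of a get_documents_by_ids addition to the base abstraction."""

    def get_documents_by_ids(self, ids: Sequence[str]) -> List[Document]:
        raise NotImplementedError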

@@ -200,6 +200,7 @@ def similarity_search_by_vector_with_score(
k: int = 4,
filter: Optional[dict] = None,
namespace: Optional[str] = None,
include_id: Optional[bool] = False,
Collaborator

We don't want to the search API right now.

Contributor Author

You might have missed a word in there. Are you suggesting not to include the 'include_id' argument?

Is this related to your first comment, which said that we need to add an 'id' attribute to Document? I guess that would make the include_id argument unnecessary.



def _results_to_docs_and_scores(
    results: Any, include_id: Optional[bool] = False
Collaborator

nit: the code is not properly typed; what is results?

@@ -357,13 +370,15 @@ def similarity_search_by_vector(
k: int = DEFAULT_K,
filter: Optional[Dict[str, str]] = None,
where_document: Optional[Dict[str, str]] = None,
include_id: Optional[bool] = False,
Collaborator

We'll need to be careful in terms of how we deal with the ID, so it can be rolled out throughout the various integrations.

):
    metadata = result[1] or {}
    if include_id:
        metadata["id"] = result[3]
Collaborator

search logic should not be modifying metadata.

It's OK if it's present during indexing, but shouldn't be mutated on the search path, as the vectorstore should be returning the document as it was indexed.

Contributor Author

Yeah, this was an iffy change. I didn't want to create an entirely new key to store the ID, so I went with just adding it to the metadata. But if we go with your suggestion from the first comment and set up a new ID attribute for documents, that would resolve such issues.

@@ -1081,3 +1087,40 @@ def upsert(
"Failed to upsert entities: %s error: %s", self.collection_name, exc
)
raise exc

def get_documents_by_ids(self, ids: int | str | List[int | str]) -> List[Document]:
    # Generating filtering expr for passing to query function
Collaborator

nit: map into the same code path, or force users to always think in terms of batches (a good bias, since the code involves round trips between client code and the server)

if not isinstance(ids, (list, tuple)):
    ids = [ids]
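(One way to read that nit, sketched below: keep the public method batch-first so the store-specific query runs once per batch, and let any single-id convenience map into the same code path. Names and shape are assumptions, not the PR code.)

from typing import List, Sequence

from langchain_core.documents import Document


class BatchFirstLookup:
    def get_documents_by_ids(self, ids: Sequence[str]) -> List[Document]:
        # Store-specific query goes here; one round trip per batch.
        raise NotImplementedError

    def get_document_by_id(self, id_: str) -> Document:
        # Single-id convenience maps into the same batch code path.
        (doc,) = self.get_documents_by_ids([id_])
        return doc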
