Record manager considers some of the documents as updated while they are not changed #21028
Comments
@MohammedShokr thanks for reporting. There isn't enough context here to determine whether the bug is in the SQLRecordManager, in another LangChain component, or in user code. If you're able to isolate the problem, please provide a minimal reproducible script; it should contain all relevant imports and the data to index.
Hi @eyurtsev,
Update:
@MohammedShokr for a minimal reproducible example, it's enough to seed this with a test case involving fake data:

```python
documents = [
    Document(page_content='hello', metadata={'source': 1}),
    Document(page_content='goodbye', metadata={'source': 2}),
    Document(page_content='meow', metadata={'source': 3}),
    Document(page_content='woof', metadata={'source': 1}),
]
```

Is the claim that indexing this with a batch size of 2 creates incorrect results? If you're able to create a test case like that, together with what you see vs. what you expect to see, that would be very helpful for us to fix the issue.
Here's a script to reproduce the issue. Run it twice and you will see that the record manager re-ingests the first document, because one of its chunks fell outside the batch.
OK I recreated the issue.
If this condition is not met, the indexing code will not be able to avoid some redundant work (i.e., it will end up forcefully re-indexing content that it should have skipped). The end state of the index is still correct, as long as there was no network failure in the middle, etc. What I need to do is:
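The mechanism behind that redundant work can be sketched with a toy model of incremental indexing. This is not the real SQLRecordManager: the `index_run` helper, the hash strings, and the run-counter standing in for timestamps are all illustrative.

```python
def index_run(batches, records, run_id):
    """Toy incremental indexer. `records` maps content hash -> (source, last_seen_run)
    and persists across runs; `run_id` stands in for a wall-clock timestamp."""
    stats = {"num_added": 0, "num_skipped": 0, "num_deleted": 0}
    for batch in batches:
        batch_sources = {source for _, source in batch}
        for content_hash, source in batch:
            if content_hash in records:
                stats["num_skipped"] += 1   # unchanged, already indexed
            else:
                stats["num_added"] += 1     # new (or previously purged) chunk
            records[content_hash] = (source, run_id)  # touch timestamp either way
        # Incremental cleanup after each batch: purge records that belong to a
        # source seen in this batch but were not touched during the current run.
        stale = [h for h, (source, seen) in records.items()
                 if source in batch_sources and seen < run_id]
        for h in stale:
            del records[h]
            stats["num_deleted"] += 1
    return stats

records = {}
chunks = [("h1", "a"), ("h2", "a")]          # two chunks of the same source
index_run([chunks], records, run_id=1)       # first run: both chunks in one batch
stats = index_run([[c] for c in chunks], records, run_id=2)  # rerun, batch size 1
# Nothing changed, yet "h2" is purged by the first batch's cleanup and re-added
# by the second: {'num_added': 1, 'num_skipped': 1, 'num_deleted': 1}
```

Rerunning with both chunks in one batch again would report two skips and no deletions, which is why keeping a source's chunks inside a single batch avoids the churn.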
Thank you for clarifying the situation. I'm considering increasing the batch size.
That's correct: it will decrease redundant work, but increase the window of time during which duplicates might exist. You can handle the issue entirely on your side by grouping documents that share the same source id into the same batch and controlling the batch size dynamically. I haven't checked, but I hope the indexing API works without a batch size; if that's the case, you should be able to control the indexing behavior entirely, without the oddity of having to dynamically calculate a batch size.
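A minimal sketch of that grouping idea in plain Python (the `(source_id, payload)` pair representation and the `batches_by_source` helper are illustrative, not part of LangChain):

```python
from collections import defaultdict

def batches_by_source(docs, target_batch_size):
    """Group (source_id, payload) pairs into batches so that chunks sharing
    a source id never straddle a batch boundary."""
    by_source = defaultdict(list)
    for source, payload in docs:                     # bucket chunks per source,
        by_source[source].append((source, payload))  # preserving first-seen order

    batches, current = [], []
    for group in by_source.values():
        # Close the current batch rather than split this source's group.
        if current and len(current) + len(group) > target_batch_size:
            batches.append(current)
            current = []
        # A group larger than the target simply becomes an oversized batch.
        current.extend(group)
    if current:
        batches.append(current)
    return batches

docs = [("a", "chunk1"), ("b", "chunk2"), ("c", "chunk3"), ("a", "chunk4")]
batches_by_source(docs, 2)
# -> [[('a', 'chunk1'), ('a', 'chunk4')], [('b', 'chunk2'), ('c', 'chunk3')]]
```

Each batch can then be passed to the indexing call separately, so the effective batch size varies but no source's chunks are ever split across batches.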
Thank you for the insight! Closing the issue now. |
Checked other resources
Example Code
Error Message and Stack Trace (if applicable)
No response
Description
When indexing a list of documents using the record manager in incremental deletion mode, with each document assigned a unique identifier (UUID) as the source, I encounter unexpected behavior: the record manager deletes and re-indexes a subset of documents even though those documents have not changed. Upon rerunning the same code with identical documents, the output is
{'num_added': 80, 'num_updated': 0, 'num_skipped': 525, 'num_deleted': 80}
Furthermore, I am using a recursive text splitter to segment the documents. I am also generating a summary for each document, and I set the summary's source metadata to that of the original document so the summary is treated as a chunk of the original document.
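That summary-tagging step can be sketched like this. The `Doc` dataclass is a plain-Python stand-in for a Document (text plus metadata), and the `as_summary_chunk` helper is hypothetical, not a LangChain API:

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Minimal stand-in for a Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def as_summary_chunk(parent: Doc, summary_text: str) -> Doc:
    # Reuse the parent's source id so the record manager groups the summary
    # with the parent's chunks and cleans them up together.
    return Doc(page_content=summary_text,
               metadata={"source": parent.metadata["source"]})

parent = Doc("a long report...", metadata={"source": "some-uuid"})
summary = as_summary_chunk(parent, "short summary of the report")
```

Because the summary carries the same source id, incremental cleanup treats it as one more chunk of that source, which is also why it participates in the batching behavior described above.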
Finally, please note that I tried the same code on different sets of documents, and the issue does not reproduce consistently.
System Info
System Information
Package Information
Packages not installed (Not Necessarily a Problem)
The following packages were not found: