
[Bug]: "Cannot submit more than 5,461 embeddings at once. Please submit your embeddings in batches of size 5,461 or less." but on running *.delete* #2181

Open
niceblue88 opened this issue May 10, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@niceblue88

What happened?

I'm having some serious usability problems with this embedding limit. I first hit it on upsert, where I think it is somewhat understandable. I have chunked that as described in #1049, and it now works fine.
HOWEVER, I am also having this same problem, inexplicably, with delete too, and it is much harder to understand why it has to be so there. The error is triggered by:
collection.delete(where={"dochash": dochash})
where dochash is a single, simple hash string.
I would think this is an extremely common use case, and not something that can be chunked.
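
For context, a minimal sketch of the setup that triggers this (the persist path, collection name, and dochash value are hypothetical; the key point is that the where clause matches more records than the batch limit):

import chromadb

client = chromadb.PersistentClient(path="persist_dir")
collection = client.get_or_create_collection("docs")

dochash = "abc123"  # hypothetical hash shared by many chunks of one document
# raises "Cannot submit more than 5,461 embeddings at once..." when the
# where clause matches more records than max_batch_size
collection.delete(where={"dochash": dochash})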

Versions

Chroma v0.5.0. Running on Windows. Python 3.10

Relevant log output

No response

@niceblue88 niceblue88 added the bug Something isn't working label May 10, 2024
@niceblue88 niceblue88 changed the title [Bug]: "Cannot submit more than 5,461 embeddings at once. Please submit your embeddings in batches of size 5,461 or less." but on DELETE [Bug]: "Cannot submit more than 5,461 embeddings at once. Please submit your embeddings in batches of size 5,461 or less." but on running *.delete* May 10, 2024
@tazarov
Contributor

tazarov commented May 11, 2024

@niceblue88, sorry you're facing this issue. Let me try to explain why it is happening, and then we can explore options to fix it.

The sqlite3 build on your system is compiled with certain limits. One of these, MAX_VARIABLE_NUMBER, caps how many bound variables sqlite3 can accept in a single statement. This, by extension, is reflected in Chroma's max_batch_size: every time you add/update/upsert/delete records, the number of records is checked against that limit. You correctly point out that you are not supplying ids in your delete(). However, Chroma turns your where clause into a list of records to delete, which in your case exceeds max_batch_size, hence the error you see.
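
For illustration, you can inspect the limit your own build produced via the same max_batch_size property used in the workaround below (a sketch; "persist_dir" is a placeholder path):

import chromadb

client = chromadb.PersistentClient(path="persist_dir")
# max_batch_size is derived from SQLite's MAX_VARIABLE_NUMBER at runtime
print(client.max_batch_size)  # 5461 in your error; varies with the SQLite build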

I agree this is not a pleasant issue to have; it should work as expected regardless of how many entries the where clause matches. I cannot say without further investigation how much work and impact fixing it would involve, but I'll dig deeper.

In the meantime, here's a way to avoid this problem:

import chromadb

client = chromadb.PersistentClient(path="persist_dir")
# collection.get() returns a dict; the ids of the matching records are under "ids"
ids_to_delete = collection.get(where={"dochash": dochash})["ids"]

# delete in batches no larger than the client's max_batch_size
for batch_index in range(0, len(ids_to_delete), client.max_batch_size):
    collection.delete(ids=ids_to_delete[batch_index:batch_index + client.max_batch_size])

@niceblue88
Author

Great reply, thank you for explaining so clearly. I thought it was exactly this, but the details you provide on SQLite3 are helpful. Digging into SQLite, any version from 3.32.0 onwards is supposed to have a default MAX_VARIABLE_NUMBER of 32,766. I checked the version of sqlite3 in Python on Windows, and it tells me it is 3.43.1. Correct me if I am wrong, but that should mean it can handle at least 32,000 ids? It seems not to, though. Is that because Chroma is not using the sqlite3 version I see in Python on Windows, but some other sqlite3 instance? Or has the 3.43.1 version for some reason still been compiled with a lower MAX_VARIABLE_NUMBER? I presume this SQLite came with the Python 3.11 install.
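
One way to check this from the same Python interpreter (a sketch; note that PRAGMA compile_options only lists MAX_VARIABLE_NUMBER when it was set explicitly at compile time, so its absence means the version's built-in default applies):

import sqlite3

print(sqlite3.sqlite_version)  # version of the SQLite library bundled with this Python

conn = sqlite3.connect(":memory:")
for (option,) in conn.execute("PRAGMA compile_options;"):
    if "MAX_VARIABLE_NUMBER" in option:
        print(option)  # e.g. MAX_VARIABLE_NUMBER=32766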

Either way, I think this is ideally something that should be addressed upfront in the install guide for ChromaDB. I will look into upgrading my SQLite version manually on Windows and see if that fixes the problem. If it does, perhaps that should be the recommendation in the install guide (for Windows), and for other platforms, perhaps ways of ensuring MAX_VARIABLE_NUMBER is similarly large. I also now understand why others hit this problem too, but at a much higher threshold of over 40,000 ids.
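
For reference, one approach that Chroma's troubleshooting docs describe for replacing an unsatisfactory system SQLite is to swap the stdlib module for pysqlite3 before importing chromadb. This is a sketch of that approach only; pysqlite3-binary wheels are published mainly for Linux, so it may not apply on Windows, where the stdlib sqlite3 instead loads sqlite3.dll from the Python installation's DLLs folder:

__import__("pysqlite3")
import sys
sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")

import chromadb  # imported after the swap so Chroma picks up the replacement sqlite3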

@tazarov
Contributor

tazarov commented May 14, 2024

@niceblue88, I think the SQLite3 limit is heavily dependent on its compile config. For instance, on my Mac M3 it supports a max_batch_size of up to ~83k records. You can see how max_batch_size is calculated here:

def max_batch_size(self) -> int:
    if self._max_batch_size is None:
        with self.tx() as cur:
            cur.execute("PRAGMA compile_options;")
            compile_options = cur.fetchall()

            for option in compile_options:
                if "MAX_VARIABLE_NUMBER" in option[0]:
                    # The pragma returns a string like 'MAX_VARIABLE_NUMBER=999'
                    self._max_batch_size = int(option[0].split("=")[1]) // (
                        self.VARIABLES_PER_RECORD
                    )

            if self._max_batch_size is None:
                # This value is the default for sqlite3 versions < 3.32.0
                # It is the safest value to use if we can't find the pragma for some
                # reason
                self._max_batch_size = 999 // self.VARIABLES_PER_RECORD
    return self._max_batch_size
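
As a side note, the 5,461 figure in your error message is consistent with this formula if VARIABLES_PER_RECORD is 6 and the build already uses the post-3.32.0 default of 32,766 (an assumption inferred from the numbers, not quoted from the source here):

print(32766 // 6)  # 5461 -> the limit reported in the error message above
print(999 // 6)    # 166  -> the fallback for SQLite builds older than 3.32.0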

It is a good idea to add a note to the install docs, or even the API usage docs, to inform users about the system-specific limitations of Chroma and the underlying SQLite3.

@niceblue88
Author

I know this was considered, but chunking could be included by default in Chroma (with negligible overhead when the number of ids does not exceed the max). At most five lines would be needed on the library side (as opposed to every client having to implement chunking). Why is this not done?

@tazarov
Contributor

tazarov commented May 20, 2024

I know this was considered, but chunking could be included by default in Chroma (with negligible overhead when the number of ids does not exceed the max). At most five lines would be needed on the library side (as opposed to every client having to implement chunking). Why is this not done?

You are right that this may only be a few lines, but it comes with significant assumptions. We went through this a while back; have a look here: #1077 (review)
