[BUG]: The batch, the sync and the missing vector #2062

Open

tazarov wants to merge 5 commits into main from trayan-04-25-_bug_the_batch_the_sync_and_the_missing_vector
Conversation

@tazarov (Contributor) commented Apr 25, 2024

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • Under specific conditions, the metadata and binary (vector) indices go out of sync, causing errors in get() and loss of vector data

Test plan

How are these changes tested?

  • Tests pass locally with pytest for Python, yarn test for JS, cargo test for Rust

Documentation Changes

Affected issues

The following is a list of Discord discussions related to this issue:

Root Cause Analysis

TL;DR: Under specific (note: specific, not special) conditions, the metadata and vector segments go out of sync due to batching mechanics, causing vector data to be lost.

The Detail

A simple scenario: a user adds data to a collection, enough for the data to be moved from the brute-force index to HNSW (e.g. batch_size, which defaults to 100, is exceeded). At some point, the user decides they need to update a document (already in HNSW) and replace it with a fresh copy, a fairly common use case for making RAG systems useful. Down the line, the user calls delete() to remove the desired document's ID and add() to insert the new document. Chroma offers upsert(), but the number of affected issues and discussions shows that some people prefer delete/add mechanics over upsert. At the moment of insertion of the new data, they are greeted with `Add of existing embedding ID:`. It looks like a warning, and most people, including myself, didn't think much of it (I even went as far as creating a PR to bypass the warnings in WAL replays - https://github.com/chroma-core/chroma/pull/1763/files). In reality, underneath, the HNSW batching mechanism was silently discarding vector data for recently deleted vectors, causing the metadata and vector segments to go out of sync and leading to the following three types of problems on a subsequent get(include=["embeddings"]):

  • IndexError: list assignment index out of range
  • TypeError: 'NoneType' object is not subscriptable
  • No error at all but a mismatch in returned data lengths for IDs and embeddings

Note: We’ll cover more on the above errors below, and why the same underlying issue produces inconsistent error scenarios.

How to reproduce

```python
import shutil
import uuid

import chromadb

shutil.rmtree("get_vector_test", ignore_errors=True)
client = chromadb.PersistentClient("get_vector_test")

# Low batch_size so the BF -> HNSW transfer happens after just 10 adds
collection = client.get_or_create_collection(
    "test", metadata={"hnsw:batch_size": 10, "hnsw:sync_threshold": 20}
)

# 11 items: one more than batch_size, so the batch is applied to HNSW
items = [(f"index-id-{i}-{uuid.uuid4()}", i, [0.1] * 2) for i in range(11)]
ids = [item[0] for item in items]
embeddings = [item[2] for item in items]
collection.add(ids=ids, embeddings=embeddings)

print("Working with id: ", ids[0])
collection.delete(ids=[ids[0]])                     # staged in the batch
collection.add(ids=[ids[0]], embeddings=[[1] * 2])  # silently rejected
collection.get(include=["embeddings"])  # fails or returns mismatched data
```

What is affected

The defect affects PersistentClient and Chroma server.

**Why isn't in-memory affected?**

In-memory indices are not affected because the batch is updated and synchronized at the end of each transaction; see the two relevant locations in `_write_records` of `local_hnsw.py`.
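As a rough illustration (a simplified sketch, not the verbatim `local_hnsw.py` source), the in-memory write path applies the pending batch before the call returns, so its ID bookkeeping can never lag behind the metadata segment:

```python
# Simplified sketch (illustrative; the real local_hnsw.py differs):
# the in-memory segment applies its pending batch synchronously within
# every write, so its ID bookkeeping never lags behind the metadata
# segment the way the persistent segment's can.
class InMemorySegmentSketch:
    def __init__(self):
        self._index = {}    # stand-in for the HNSW index
        self._pending = []  # stand-in for the current batch

    def _write_records(self, records):
        for op, id, vector in records:
            self._pending.append((op, id, vector))  # stage the operation
        self._apply_batch()  # flush immediately: nothing stays pending

    def _apply_batch(self):
        for op, id, vector in self._pending:
            if op == "add":
                self._index[id] = vector
            elif op == "delete":
                self._index.pop(id, None)
        self._pending.clear()
```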

What really happens

Let’s start by visualizing things to illustrate how the defect works:

The happy path

The happy path is a normal vector segment layout: we have some data in the HNSW index and some in the brute-force (BF) index. The Batch keeps track of additions and deletions so that they can be synced happily once the BF index overflows batch_size. A rough mental model of this batching is sketched after the diagram below.

[Image: happy-path vector segment layout - HNSW index, BF index, and Batch]
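As a rough mental model of that batching (a sketch under assumed semantics, not Chroma's actual implementation; hnsw:batch_size and hnsw:sync_threshold are the real metadata keys):

```python
# Illustrative sketch of persistent-segment batching (not the real code).
BATCH_SIZE = 10      # hnsw:batch_size - when BF contents move to HNSW
SYNC_THRESHOLD = 20  # hnsw:sync_threshold - when HNSW is persisted

bf_index: list = []    # brute-force index: recent, unmerged vectors
hnsw_index: list = []  # stand-in for the HNSW index

def add(vector) -> None:
    bf_index.append(vector)
    if len(bf_index) >= BATCH_SIZE:
        hnsw_index.extend(bf_index)  # batch transfer BF -> HNSW
        bf_index.clear()
        if len(hnsw_index) >= SYNC_THRESHOLD:
            pass                     # persisting HNSW to disk happens here
```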

For the above layout, a regular query of the vector segment would look like this:

[Image: query flow over the happy-path layout]

There are two loops in the get_vectors() method. The initial loop prefills results from the BF index and records placeholders for HNSW IDs:

```python
for i, id in enumerate(target_ids):
    if id in ids_bf:
        results.append(self._brute_force_index.get_vectors([id])[0])
    elif id in ids_hnsw and id not in self._curr_batch._deleted_ids:
        hnsw_labels.append(self._id_to_label[id])
        # Placeholder for hnsw results to be filled in down below so we
        # can batch the hnsw get() call
        results.append(None)
        id_to_index[id] = i
```

The secondary batching loop fetches a batch of vectors from HNSW and fills them into the result set:

```python
if len(hnsw_labels) > 0 and self._index is not None:
    vectors = cast(Sequence[Vector], self._index.get_items(hnsw_labels))
    for label, vector in zip(hnsw_labels, vectors):
        id = self._label_to_id[label]
        results[id_to_index[id]] = VectorEmbeddingRecord(
            id=id, embedding=vector
        )
```

When things operate under normal conditions, as seen above, the id_to_index and the results align perfectly.

Now, let’s look at what happens when a vector is removed:

[Image: vector segment layout after delete() - ID 1 staged in the batch's deleted items]

The above shows the vector segment layout (state) after a delete() operation. An important fact to observe here is that while ID 1 goes into the batch's deleted items, it is not yet removed from the HNSW index, including its metadata held in _id_to_label, _label_to_id and _id_to_seq_id. Keep this in mind; it's important in the next diagram. Sending a get() at this stage still returns correct results, as HNSW vectors are fetched with IDs coming from the metadata index:

```python
vectors = vector_segment.get_vectors(ids=vector_ids)
```

The metadata segment is successfully updated to remove the ID from the SQLite tables:

[Image: metadata segment after delete() - ID removed from the SQLite tables]
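For illustration only (the actual schema and queries differ), the effect on the metadata side is equivalent to deleting the row for that ID from SQLite:

```python
# Illustrative sketch: the metadata segment's delete removes the ID from
# its SQLite tables immediately (real table/column names differ).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embeddings (embedding_id TEXT PRIMARY KEY)")
conn.execute("INSERT INTO embeddings VALUES ('index-id-1')")
conn.execute("DELETE FROM embeddings WHERE embedding_id = 'index-id-1'")
conn.commit()  # no trace of the ID remains on the metadata side
```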

So what happens when we add():

[Image: segment layouts after add() - metadata segment updated, vector segment rejects the update]

The WAL (Embedding Queue) works in a pub-sub way where each segment registers for updates. Each time a user adds data to Chroma, the embedding queue distributes that to all segment subscriptions. In single-node Chroma, there are just two segments for each collection:

  • Metadata segment subscription
  • Vector/HNSW local segment subscription

To ensure that your data is safely stored in the segments, Chroma notifies each segment sequentially and synchronously. The sequencing, however, provides no guarantee about which segment gets the update first:

```python
def _notify_all(self, topic: str, embeddings: Sequence[LogRecord]) -> None:
    """Send a notification to each subscriber of the given topic."""
    if self._running:
        for sub in self._subscriptions[topic]:
            self._notify_one(sub, embeddings)
```

While this may not seem relevant, it is an important detail when considering the solution to this problem.

As seen in the diagram above, the metadata segment is updated fine, as it no longer holds any reference to ID 1, while the vector segment rejects the update because it can still see the ID in its _id_to_label HNSW metadata. It is important to observe that the rejection does not result in an exception but in a mere warning, which in a client/server setup does not even make it to the client.
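A minimal sketch of that rejection path (illustrative; the names follow the prose above, the exact code differs):

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative sketch of the duplicate check on add. The delete is only
# staged in the batch, so the ID is still present in the HNSW
# bookkeeping (_id_to_label), and the legitimate re-add is dropped with
# nothing but a warning.
def stage_add(id, vector, id_to_label, pending_adds):
    if id in id_to_label:  # stale entry: ignores the batch's deleted IDs
        logger.warning("Add of existing embedding ID: %s", id)
        return             # vector silently discarded; segments diverge
    pending_adds[id] = vector
```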

So here we are: the metadata and vector segments are out of sync. This is not immediately visible; other operations such as add(), query(), etc. all work just fine until you get to get(). That is where you are confronted with the errors above, once you also try to include the embeddings (vectors).

But why does this problem surface in three different ways? The answer is deceptively simple: the key arrangement of the id_to_index dictionary:

```python
id_to_index: Dict[str, int] = {}
```

The arrangement largely depends on the IDs used; in our experiments, we used UUIDv4, which appears to be the most common approach people take to generating IDs in Chroma. The inherent randomness of UUIDs makes the key ordering within id_to_index unpredictable. In our experimentation, we’ve observed the following three states of the keys within id_to_index:

[Image: three observed key arrangements of id_to_index]

As the diagram exhibits, in the out-of-sync layout of the vector segment the baseline IDs come from the metadata segment, while the color coding indicates which subset of the vector segment they belong to: BF, HNSW, or missing (in red) if in neither.

In (1), the missing ID is at the beginning, so the batch ID fetching and assignment in results are not affected, which lets the results surface in SegmentAPI, where a TypeError: 'NoneType' object is not subscriptable is thrown because the first item in results is None.

In (2), the missing ID is somewhere in the middle of the keys, so an IndexError: list assignment index out of range is thrown within the vector segment during the batch fetching and assignment of results.

In (3), the missing ID is at the end of the id_to_index keys, which lets the missing result pass through both the vector segment and SegmentAPI completely unnoticed, returning results with mismatched cardinalities to the client.

Note: It is possible that we missed a corner case of (1) where there is a different ordering of BF and HNSW keys. However, in our experimentation, we have not observed anything other than the aforementioned errors. The conclusion is that even if there are sub-cases of (1), or of the other two scenarios, they result in the same error set.
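To make the outcomes concrete, here is a self-contained simulation of the two get_vectors() loops (simplified; all data is synthetic, and the missing ID matches neither branch of the first loop):

```python
from typing import Dict, List, Optional

def get_vectors_sim(target_ids, bf, hnsw, deleted):
    """Simplified re-enactment of the two loops in get_vectors()."""
    results: List[Optional[List[float]]] = []
    id_to_index: Dict[str, int] = {}
    hnsw_ids: List[str] = []
    for i, id in enumerate(target_ids):         # loop 1: prefill from BF
        if id in bf:
            results.append(bf[id])
        elif id in hnsw and id not in deleted:  # the missing ID matches
            hnsw_ids.append(id)                 # neither branch, so it gets
            results.append(None)                # no placeholder and no
            id_to_index[id] = i                 # id_to_index entry
    for id in hnsw_ids:                         # loop 2: fill HNSW batch
        results[id_to_index[id]] = hnsw[id]     # index can exceed len(results)
    return results

hnsw = {"a": [0.1], "b": [0.2], "x": [0.9]}     # "x" is the missing ID:
deleted = {"x"}                                 # staged for deletion

# (2) missing ID in the middle -> IndexError: list assignment index out of range
try:
    get_vectors_sim(["a", "x", "b"], {}, hnsw, deleted)
except IndexError as e:
    print("middle:", e)

# (3) missing ID at the end -> no error, but 3 requested IDs vs 2 results
print("end:", len(get_vectors_sim(["a", "b", "x"], {}, hnsw, deleted)))

# (1) depends on the BF/HNSW mix: when an unfilled None placeholder
# survives into SegmentAPI, subscripting it there raises
# TypeError: 'NoneType' object is not subscriptable
```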

Here’s the distribution of the errors:

[Image: error distribution across the three scenarios]

Key takeaways

  • The defect is easy to reproduce
  • The inconsistency of the errors (although the distribution leans heavily toward IndexError) makes the issue somewhat difficult to diagnose, especially with unlucky distributions of UUIDs or whatever IDs are used in tests
  • Metadata and vector segments go out of sync
  • Use of upsert() on a missing ID fixes the out-of-sync for that record
  • query(), which relies on metadata pre-filtering and HNSW filtering, does not appear to be affected by an execution error. However, user expectations might not be met, given that the document is visible in Chroma but a search for a similar or exact item does not match it
  • Original data can be recovered from WAL but requires specialized tooling
  • The extent of the issue is hard to estimate as it has existed ever since 0.4.x

Solutions

We’ve explored four possible solutions as follows:

  • 🚫 Change the delete mechanics to bypass batching - immediately remove vectors from HNSW and its metadata. After discussion with @HammadB, we decided to keep the batch semantics, as immediate removal makes the behavior harder to reason about.
  • 🚫 Add dict(s) to track deleted vectors and labels, plus the associated implementation to filter them out of get() and query(). We decided not to go for this approach for the following reasons:
    • Much more involved implementation
    • Tackles the root cause, but with a complexity trade-off
  • 🚫 Throw an exception instead of a warning on duplicate add. We decided not to go for this for the following reasons:
    • It requires making architectural decisions with respect to the ordering of segment notifications and possibly implementing a rollback
    • Complex
    • Does not directly tackle the underlying problem
  • 🚫 Better intersection of metadata IDs with vector segment IDs (BF and HNSW). We decided not to go with this approach, even though it is simpler than all of the above, because it does not tackle the underlying problem directly.
  • 🎉 Improve the index membership check for IDs - this is the one-liner winning solution (see the sketch below). Thanks @HammadB for the perspective shift.
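A sketch of the winning idea (illustrative; not the exact diff): when deciding whether an incoming ID already exists, an ID staged for deletion in the current batch must count as absent:

```python
# Illustrative one-liner: membership should exclude IDs that the current
# batch has staged for deletion, so a delete-then-add is not rejected
# as a duplicate.
def id_exists(id, id_to_label, curr_batch_deleted_ids):
    return id in id_to_label and id not in curr_batch_deleted_ids
```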

Follow-ups

  • Figure out a way to propagate warnings to the client from Chroma server
  • Re-evaluate the ordering of subscription notifications, e.g. vector segment first, then metadata (single-node only)
  • Consider implementing a rollback in the embedding queue on failure (single-node only)
  • Stricter enforcement of invariants - duplicates should raise an exception

Testing

Existing tests fail to catch the error for the following reasons:

  • Not enough data is generated/added - in tracing the existing test_embeddings.py, the observation is that across all state machine iterations the existing embeddings rarely exceed 50, while hnsw:batch_size is never configured (it defaults to 100), so the segment never moves vectors to HNSW.
  • The test is missing with_persistent_hnsw_params to allow a lower record count to cross the batch_size threshold.
  • The inherent randomness of hypothesis may never reach the prerequisite conditions (see the example below).
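As an example of the direction (hedged: this mirrors the repro above rather than the final test code), lowering the thresholds via collection metadata makes the BF-to-HNSW transfer reachable with only a handful of records:

```python
import chromadb

# Hypothetical test setup: small thresholds so a few adds cross the
# batch_size boundary and exercise the HNSW transfer path.
client = chromadb.PersistentClient("hnsw_params_test")
collection = client.get_or_create_collection(
    "state_machine_test",
    metadata={"hnsw:batch_size": 3, "hnsw:sync_threshold": 5},
)
```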

@tazarov (Contributor, Author) commented Apr 25, 2024

This stack of pull requests is managed by Graphite.


Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality? (Readability, Modularity, Intuitiveness)

@ibratoev (Contributor)

Nice description!
I would also explore why the existing test suite does not catch such an issue, and improve the tests accordingly.

@tazarov (Contributor, Author) commented May 2, 2024

⚠️ But wait, there's more ... TBD

vercel bot commented May 2, 2024

chroma: ✅ Ready (preview updated May 14, 2024 2:42pm UTC)

```diff
@@ -292,3 +296,20 @@ def ann_accuracy(
     # Ensure that the query results are sorted by distance
     for distance_result in query_results["distances"]:
         assert np.allclose(np.sort(distance_result), distance_result)
+
+
+def segments_len_match(api: ServerAPI, collection: Collection) -> None:
```
Collaborator commented:

nice

```diff
@@ -298,8 +298,12 @@ def collections(
         metadata.update(test_hnsw_config)
     if with_persistent_hnsw_params:
         metadata["hnsw:batch_size"] = draw(st.integers(min_value=3, max_value=2000))
+        # batch_size > sync_threshold doesn't make sense
```
Collaborator commented:

nice

```python
        metadata=collection.metadata,
        embedding_function=collection.embedding_function,
    )
except Exception as e:
```
Collaborator commented:

What's this doing and why? Seems comment-worthy.

@tazarov (Contributor, Author) replied:

hnsw:batch_size is only possibly persisted in client/server mode. So, instead of making a more complex change to the test rig, this is one way to detect whether the client is persistent.

@tazarov force-pushed the trayan-04-25-_bug_the_batch_the_sync_and_the_missing_vector branch from b613f2e to 805bc79 (May 14, 2024 14:41)