[BUG]: The batch, the sync and the missing vector #2062

Open

tazarov wants to merge 5 commits into main from trayan-04-25-_bug_the_batch_the_sync_and_the_missing_vector
Conversation

@tazarov (Contributor) commented Apr 25, 2024

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • Under specific conditions, the metadata and binary (vector) indices go out of sync, causing errors in get() and loss of vector data

Test plan

How are these changes tested?

  • Tests pass locally with pytest for Python, yarn test for JS, cargo test for Rust

Documentation Changes

Affected issues

The following is a list of Discord discussions related to this issue:

Root Cause Analysis

TL;DR: Under specific (note: specific, not special) conditions, the metadata and vector segments go out of sync due to batching mechanics, causing vector data to be lost.

The Detail

A simple scenario: a user adds data to a collection, enough for the data to be moved from the brute-force index to HNSW (e.g. batch_size, which defaults to 100, is exceeded). At some point, the user decides they need to update a document (already in HNSW) and replace it with a fresh copy, a fairly common use case for making RAG systems useful. Down the line, the user calls delete() to remove the desired document's ID and add() to insert the new document. Chroma offers upsert(), but the number of affected issues and discussions shows that some people prefer delete/add mechanics over upsert. At the moment of insertion of the new data, they are greeted with `Add of existing embedding ID:`. It looks like a warning, and most people, including myself, didn't think much of it (I even went as far as creating a PR to bypass the warnings in WAL replays - https://github.com/chroma-core/chroma/pull/1763/files). In reality, underneath, the HNSW batching mechanism was silently discarding vector data for recently deleted vectors, causing the metadata and vector segments to go out of sync and leading to the following three types of problems on a subsequent get(include=["embeddings"]):

  • IndexError: list assignment index out of range
  • TypeError: 'NoneType' object is not subscriptable
  • No error at all but a mismatch in returned data lengths for IDs and embeddings

Note: We’ll cover more on the above errors below, and why the same underlying issue produces inconsistent error scenarios.

How to reproduce

```python
import shutil
import uuid

import chromadb

shutil.rmtree("get_vector_test", ignore_errors=True)
client = chromadb.PersistentClient("get_vector_test")

# Low batch_size so the BF -> HNSW transfer happens after just 10 adds
collection = client.get_or_create_collection(
    "test", metadata={"hnsw:batch_size": 10, "hnsw:sync_threshold": 20}
)

# 11 items: one more than batch_size, so the batch is applied to HNSW
items = [(f"index-id-{i}-{uuid.uuid4()}", i, [0.1] * 2) for i in range(11)]
ids = [item[0] for item in items]
embeddings = [item[2] for item in items]
collection.add(ids=ids, embeddings=embeddings)

print("Working with id: ", ids[0])
collection.delete(ids=[ids[0]])                     # staged in the batch
collection.add(ids=[ids[0]], embeddings=[[1] * 2])  # silently rejected
collection.get(include=["embeddings"])  # fails or returns mismatched data
```

What is affected

The defect affects PersistentClient and Chroma server.

**Why isn't in-memory affected?**

In-memory indices are not affected because the batch is updated and synchronized at the end of each transaction; see the two relevant locations in `_write_records` of `local_hnsw.py`.
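As a rough illustration (a simplified sketch, not the verbatim `local_hnsw.py` source), the in-memory write path applies the pending batch before the call returns, so its ID bookkeeping can never lag behind the metadata segment:

```python
# Simplified sketch (illustrative; the real local_hnsw.py differs):
# the in-memory segment applies its pending batch synchronously within
# every write, so its ID bookkeeping never lags behind the metadata
# segment the way the persistent segment's can.
class InMemorySegmentSketch:
    def __init__(self):
        self._index = {}    # stand-in for the HNSW index
        self._pending = []  # stand-in for the current batch

    def _write_records(self, records):
        for op, id, vector in records:
            self._pending.append((op, id, vector))  # stage the operation
        self._apply_batch()  # flush immediately: nothing stays pending

    def _apply_batch(self):
        for op, id, vector in self._pending:
            if op == "add":
                self._index[id] = vector
            elif op == "delete":
                self._index.pop(id, None)
        self._pending.clear()
```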

What really happens

Let’s start by visualizing things to illustrate how the defect works:

The happy path

The happy path is a normal vector segment layout: we have some data in the HNSW index and some in the brute-force (BF) index. The Batch keeps track of additions and deletions so that they can be synced happily once the BF index overflows batch_size. A rough mental model of this batching is sketched after the diagram below.

[Image: happy-path vector segment layout - HNSW index, BF index, and Batch]
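As a rough mental model of that batching (a sketch under assumed semantics, not Chroma's actual implementation; hnsw:batch_size and hnsw:sync_threshold are the real metadata keys):

```python
# Illustrative sketch of persistent-segment batching (not the real code).
BATCH_SIZE = 10      # hnsw:batch_size - when BF contents move to HNSW
SYNC_THRESHOLD = 20  # hnsw:sync_threshold - when HNSW is persisted

bf_index: list = []    # brute-force index: recent, unmerged vectors
hnsw_index: list = []  # stand-in for the HNSW index

def add(vector) -> None:
    bf_index.append(vector)
    if len(bf_index) >= BATCH_SIZE:
        hnsw_index.extend(bf_index)  # batch transfer BF -> HNSW
        bf_index.clear()
        if len(hnsw_index) >= SYNC_THRESHOLD:
            pass                     # persisting HNSW to disk happens here
```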

For the above layout, a regular query of the vector segment would look like this:

[Image: query flow over the happy-path layout]

There are two loops in the get_vectors() method. The initial loop prefills results from the BF index and records placeholders for HNSW IDs:

```python
for i, id in enumerate(target_ids):
    if id in ids_bf:
        results.append(self._brute_force_index.get_vectors([id])[0])
    elif id in ids_hnsw and id not in self._curr_batch._deleted_ids:
        hnsw_labels.append(self._id_to_label[id])
        # Placeholder for hnsw results to be filled in down below so we
        # can batch the hnsw get() call
        results.append(None)
        id_to_index[id] = i
```

The secondary batching loop fetches a batch of vectors from HNSW and fills them into the result set:

```python
if len(hnsw_labels) > 0 and self._index is not None:
    vectors = cast(Sequence[Vector], self._index.get_items(hnsw_labels))
    for label, vector in zip(hnsw_labels, vectors):
        id = self._label_to_id[label]
        results[id_to_index[id]] = VectorEmbeddingRecord(
            id=id, embedding=vector
        )
```

When things operate under normal conditions, as seen above, the id_to_index and the results align perfectly.

Now, let’s look at what happens when a vector is removed:

[Image: vector segment layout after delete() - ID 1 staged in the batch's deleted items]

The above shows the vector segment layout (state) after a delete() operation. An important fact to observe here is that while ID 1 goes into the batch's deleted items, it is not yet removed from the HNSW index, including its metadata held in _id_to_label, _label_to_id and _id_to_seq_id. Keep this in mind; it's important in the next diagram. Sending a get() at this stage still returns correct results, as HNSW vectors are fetched with IDs coming from the metadata index:

```python
vectors = vector_segment.get_vectors(ids=vector_ids)
```

The metadata segment is successfully updated to remove the ID from the SQLite tables:

[Image: metadata segment after delete() - ID removed from the SQLite tables]
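For illustration only (the actual schema and queries differ), the effect on the metadata side is equivalent to deleting the row for that ID from SQLite:

```python
# Illustrative sketch: the metadata segment's delete removes the ID from
# its SQLite tables immediately (real table/column names differ).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embeddings (embedding_id TEXT PRIMARY KEY)")
conn.execute("INSERT INTO embeddings VALUES ('index-id-1')")
conn.execute("DELETE FROM embeddings WHERE embedding_id = 'index-id-1'")
conn.commit()  # no trace of the ID remains on the metadata side
```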

So what happens when we add():

[Image: segment layouts after add() - metadata segment updated, vector segment rejects the update]

The WAL (Embedding Queue) works in a pub-sub way where each segment registers for updates. Each time a user adds data to Chroma, the embedding queue distributes that to all segment subscriptions. In single-node Chroma, there are just two segments for each collection:

  • Metadata segment subscription
  • Vector/HNSW local segment subscription

To ensure that your data is safely stored in the segments, Chroma notifies each segment sequentially and synchronously. The sequencing, however, provides no guarantee about which segment gets the update first:

```python
def _notify_all(self, topic: str, embeddings: Sequence[LogRecord]) -> None:
    """Send a notification to each subscriber of the given topic."""
    if self._running:
        for sub in self._subscriptions[topic]:
            self._notify_one(sub, embeddings)
```

While this may not seem relevant, it is an important detail when considering the solution to this problem.

As seen in the diagram above, the metadata segment is updated fine, as it no longer holds any reference to ID 1, while the vector segment rejects the update because it can still see the ID in its _id_to_label HNSW metadata. It is important to observe that the rejection does not result in an exception but in a mere warning, which in a client/server setup does not even make it to the client.
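A minimal sketch of that rejection path (illustrative; the names follow the prose above, the exact code differs):

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative sketch of the duplicate check on add. The delete is only
# staged in the batch, so the ID is still present in the HNSW
# bookkeeping (_id_to_label), and the legitimate re-add is dropped with
# nothing but a warning.
def stage_add(id, vector, id_to_label, pending_adds):
    if id in id_to_label:  # stale entry: ignores the batch's deleted IDs
        logger.warning("Add of existing embedding ID: %s", id)
        return             # vector silently discarded; segments diverge
    pending_adds[id] = vector
```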

So here we are: the metadata and vector segments are out of sync. This is not immediately visible; other operations such as add(), query(), etc. all work just fine until you get to get(). That is where you are confronted with the errors above, once you also try to include the embeddings (vectors).

But why does this problem surface in three different ways? The answer is deceptively simple: the key arrangement of the id_to_index dictionary:

```python
id_to_index: Dict[str, int] = {}
```

The arrangement largely depends on the IDs used; in our experiments, we used UUIDv4, which appears to be the most common approach people take to generating IDs in Chroma. The inherent randomness of UUIDs makes the key ordering within id_to_index unpredictable. In our experimentation, we’ve observed the following three states of the keys within id_to_index:

[Image: three observed key arrangements of id_to_index]

As the diagram exhibits, in the out-of-sync layout of the vector segment the baseline IDs come from the metadata segment, while the color coding indicates which subset of the vector segment they belong to: BF, HNSW, or missing (in red) if in neither.

In (1), the missing ID is at the beginning, so the batch ID fetching and assignment in results are not affected, which lets the results surface in SegmentAPI, where a TypeError: 'NoneType' object is not subscriptable is thrown because the first item in results is None.

In (2), the missing ID is somewhere in the middle of the keys, so an IndexError: list assignment index out of range is thrown within the vector segment during the batch fetching and assignment of results.

In (3), the missing ID is at the end of the id_to_index keys, which lets the missing result pass through both the vector segment and SegmentAPI completely unnoticed, returning results with mismatched cardinalities to the client.

Note: It is possible that we missed a corner case of (1) where there is a different ordering of BF and HNSW keys. However, in our experimentation, we have not observed anything other than the aforementioned errors. The conclusion is that even if there are sub-cases of (1), or of the other two scenarios, they result in the same error set.
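To make the outcomes concrete, here is a self-contained simulation of the two get_vectors() loops (simplified; all data is synthetic, and the missing ID matches neither branch of the first loop):

```python
from typing import Dict, List, Optional

def get_vectors_sim(target_ids, bf, hnsw, deleted):
    """Simplified re-enactment of the two loops in get_vectors()."""
    results: List[Optional[List[float]]] = []
    id_to_index: Dict[str, int] = {}
    hnsw_ids: List[str] = []
    for i, id in enumerate(target_ids):         # loop 1: prefill from BF
        if id in bf:
            results.append(bf[id])
        elif id in hnsw and id not in deleted:  # the missing ID matches
            hnsw_ids.append(id)                 # neither branch, so it gets
            results.append(None)                # no placeholder and no
            id_to_index[id] = i                 # id_to_index entry
    for id in hnsw_ids:                         # loop 2: fill HNSW batch
        results[id_to_index[id]] = hnsw[id]     # index can exceed len(results)
    return results

hnsw = {"a": [0.1], "b": [0.2], "x": [0.9]}     # "x" is the missing ID:
deleted = {"x"}                                 # staged for deletion

# (2) missing ID in the middle -> IndexError: list assignment index out of range
try:
    get_vectors_sim(["a", "x", "b"], {}, hnsw, deleted)
except IndexError as e:
    print("middle:", e)

# (3) missing ID at the end -> no error, but 3 requested IDs vs 2 results
print("end:", len(get_vectors_sim(["a", "b", "x"], {}, hnsw, deleted)))

# (1) depends on the BF/HNSW mix: when an unfilled None placeholder
# survives into SegmentAPI, subscripting it there raises
# TypeError: 'NoneType' object is not subscriptable
```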

Here’s the distribution of the errors:

[Image: error distribution across the three scenarios]

Key takeaways

  • The defect is easy to reproduce
  • The inconsistency of the errors (although the distribution leans heavily toward IndexError) makes the issue somewhat difficult to diagnose, especially with unlucky distributions of UUIDs or whatever IDs are used in tests
  • Metadata and vector segments go out of sync
  • Use of upsert() on a missing ID fixes the out-of-sync for that record
  • query(), which relies on metadata pre-filtering and HNSW filtering, does not appear to be affected by an execution error. However, user expectations might not be met, given that the document is visible in Chroma but a search for a similar or exact item does not match it
  • Original data can be recovered from WAL but requires specialized tooling
  • The extent of the issue is hard to estimate as it has existed ever since 0.4.x

Solutions

We’ve explored four possible solutions as follows:

  • 🚫 Change the delete mechanics to bypass batching - immediately remove vectors from HNSW and its metadata. After discussion with @HammadB, we decided to keep the batch semantics, as immediate removal makes the behavior harder to reason about.
  • 🚫 Add dict(s) to track deleted vectors and labels, plus the associated implementation to filter them out of get() and query(). We decided not to go for this approach for the following reasons:
    • Much more involved implementation
    • Tackles the root cause, but with a complexity trade-off
  • 🚫 Throw an exception instead of a warning on duplicate add. We decided not to go for this for the following reasons:
    • It requires making architectural decisions with respect to the ordering of segment notifications and possibly implementing a rollback
    • Complex
    • Does not directly tackle the underlying problem
  • 🚫 Better intersection of metadata IDs with vector segment IDs (BF and HNSW). We decided not to go with this approach, even though it is simpler than all of the above, because it does not tackle the underlying problem directly.
  • 🎉 Improve the index membership check for IDs - this is the one-liner winning solution (see the sketch below). Thanks @HammadB for the perspective shift.
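A sketch of the winning idea (illustrative; not the exact diff): when deciding whether an incoming ID already exists, an ID staged for deletion in the current batch must count as absent:

```python
# Illustrative one-liner: membership should exclude IDs that the current
# batch has staged for deletion, so a delete-then-add is not rejected
# as a duplicate.
def id_exists(id, id_to_label, curr_batch_deleted_ids):
    return id in id_to_label and id not in curr_batch_deleted_ids
```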

Follow-ups

  • Figure out a way to propagate warnings to the client from Chroma server
  • Re-evaluate the ordering of subscription notifications, e.g. vector segment first, then metadata (single-node only)
  • Consider implementing a rollback in the embedding queue on failure (single-node only)
  • Stricter enforcement of invariants - duplicates should raise an exception

Testing

Existing tests fail to catch the error for the following reasons:

  • Not enough data is generated/added - in tracing the existing test_embeddings.py, the observation is that across all state machine iterations the existing embeddings rarely exceed 50, while hnsw:batch_size is never configured (it defaults to 100), so the segment never moves vectors to HNSW.
  • The test is missing with_persistent_hnsw_params to allow a lower record count to cross the batch_size threshold.
  • The inherent randomness of hypothesis may never reach the prerequisite conditions (see the example below).
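As an example of the direction (hedged: this mirrors the repro above rather than the final test code), lowering the thresholds via collection metadata makes the BF-to-HNSW transfer reachable with only a handful of records:

```python
import chromadb

# Hypothetical test setup: small thresholds so a few adds cross the
# batch_size boundary and exercise the HNSW transfer path.
client = chromadb.PersistentClient("hnsw_params_test")
collection = client.get_or_create_collection(
    "state_machine_test",
    metadata={"hnsw:batch_size": 3, "hnsw:sync_threshold": 5},
)
```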

@tazarov (Contributor, Author) commented Apr 25, 2024

This stack of pull requests is managed by Graphite.


Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality? (Readability, Modularity, Intuitiveness)

@ibratoev (Contributor)

Nice description!
I would also explore why the existing test suite does not catch such an issue, and improve the tests accordingly.

@tazarov (Contributor, Author) commented May 2, 2024

⚠️ But wait, there's more ... TBD

vercel bot commented May 2, 2024

chroma: ✅ Ready (preview updated May 14, 2024 2:42pm UTC)

```diff
@@ -292,3 +296,20 @@ def ann_accuracy(
     # Ensure that the query results are sorted by distance
     for distance_result in query_results["distances"]:
         assert np.allclose(np.sort(distance_result), distance_result)
+
+
+def segments_len_match(api: ServerAPI, collection: Collection) -> None:
```
Collaborator commented:

nice

```diff
@@ -298,8 +298,12 @@ def collections(
         metadata.update(test_hnsw_config)
     if with_persistent_hnsw_params:
         metadata["hnsw:batch_size"] = draw(st.integers(min_value=3, max_value=2000))
+        # batch_size > sync_threshold doesn't make sense
```
Collaborator commented:

nice

```python
        metadata=collection.metadata,
        embedding_function=collection.embedding_function,
    )
except Exception as e:
```
Collaborator commented:

What's this doing and why? Seems comment-worthy.

@tazarov (Contributor, Author) replied:

hnsw:batch_size is only possibly persisted in client/server mode. So, instead of making a more complex change to the test rig, this is one way to detect whether the client is persistent.

@tazarov force-pushed the trayan-04-25-_bug_the_batch_the_sync_and_the_missing_vector branch from b613f2e to 805bc79 (May 14, 2024 14:41)