New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merge property buckets #4729
Draft
donomii
wants to merge
71
commits into
main
Choose a base branch
from
more-merge-buckets-rebase-main-merge-2
base: main
Could not load branches
Branch not found: {{ refName }}
Could not load tags
Nothing to show
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Merge property buckets #4729
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…te/weaviate into more-merge-buckets-rebase
This reverts commit 5f56a7f.
donomii
changed the title
More merge buckets rebase main merge 2
Merge property buckets
May 23, 2024
Quality Gate failedFailed conditions |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What's being changed:
What's being changed:
Due to the fact that these changes are very large and touch most parts of the (sparse) search, I'm writing up the changes to make them more comprehensible.
As usual, I attempted to make the minimum changes that would deliver the goal of merging the buckets. There were many places where I noticed that it would be possible to improve code by refactoring, but if I did, this would have turned into a complete rewrite of the sparse search. That might be a good thing, but it wasn't part of the task.
I did make the following notable changes.
All property (and internal property) buckets were merged based on their type. Filterable buckets were merged into "filterable_buckets", searchable buckets were merged into "searchable_buckets". The objectsLSM buckets was left entirely unchanged, as was any code that accesses it directly.
As a result of merging the buckets, it is no longer possible to check for an index by looking for a file. Going forward, all checks for configuration must consult the schema. There are no other ways to divine the current settings.
All indexes are available, by default, because they are all stored in the same bucket. So e.g. lengths, timestamps, etc, can be read or written at any time. The only question is if they were active during data load, which is why it is necessary to consult the schema.
To keep the properties separate in the merged buckets, weaviate now creates postfixes for keys. Because property names can be large, each property is assigned a number, then the byte wise representation of that number is used as a postfix for the key. To make this manageable, I created a bucketProxy class, which automatically adds the postfix before passing the call to the real bucket store.
All property access must go through a BucketProxy. A BucketProxy wraps a bucket and a prefix, and modifies the key to retrieve only the correct properties. The object store is completely unchanged, and you still access this directly.
A new adjunct index called "propids" was created to map property names to property ids. BucketProxy only works with id numbers, not the property names.
Any damage to the property id index will corrupt or lose data from the main indexes, i.e. filterable_properties and searchable_properties.
Further work:
Across the whole weaviate codebase, code should always and only check settings by looking them up in the schema.
The schema is the only place that has the correct information. Looking anywhere else risks running the wrong code due to a second error somewhere else in the code. In particular, do not check for the presence of a file to see if the schema has an index. This results in situations where, if the file can't be found, weaviate will run the wrong code, but when the user checks, the schema shows the correct information.
Especially do not attempt to detect the type of the bucket from its filename. If you need to know the type, look in the schema. If you don't have access to the schema, then you need to refactor the code until you do have access to the schema.
One effect of this is that there are tests that expect a fail when they try to load e.g. the length bucket and it isn't there. I had to remove them because we now always have a bucket to record the lengths, it's the merged bucket. The only question is, did we record the lengths when the data was loaded? The answer can only be in the schema.
Stop doing in place modifications
I found at least one instance of corruption caused by modifying values on a structure, there are certainly more possibilities in the code. Unless the function is obviously the owner of the struct, it shouldn't modify it.
Use types
Currently there are multiple types of buckets, which function differently and are incompatible. They should be put in their own separate classes, so that they cannot be used in the wrong place. The presence of functions like CursorRoaringSetKeyOnly() is a clear code smell.
pass the schema everywhere
Almost every part of the code needs to have access to the schema to make decisions about how to process data, so that should be available. It should be available in a form that is easier to access than the current situation, where even figuring out which properties and their types, are present is a significant challenge.
Review checklist