Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge property buckets #4729

Draft
wants to merge 71 commits into
base: main
Choose a base branch
from
Draft

Conversation

donomii
Copy link
Contributor

@donomii donomii commented Apr 19, 2024

What's being changed:

What's being changed:
Due to the fact that these changes are very large and touch most parts of the (sparse) search, I'm writing up the changes to make them more comprehensible.

As usual, I attempted to make the minimum changes that would deliver the goal of merging the buckets. There were many places where I noticed that it would be possible to improve code by refactoring, but if I did, this would have turned into a complete rewrite of the sparse search. That might be a good thing, but it wasn't part of the task.

I did make the following notable changes.

All property (and internal property) buckets were merged based on their type. Filterable buckets were merged into "filterable_buckets", searchable buckets were merged into "searchable_buckets". The objectsLSM buckets was left entirely unchanged, as was any code that accesses it directly.

As a result of merging the buckets, it is no longer possible to check for an index by looking for a file. Going forward, all checks for configuration must consult the schema. There are no other ways to divine the current settings.

All indexes are available, by default, because they are all stored in the same bucket. So e.g. lengths, timestamps, etc, can be read or written at any time. The only question is if they were active during data load, which is why it is necessary to consult the schema.

To keep the properties separate in the merged buckets, weaviate now creates postfixes for keys. Because property names can be large, each property is assigned a number, then the byte wise representation of that number is used as a postfix for the key. To make this manageable, I created a bucketProxy class, which automatically adds the postfix before passing the call to the real bucket store.

All property access must go through a BucketProxy. A BucketProxy wraps a bucket and a prefix, and modifies the key to retrieve only the correct properties. The object store is completely unchanged, and you still access this directly.

A new adjunct index called "propids" was created to map property names to property ids. BucketProxy only works with id numbers, not the property names.

Any damage to the property id index will corrupt or lose data from the main indexes, i.e. filterable_properties and searchable_properties.

Further work:

Across the whole weaviate codebase, code should always and only check settings by looking them up in the schema.
The schema is the only place that has the correct information. Looking anywhere else risks running the wrong code due to a second error somewhere else in the code. In particular, do not check for the presence of a file to see if the schema has an index. This results in situations where, if the file can't be found, weaviate will run the wrong code, but when the user checks, the schema shows the correct information.

Especially do not attempt to detect the type of the bucket from its filename. If you need to know the type, look in the schema. If you don't have access to the schema, then you need to refactor the code until you do have access to the schema.

One effect of this is that there are tests that expect a fail when they try to load e.g. the length bucket and it isn't there. I had to remove them because we now always have a bucket to record the lengths, it's the merged bucket. The only question is, did we record the lengths when the data was loaded? The answer can only be in the schema.

Stop doing in place modifications
I found at least one instance of corruption caused by modifying values on a structure, there are certainly more possibilities in the code. Unless the function is obviously the owner of the struct, it shouldn't modify it.

Use types
Currently there are multiple types of buckets, which function differently and are incompatible. They should be put in their own separate classes, so that they cannot be used in the wrong place. The presence of functions like CursorRoaringSetKeyOnly() is a clear code smell.

pass the schema everywhere
Almost every part of the code needs to have access to the schema to make decisions about how to process data, so that should be available. It should be available in a form that is easier to access than the current situation, where even figuring out which properties and their types, are present is a significant challenge.

Review checklist

  • Documentation has been updated, if necessary. Link to changed documentation:
  • Chaos pipeline run or not necessary. Link to pipeline:
  • All new code is covered by tests where it is reasonable.
  • Performance tests have been run or not necessary.

@donomii donomii self-assigned this Apr 19, 2024
@donomii donomii changed the title More merge buckets rebase main merge 2 Merge property buckets May 23, 2024
Copy link

sonarcloud bot commented May 23, 2024

Quality Gate Failed Quality Gate failed

Failed conditions
10.2% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant