Merge property buckets #4729

donomii · 2024-04-19T14:04:23Z

What's being changed:

Because these changes are very large and touch most parts of the (sparse) search, I'm writing them up to make them easier to follow.

As usual, I attempted to make the minimum changes that would deliver the goal of merging the buckets. There were many places where I noticed that it would be possible to improve code by refactoring, but if I did, this would have turned into a complete rewrite of the sparse search. That might be a good thing, but it wasn't part of the task.

I did make the following notable changes.

All property (and internal property) buckets were merged based on their type. Filterable buckets were merged into "filterable_buckets", and searchable buckets were merged into "searchable_buckets". The objectsLSM bucket was left entirely unchanged, as was any code that accesses it directly.

As a result of merging the buckets, it is no longer possible to check for an index by looking for a file. Going forward, all checks for configuration must consult the schema. There are no other ways to divine the current settings.

All indexes are available by default, because they are all stored in the same bucket. So lengths, timestamps, etc. can be read or written at any time. The only question is whether they were active during data load, which is why it is necessary to consult the schema.

To keep the properties separate in the merged buckets, Weaviate now appends a postfix to each key. Because property names can be large, each property is assigned a number, and the bytewise representation of that number is used as the postfix for the key. To make this manageable, I created a BucketProxy class, which automatically adds the postfix before passing the call to the real bucket store.
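A minimal sketch of the key scheme described above. The function names, the fixed 8-byte width, and the big-endian byte order are all assumptions made for illustration; the actual encoding in this PR may differ.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// postfixLen is an assumed fixed width for the encoded property id.
const postfixLen = 8

// propertyPostfix encodes a property id as a fixed-width byte suffix.
// Big-endian order is an assumption, not taken from the PR.
func propertyPostfix(propID uint64) []byte {
	buf := make([]byte, postfixLen)
	binary.BigEndian.PutUint64(buf, propID)
	return buf
}

// keyWithPostfix returns a new slice holding key followed by the
// property postfix. The caller's key slice is never written to.
func keyWithPostfix(key []byte, propID uint64) []byte {
	out := make([]byte, 0, len(key)+postfixLen)
	out = append(out, key...)
	return append(out, propertyPostfix(propID)...)
}

func main() {
	// The same user key stored for two properties yields two distinct keys.
	fmt.Printf("%q\n", keyWithPostfix([]byte("hello"), 1))
	fmt.Printf("%q\n", keyWithPostfix([]byte("hello"), 2))
}
```

A fixed-width id keeps the postfix short regardless of how long the property name is, which is the motivation given above.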

All property access must go through a BucketProxy. A BucketProxy wraps a bucket and a prefix, and modifies the key to retrieve only the correct properties. The object store is completely unchanged, and you still access this directly.
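The proxy idea can be sketched as follows. This is not the code from the PR: the `Bucket` interface, the in-memory `memBucket`, and the method set on `BucketProxy` are hypothetical, and the key is scoped with a suffix as in the postfix description above.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Bucket is a hypothetical minimal key-value interface standing in
// for the real bucket store.
type Bucket interface {
	Put(key, value []byte)
	Get(key []byte) ([]byte, bool)
}

// memBucket is an in-memory Bucket used only for this sketch.
type memBucket struct{ data map[string][]byte }

func newMemBucket() *memBucket { return &memBucket{data: map[string][]byte{}} }

func (b *memBucket) Put(key, value []byte) {
	b.data[string(key)] = append([]byte(nil), value...)
}

func (b *memBucket) Get(key []byte) ([]byte, bool) {
	v, ok := b.data[string(key)]
	return v, ok
}

// BucketProxy scopes every key to a single property id before
// delegating to the shared bucket.
type BucketProxy struct {
	bucket Bucket
	propID uint64
}

// scopedKey builds a fresh key: the caller's key plus the encoded id.
func (p BucketProxy) scopedKey(key []byte) []byte {
	out := make([]byte, len(key)+8)
	copy(out, key)
	binary.BigEndian.PutUint64(out[len(key):], p.propID)
	return out
}

func (p BucketProxy) Put(key, value []byte) { p.bucket.Put(p.scopedKey(key), value) }

func (p BucketProxy) Get(key []byte) ([]byte, bool) { return p.bucket.Get(p.scopedKey(key)) }

func main() {
	shared := newMemBucket()
	title := BucketProxy{bucket: shared, propID: 1}
	body := BucketProxy{bucket: shared, propID: 2}

	title.Put([]byte("red"), []byte("from title"))
	_, found := body.Get([]byte("red"))
	fmt.Println("visible through the other proxy:", found)
}
```

Callers only ever see their own property's keys, so code written against a per-property bucket needs few changes beyond swapping in the proxy.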

A new adjunct index called "propids" was created to map property names to property ids. BucketProxy only works with id numbers, not the property names.
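As a rough illustration of the name-to-id mapping, here is an in-memory sketch. The type `PropIDs`, its methods, and the choice to start ids at 1 are hypothetical; the real "propids" index is stored on disk and must survive restarts, which this sketch does not attempt.

```go
package main

import "fmt"

// PropIDs is a hypothetical in-memory stand-in for the "propids" index.
type PropIDs struct {
	ids  map[string]uint64
	next uint64
}

// NewPropIDs returns an empty registry. Starting at 1 is an assumption.
func NewPropIDs() *PropIDs {
	return &PropIDs{ids: map[string]uint64{}, next: 1}
}

// IDFor returns the id for a property name, assigning the next free
// id the first time a name is seen. Ids are never reused or changed,
// because stored keys depend on them.
func (p *PropIDs) IDFor(name string) uint64 {
	if id, ok := p.ids[name]; ok {
		return id
	}
	id := p.next
	p.ids[name] = id
	p.next++
	return id
}

// Lookup returns the id for a name without assigning a new one.
func (p *PropIDs) Lookup(name string) (uint64, bool) {
	id, ok := p.ids[name]
	return id, ok
}

func main() {
	reg := NewPropIDs()
	fmt.Println(reg.IDFor("title"), reg.IDFor("body"), reg.IDFor("title"))
}
```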

Any damage to the property id index will corrupt or lose data from the main indexes, i.e. filterable_properties and searchable_properties.

Further work:

Across the whole weaviate codebase, code should always and only check settings by looking them up in the schema.
The schema is the only place that has the correct information. Looking anywhere else risks running the wrong code due to a second error somewhere else in the code. In particular, do not check for the presence of a file to see if the schema has an index. This results in situations where, if the file can't be found, weaviate will run the wrong code, but when the user checks, the schema shows the correct information.

Especially do not attempt to detect the type of the bucket from its filename. If you need to know the type, look in the schema. If you don't have access to the schema, then you need to refactor the code until you do have access to the schema.
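A sketch of what "ask the schema" could look like in code. The `Schema` and `Property` types below are simplified, hypothetical stand-ins, not Weaviate's actual schema types.

```go
package main

import "fmt"

// Property is a hypothetical, simplified view of one property's
// index settings as recorded in the schema.
type Property struct {
	Name            string
	IndexFilterable bool
	IndexSearchable bool
}

// Schema maps a class name to its properties (hypothetical shape).
type Schema struct {
	Classes map[string][]Property
}

// IsFilterable reports whether the schema says the property was
// indexed for filtering. It never touches the file system.
func (s Schema) IsFilterable(class, prop string) bool {
	for _, p := range s.Classes[class] {
		if p.Name == prop {
			return p.IndexFilterable
		}
	}
	return false
}

func main() {
	s := Schema{Classes: map[string][]Property{
		"Article": {
			{Name: "title", IndexFilterable: true, IndexSearchable: true},
			{Name: "summary", IndexSearchable: true},
		},
	}}
	fmt.Println(s.IsFilterable("Article", "title"))
	fmt.Println(s.IsFilterable("Article", "summary"))
}
```

The point of the design is that the answer comes from one source: a missing or unreadable file can no longer silently change which code path runs.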

One effect of this is that some tests expected a failure when they tried to load e.g. the length bucket and it wasn't there. I had to remove them, because we now always have a bucket to record the lengths: the merged bucket. The only question is whether we recorded the lengths when the data was loaded, and the answer can only be in the schema.

Stop doing in place modifications
I found at least one instance of corruption caused by modifying values on a structure in place, and there are certainly more possibilities in the code. Unless the function is obviously the owner of the struct, it shouldn't modify it.
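One way to follow this rule in Go is to copy before changing anything. The `Posting` type and `withDocID` function below are hypothetical examples, not taken from the codebase.

```go
package main

import "fmt"

// Posting is a hypothetical struct holding a slice that callers share.
type Posting struct {
	DocIDs []uint64
}

// withDocID returns a copy of p with id added. It copies the slice
// first, so the caller's backing array is never written to, even if
// it has spare capacity that append would otherwise reuse.
func withDocID(p Posting, id uint64) Posting {
	ids := make([]uint64, len(p.DocIDs), len(p.DocIDs)+1)
	copy(ids, p.DocIDs)
	return Posting{DocIDs: append(ids, id)}
}

func main() {
	original := Posting{DocIDs: []uint64{1, 2}}
	updated := withDocID(original, 3)
	fmt.Println(len(original.DocIDs), len(updated.DocIDs))
}
```

Slices make this easy to get wrong: passing a struct by value still shares the underlying array, so a plain `append` inside the function can alter data the caller owns.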

Use types
Currently there are multiple types of buckets, which function differently and are incompatible. They should be put in their own separate classes, so that they cannot be used in the wrong place. The presence of functions like CursorRoaringSetKeyOnly() is a clear code smell.
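A small sketch of the idea, with hypothetical type names and data layouts: each kind of bucket gets its own Go type, so passing the wrong kind is a compile error rather than a runtime surprise.

```go
package main

import "fmt"

// FilterableBucket and SearchableBucket are hypothetical wrapper
// types. Their internal layouts are placeholders for this sketch.
type FilterableBucket struct {
	ids map[string][]uint64 // value -> matching doc ids
}

type SearchableBucket struct {
	freqs map[string]map[uint64]float32 // term -> doc id -> frequency
}

// DocIDs exists only on filterable buckets.
func (b FilterableBucket) DocIDs(value string) []uint64 {
	return b.ids[value]
}

// TermFrequency exists only on searchable buckets.
func (b SearchableBucket) TermFrequency(term string, docID uint64) float32 {
	return b.freqs[term][docID]
}

// countMatches accepts only a FilterableBucket. Passing a
// SearchableBucket here would not compile.
func countMatches(b FilterableBucket, value string) int {
	return len(b.DocIDs(value))
}

func main() {
	f := FilterableBucket{ids: map[string][]uint64{"red": {1, 4}}}
	s := SearchableBucket{freqs: map[string]map[uint64]float32{"red": {1: 2}}}
	fmt.Println(countMatches(f, "red"), s.TermFrequency("red", 1))
}
```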

Pass the schema everywhere
Almost every part of the code needs access to the schema to make decisions about how to process data, so it should be available everywhere. It should also be easier to access than it is now, where even figuring out which properties are present, and their types, is a significant challenge.

Review checklist

Documentation has been updated, if necessary. Link to changed documentation:
Chaos pipeline run or not necessary. Link to pipeline:
All new code is covered by tests where it is reasonable.
Performance tests have been run or not necessary.


sonarcloud · 2024-05-23T13:18:42Z

Quality Gate failed

Failed conditions
10.2% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud

donomii added 20 commits October 4, 2023 19:49

Merge buckets

819799b

Regenerate reformat

863a9ec

Merge buckets

2283acf

Regenerate reformat

f21e635

Merge branch 'more-merge-buckets-rebase' of https://github.com/weavia…

9ed7033

…te/weaviate into more-merge-buckets-rebase

Fixed spurios error messages

2a2e232

Merge branch 'main' into more-merge-buckets-rebase-main-merge-2

f2f0838

Merge shard.go

a9ba776

Fix more

9fc2f68

Revert "refact: thread safety for bucket creation and loading (#4422)"

c1bd143

This reverts commit 5f56a7f.

Moar

f685283

Fix merge

74b5c1e

fix moar

fa4d272

.

7e2f03d

.

fd1da59

.

69888c4

.

89b81f8

.

577e074

.

7f4355d

.

b987a36

donomii self-assigned this Apr 19, 2024

donomii added 9 commits April 20, 2024 01:11

migrator

9a60665

.

2a1720d

.

1a68937

.

6315376

.

c7cbbc1

Merge branch 'main' into more-merge-buckets-rebase-main-merge-2

3293cbf

Regenerate reformat

3b1dd35

.

ded6ee5

Regenerate reformat

e8ebe2d

donomii added 27 commits April 22, 2024 22:50

.

b037ac9

Regenerate reformat

722bdea

Regenerate reformat

63fdd2c

Merge branch 'main' into more-merge-buckets-rebase-main-merge-2

6963dd8

.

be8bba8

Regenerate reformat

22b79a5

Fix more tests

17d3a91

Merge branch 'main' into more-merge-buckets-rebase-main-merge-2

4defd4c

Merge branch 'main' into more-merge-buckets-rebase-main-merge-2

0a777c7

Merge branch 'main' into more-merge-buckets-rebase-main-merge-2

e477f3e

Fix again

c60c340

Remove debugging

8faebd2

Use var for go test

c00ab9f

Add new cursors. Rename helpers, add prefix/postfix funcs

b943a51

.

4889532

.

98938ed

.

7b2c868

.

b20d7c2

.

0c08895

.

0f293c8

.

437b9d2

.

6411d27

.

a99b25d

.

93af236

Merge branch 'main' into more-merge-buckets-rebase-main-merge-2

6d0dc58

Regenerate reformat

9def0c4

Fix init for old buckets code

e336824

donomii changed the title ~~More merge buckets rebase main merge 2~~ Merge property buckets May 23, 2024

Merge branch 'main' into more-merge-buckets-rebase-main-merge-2

5f3be80
