Last sprint, we had a bunch of database corruptions that were really hard to debug.
One way we avoided this kind of issue on other projects was by running a complete scan of the database in the tests after each insertion/commit.
Milli contains a lot of databases, but we should still implement the same kind of function.
I believe @dureuill already made something like that for the filters. It could be a good first step to integrate it into our test suite.
And then grow the number of checks over time.
This strategy has already been implemented in the index scheduler and arroy.
I think the most aggressive checks are made in arroy, here’s the code: https://github.com/meilisearch/arroy/blob/19e0a07d40fd2b7685b70c941fee00400e9dda24/src/reader.rs#L361-L441
But as a TLDR, here’s what I actually check:
All the trees must be valid
That means I can start from the root of a tree and follow every node down to every leaf without ever encountering an unknown node or item ID.
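As a sketch of what such a traversal check can look like, here is a toy version (the `Node` model and function names below are illustrative stand-ins, not arroy's actual types):

```rust
use std::collections::{HashMap, HashSet};

// Toy node model for illustration only; arroy's real node types are richer.
pub enum Node {
    Split { left: u32, right: u32 }, // internal node referencing two children
    Leaf { item: u32 },              // leaf wrapping one user-provided item id
}

// Walk a tree from its root and fail on the first reference to a node id that
// does not exist, or on a node reached twice (which would mean a cycle).
// On success, return the set of item ids reachable from the root.
pub fn assert_tree_is_valid(
    nodes: &HashMap<u32, Node>,
    root: u32,
) -> Result<HashSet<u32>, String> {
    let mut items = HashSet::new();
    let mut visited = HashSet::new();
    let mut stack = vec![root];
    while let Some(id) = stack.pop() {
        if !visited.insert(id) {
            return Err(format!("node {id} reached twice: the tree contains a cycle"));
        }
        match nodes.get(&id) {
            None => return Err(format!("tree references unknown node id {id}")),
            Some(Node::Leaf { item }) => {
                items.insert(*item);
            }
            Some(Node::Split { left, right }) => {
                stack.push(*left);
                stack.push(*right);
            }
        }
    }
    Ok(items)
}

// Tiny well-formed tree used in the demo below.
pub fn sample_tree() -> HashMap<u32, Node> {
    HashMap::from([
        (0, Node::Split { left: 1, right: 2 }),
        (1, Node::Leaf { item: 10 }),
        (2, Node::Leaf { item: 11 }),
    ])
}

fn main() {
    let items = assert_tree_is_valid(&sample_tree(), 0).unwrap();
    println!("items reachable from the root: {items:?}");
}
```

The returned item set is what the next check consumes.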
All the item IDs are used exactly once in every tree
While traversing a tree, I also ensure that every item ID the user provides is in the tree.
So, there is no extraneous ID and no missing ID.
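A minimal sketch of that comparison, assuming the traversal above already collected the item ids found in one tree (names are hypothetical):

```rust
use std::collections::HashSet;

// Compare the item ids collected while traversing one tree against the ids
// the user actually provided: nothing extraneous, nothing duplicated,
// nothing missing.
pub fn assert_items_match(
    found_in_tree: &[u32],
    provided: &HashSet<u32>,
) -> Result<(), String> {
    let mut seen = HashSet::new();
    for &item in found_in_tree {
        if !provided.contains(&item) {
            return Err(format!("extraneous item id {item}: the user never provided it"));
        }
        if !seen.insert(item) {
            return Err(format!("item id {item} appears more than once in the tree"));
        }
    }
    match provided.iter().find(|id| !seen.contains(*id)) {
        Some(missing) => Err(format!("item id {missing} is missing from the tree")),
        None => Ok(()),
    }
}

fn main() {
    let provided = HashSet::from([10, 11, 12]);
    assert_items_match(&[10, 11, 12], &provided).unwrap();
    println!("every provided item id is used exactly once");
}
```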
The node of one tree cannot be used in another tree
A tree should never contain a reference to another tree; if that happens, it means something got corrupted somewhere, and we re-used a wrong node ID in an incremental build process.
It could be really hard to fix if caught too late.
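One way to express this check, assuming each tree's traversal produced the set of node ids it owns (a sketch, not arroy's code):

```rust
use std::collections::{HashMap, HashSet};

// Every tree must own a disjoint set of node ids. A shared id means a node
// from one tree leaked into another, e.g. through a reused id during an
// incremental build.
pub fn assert_trees_are_disjoint(trees: &[HashSet<u32>]) -> Result<(), String> {
    let mut owner: HashMap<u32, usize> = HashMap::new();
    for (tree, nodes) in trees.iter().enumerate() {
        for &node in nodes {
            if let Some(previous) = owner.insert(node, tree) {
                return Err(format!(
                    "node {node} is used by both tree {previous} and tree {tree}"
                ));
            }
        }
    }
    Ok(())
}

fn main() {
    let trees = vec![HashSet::from([0, 1, 2]), HashSet::from([3, 4])];
    assert_trees_are_disjoint(&trees).unwrap();
    println!("no node is shared between trees");
}
```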
Nothing is left unknown in the database
After checking all of that, I go through an exhaustive list of everything in the database.
If I find anything else in the database, that’s a bug. It means that some nodes are leaked in the database, and over multiple indexing processes, the database could grow for no reason or, worse, cause corruption.
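The final sweep can be sketched like this, assuming the previous checks accumulated the set of all reachable ids (again a toy stand-in, not the real database iteration):

```rust
use std::collections::HashSet;

// Final sweep: every key present in the database must have been reached while
// validating the trees. Anything else is a leaked node that would make the
// database grow for no reason across indexing runs.
pub fn assert_no_leaked_entries(
    db_keys: &[u32],
    reachable: &HashSet<u32>,
) -> Result<(), String> {
    match db_keys.iter().find(|key| !reachable.contains(*key)) {
        Some(leaked) => Err(format!("entry {leaked} belongs to no tree: it leaked")),
        None => Ok(()),
    }
}

fn main() {
    let reachable = HashSet::from([0, 1, 2]);
    assert_no_leaked_entries(&[0, 1, 2], &reachable).unwrap();
    println!("every database entry is accounted for");
}
```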
This check is called everywhere
In the end, in arroy, every time I update a database, I snapshot its content on disk just to be sure it never changes in an unexpected way.
And the function in charge of snapshotting the database calls the assert validity function:
https://github.com/meilisearch/arroy/blob/19e0a07d40fd2b7685b70c941fee00400e9dda24/src/tests/mod.rs#L41
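The pattern is easy to reproduce in any test suite: make the snapshot helper the single choke point that asserts validity. Here is a minimal sketch where the `Database` type and its methods are hypothetical stand-ins, not arroy's actual API:

```rust
use std::collections::HashSet;

// Hypothetical stand-in database: an ordered list of (node id, value) pairs.
struct Database {
    entries: Vec<(u32, u32)>,
}

impl Database {
    // Stand-in validity check: here it only rejects duplicate node ids, but
    // in a real implementation this is where all the checks above would run.
    fn check_validity(&self) -> Result<(), String> {
        let mut seen = HashSet::new();
        for (id, _) in &self.entries {
            if !seen.insert(id) {
                return Err(format!("duplicate node id {id}"));
            }
        }
        Ok(())
    }

    // Snapshotting always asserts validity first, so a corrupted database can
    // never silently pass a snapshot test.
    fn snapshot(&self) -> String {
        self.check_validity().expect("the database is corrupted");
        self.entries.iter().map(|(id, v)| format!("{id}: {v}\n")).collect()
    }
}

fn main() {
    let db = Database { entries: vec![(0, 42), (1, 7)] };
    print!("{}", db.snapshot());
}
```

Because every test that inspects the database goes through `snapshot`, the validity assertion runs after each mutation for free.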