No details for the LMDB crashes mentioned in the README #216

Open
serefarikan opened this issue Jun 19, 2021 · 7 comments

Comments

@serefarikan

I've been looking for a rust binding to LMDB and I thought rkv may be a good candidate given it is under mozilla.

However, the README's recommendations for production use are quite conservative (full DB in memory and synced transactions), and it also refers to LMDB crashes that still need to be fixed.

It would be great if the README provided references to what these crashes are, since that statement made me question my positive view of LMDB's stability.

@serefarikan serefarikan changed the title No details for the LMDB crashes mentions in the README No details for the LMDB crashes mentioned in the README Jun 19, 2021
@nyanpasu64

Are the bugs in LMDB itself or in the Rust bindings? Do they trigger with single-instance, concurrent reads and writes, or heavy contention?

Is sled more reliable?

@caer

caer commented Oct 3, 2022

I saw this issue has been open without activity for a while. A brief search on Bugzilla revealed a related issue.

It seems like the LMDB crashes are caused by attempting to load a corrupt database file. No mention was made of other failure modes, and it was indicated that the crashes are an issue present in the upstream LMDB project. The LMDB authors, Symas, have indicated they won't fix it.

The crash seems to come from a panic within the LMDB API when opening a corrupted DB file, so a possible workaround might be to catch this specific panic (during the opening of the DB file). However, I'm still onboarding to Rust, so I'm not sure how appropriate something like catch_unwind would be.

Could one of the recent maintainers (@badboy, @saschanaz) confirm if the above failure mode is the one referenced ambiguously in the README?

@nyanpasu64

Not a maintainer, but I've since come across an LMDB failure mode in Baloo (not rkv-based): database corruption followed by uncontrolled crashes when opening the corrupt database. The bug report thread is at https://bugs.kde.org/show_bug.cgi?id=434926. The crash is a SIGBUS on Linux (untested on Windows), so you can't catch it through catch_unwind alone.
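To make the catch_unwind point concrete, here is a minimal sketch using only the Rust standard library. `open_db` and `try_open` are hypothetical names made up for illustration; they are not rkv or LMDB APIs. The sketch shows what catch_unwind can do (turn an unwinding Rust panic into an `Err`) and notes in comments what it cannot do.

```rust
use std::panic;

/// Hypothetical stand-in for "open the database file".
/// It panics in Rust code when told the file is corrupt.
fn open_db(corrupt: bool) -> u32 {
    if corrupt {
        panic!("corrupt database file");
    }
    42
}

/// Converts an unwinding Rust panic raised while opening into an `Err`.
///
/// Limits: this only intercepts Rust panics that unwind. A SIGBUS or
/// SIGSEGV raised inside C code is delivered by the OS as a signal and
/// never passes through here. It also has no effect when the program
/// is built with `panic = "abort"`.
fn try_open(corrupt: bool) -> Result<u32, String> {
    panic::catch_unwind(|| open_db(corrupt)).map_err(|payload| {
        // Panic payloads are usually `&str` or `String`.
        if let Some(s) = payload.downcast_ref::<&str>() {
            s.to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic".to_string()
        }
    })
}
```

Handling a signal such as SIGBUS would need a signal handler or a separate process, which is outside what this sketch covers.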

@badboy
Member

badboy commented Oct 11, 2022

We used to have links to the crashes we saw (bug 1538539, bug 1538541), not sure why/when we removed them. These are directly in LMDB (as @nyanpasu64 also mentioned), so catch_unwind in Rust is not enough.

Furthermore, we're not using LMDB mode anymore (or are moving away from it in the few places where we still have it enabled), so we have neither newer crash data nor any attempts to fix the crashes.

@hyc

hyc commented Dec 1, 2022

> It seems like the LMDB crashes are caused by attempting to load a corrupt database file. No mention was made of other failure modes, and it was indicated that the crashes are an issue present in the upstream LMDB project. The LMDB authors, Symas, have indicated they won't fix it.

That seems to be a bit of a mischaracterization. We have support in LMDB 1.0 (https://github.com/LMDB/lmdb/tree/mdb.master3) for per-page checksums, and will return an error for corrupted pages. Certainly we can't roll this feature out in LMDB 0.9 since it requires a DB on-disk format change (to leave space for storing the checksums). Aside from that though, we were never pointed at anything that could help identify the cause of the corruptions in the first place. With the code coverage and everything else that is tested in the LMDB codebase, there's no indication that LMDB itself mis-wrote any pages. Plus literally millions of hours of reliable use in countless other projects that have never encountered similar issues.

PS: we attempted to build Baloo to investigate, but executables built from source always crashed for us, prior to even touching any LMDB code.
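As an illustration of the per-page checksum idea mentioned above, here is a minimal sketch. It is not LMDB's on-disk format or algorithm: the 4096-byte page size, the 8-byte trailer, the FNV-1a hash, and the names `seal_page` and `verify_page` are all assumptions chosen to keep the example self-contained. It also shows why such a feature needs a format change: every page has to reserve space for the checksum.

```rust
// Assumed layout: a fixed-size page whose last 8 bytes hold a checksum
// of everything before them.
const PAGE_SIZE: usize = 4096;
const CHECKSUM_LEN: usize = 8;

/// 64-bit FNV-1a, used here only because it fits in a few lines.
fn fnv1a64(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in data {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

#[derive(Debug, PartialEq)]
enum PageError {
    WrongSize,
    Corrupted,
}

/// Builds a page from `payload` and writes the checksum into the trailer.
/// Returns `None` if the payload does not leave room for the trailer.
fn seal_page(payload: &[u8]) -> Option<Vec<u8>> {
    if payload.len() > PAGE_SIZE - CHECKSUM_LEN {
        return None;
    }
    let mut page = vec![0u8; PAGE_SIZE];
    page[..payload.len()].copy_from_slice(payload);
    let sum = fnv1a64(&page[..PAGE_SIZE - CHECKSUM_LEN]);
    page[PAGE_SIZE - CHECKSUM_LEN..].copy_from_slice(&sum.to_le_bytes());
    Some(page)
}

/// Returns the page body if the stored checksum matches, or an error
/// instead of handing corrupted bytes to the caller.
fn verify_page(page: &[u8]) -> Result<&[u8], PageError> {
    if page.len() != PAGE_SIZE {
        return Err(PageError::WrongSize);
    }
    let (body, tail) = page.split_at(PAGE_SIZE - CHECKSUM_LEN);
    let mut stored = [0u8; CHECKSUM_LEN];
    stored.copy_from_slice(tail);
    if fnv1a64(body) == u64::from_le_bytes(stored) {
        Ok(body)
    } else {
        Err(PageError::Corrupted)
    }
}
```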

@caer

caer commented Feb 7, 2023

> That seems to be a bit of a mischaracterization. We have support in LMDB 1.0 (https://github.com/LMDB/lmdb/tree/mdb.master3) for per-page checksums, and will return an error for corrupted pages. Certainly we can't roll this feature out in LMDB 0.9 since it requires a DB on-disk format change (to leave space for storing the checksums). Aside from that though, we were never pointed at anything that could help identify the cause of the corruptions in the first place. With the code coverage and everything else that is tested in the LMDB codebase, there's no indication that LMDB itself mis-wrote any pages. Plus literally millions of hours of reliable use in countless other projects that have never encountered similar issues.
>
> PS: we attempted to build Baloo to investigate, but executables built from source always crashed for us, prior to even touching any LMDB code.

Thank you for chiming in, @hyc! I looked into one of the Bugzilla tickets linked above and saw that you had replied there as well, but did not get any tangible feedback from the team (at least, not on that ticket).

I'm now wondering why the original ticket I found indicated that the data corruption is an unrecoverable fault in LMDB, instead of a potential usage error. For example, one of the tickets above mentioned LMDB's max key size may have been violated, leading to the corruption; on a very quick check of lmdb-sys, I found at least some transaction write paths that didn't check or enforce a maximum key size. I'm not an LMDB expert, so maybe @hyc could confirm whether this kind of issue could lead to data corruption.

Regardless, because the different issues both here and on Bugzilla present conflicting views of the situation, it would be great if these could at least be cleared up explicitly as part of this repo's documentation. As a potential user, I'd love to use the Mozilla-maintained rkv over some potentially abandoned or less actively maintained LMDB wrappers in the Rust ecosystem.

@hyc

hyc commented Feb 7, 2023

> For example, one of the tickets above mentioned LMDB's max key size may have been violated, leading to the corruption;

LMDB will always reject attempts to use a too-large key with MDB_BAD_VALSIZE - it is impossible to feed a special input to LMDB that will cause corruption.
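The behaviour described above (reject an oversized key with an error before anything is written) can be sketched as follows. This is an illustration only, not LMDB or lmdb-sys code. The 511-byte limit is an assumption based on what I understand to be the LMDB 0.9 compile-time default (real code should ask the library, e.g. via `mdb_env_get_maxkeysize`), and `put`, `PutError`, and the `BTreeMap` store are hypothetical stand-ins.

```rust
use std::collections::BTreeMap;

// Assumed limit; the real value depends on how LMDB was built.
const MAX_KEY_SIZE: usize = 511;

#[derive(Debug, PartialEq)]
enum PutError {
    /// Mirrors the role of MDB_BAD_VALSIZE: the key size is unsupported.
    BadValSize,
}

/// Validates the key first and only then touches the store, so a
/// rejected key can never leave the store partially modified.
fn put(
    store: &mut BTreeMap<Vec<u8>, Vec<u8>>,
    key: &[u8],
    value: &[u8],
) -> Result<(), PutError> {
    if key.is_empty() || key.len() > MAX_KEY_SIZE {
        return Err(PutError::BadValSize);
    }
    store.insert(key.to_vec(), value.to_vec());
    Ok(())
}
```

With the check inside the write path itself, a wrapper that forgets to validate key sizes gets an error back instead of being able to corrupt anything.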

Remember that LMDB is a single-writer database and serialization is enforced with a simple mutex. As such, it is impossible for writer concurrency to cause any race conditions or other memory corruption issues in LMDB. However, if you violate the 1:1 association between threads and transactions, you can easily corrupt LMDB's data structures. That is apparently what has happened in the Mozilla codebase, though we never got sufficient info to identify the root cause.
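A minimal sketch of the single-writer rule and the thread/transaction association, in plain Rust and without LMDB: `Env`, `WriteTxn`, and `write_from_threads` are hypothetical names, and the `BTreeMap` is an in-memory stand-in for the database with no rollback support. The point is that a write transaction holding a `std::sync::MutexGuard` is not `Send`, so the compiler refuses to move it to another thread, and the mutex serializes writers.

```rust
use std::collections::BTreeMap;
use std::sync::{Mutex, MutexGuard};

type Store = BTreeMap<Vec<u8>, Vec<u8>>;

/// Hypothetical environment: one mutex serializes all writers.
struct Env {
    data: Mutex<Store>,
}

/// A write transaction owns the writer lock for its whole lifetime.
/// `MutexGuard` is `!Send`, so a `WriteTxn` cannot be moved to another
/// thread; it stays on the thread that began it.
struct WriteTxn<'env> {
    guard: MutexGuard<'env, Store>,
}

impl Env {
    fn new() -> Self {
        Env { data: Mutex::new(BTreeMap::new()) }
    }

    /// Blocks until no other write transaction is active.
    fn begin_write(&self) -> WriteTxn<'_> {
        WriteTxn { guard: self.data.lock().expect("writer lock poisoned") }
    }

    fn len(&self) -> usize {
        self.data.lock().expect("writer lock poisoned").len()
    }
}

impl WriteTxn<'_> {
    fn put(&mut self, key: &[u8], value: &[u8]) {
        self.guard.insert(key.to_vec(), value.to_vec());
    }

    /// Consuming `self` drops the guard and releases the writer lock.
    fn commit(self) {}
}

/// Each thread begins, uses, and commits its own transaction.
fn write_from_threads(env: &Env, threads: u8) {
    std::thread::scope(|s| {
        for id in 0..threads {
            s.spawn(move || {
                let mut txn = env.begin_write();
                txn.put(&[id], b"value");
                txn.commit();
            });
        }
    });
}
```

A wrapper over the C API does not get this guarantee for free: it has to encode the rule in its own types, and `unsafe` code can still break it.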

I suggest if you want a well supported rust wrapper, use https://github.com/meilisearch/heed
