Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explore Miri and Loom for hardening of lock-free code #121

Open
elBoberido opened this issue Feb 18, 2024 · 5 comments
Open

Explore Miri and Loom for hardening of lock-free code #121

elBoberido opened this issue Feb 18, 2024 · 5 comments

Comments

@elBoberido
Copy link
Member

Brief feature description

It's hard to get lock-free code right and one can easily introduce races. Miri and Loom can help to find those races

Detailed information

Miri and Loom can help to harden to lock-free structures. We need to check the limits on these tools and how to integrate them into the CI.

Here is a small example of a FIFO with a wrong memory ordering on line 98: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=7e42e80e68c48de09c7615422eea3c00

@elBoberido elBoberido added this to the lock-free hardening milestone Feb 18, 2024
@hippiehunter
Copy link

hippiehunter commented Feb 21, 2024

I tried building a proof of concept for using loom in the lock-free area. I ran into some issues and figured i might share what I learned here.

  • thread::scope doesn't exist in loom directly. There is a PR to the loom repo that adds it.
  • loom::sync::barrier exists only to allow code to compile it throws at runtime and using std::sync::barrier results in deadlocks for the current tests. These are more or less expected because loom is an intentionally unfair scheduler. I haven't spent enough time to figure out how to make the mpmc tests run properly without the barriers.
  • FixedSizeContainer for non primitive types segfaults during construction and I don't know enough to determine if its a loom problem, a FixedSizeContainer problem, or a problem with how i adapted the code to use loom::cell::UnsafeCell.
  • UnsafeCell The key change as I understand it revolves around not using * or .get() on the UnsafeCell but instead, refactoring to use with and with_mut depending on if you need read or write access to the contents. The required changes weren't bad in the lock-free area but I didn't do any analysis on what it would mean to the rest of the codebase.

@elBoberido
Copy link
Member Author

@hippiehunter thanks for looking into this. Do you have experience with loom and know it's limitations?

I've only some experience with Miri and it seems to work quite good in general but there are also cases where it either generates false positives or I do not understand the underlying problem.

@hippiehunter
Copy link

I don't have any practical experience with loom, it's been on my todo list for a while and this looked like a good place to start. My understanding of its limitations only comes from what I've read in their documentation/issues and random blogs.

I tried running miri on the same test and it seems to be failing for FixedSizeContainer<TestType, CAPACITY> in the same place that loom fails. You might have a better understanding of this error message.

error: Undefined Behavior: trying to retag from <wildcard> for SharedReadWrite permission at alloc35928[0x50], but no exposed tags have suitable permission in the borrow stack for this location
   --> /mnt/d/repos/iceoryx2/iceoryx2-bb/lock-free/src/mpmc/unique_index_set.rs:429:19
    |
429 |             &mut *(*self.data_ptr.as_ptr().offset(index as isize)).get()
    |                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |                   |
    |                   trying to retag from <wildcard> for SharedReadWrite permission at alloc35928[0x50], but no exposed tags have suitable permission in the borrow stack for this location
    |                   this error occurs as part of retag at alloc35928[0x50..0x54]
    |
    = help: this indicates a potential bug in the program: it performed an invalid operation, but the Stacked Borrows rules it violated are still experimental
    = help: see https://github.com/rust-lang/unsafe-code-guidelines/blob/master/wip/stacked-borrows.md for further information
    = note: BACKTRACE:
    = note: inside `iceoryx2_bb_lock_free::mpmc::unique_index_set::UniqueIndexSet::get_next_free_index` at /mnt/d/repos/iceoryx2/iceoryx2-bb/lock-free/src/mpmc/unique_index_set.rs:429:19: 429:67
    = note: inside `iceoryx2_bb_lock_free::mpmc::unique_index_set::UniqueIndexSet::acquire_raw_index` at /mnt/d/repos/iceoryx2/iceoryx2-bb/lock-free/src/mpmc/unique_index_set.rs:358:29: 358:63
    = note: inside `iceoryx2_bb_lock_free::mpmc::unique_index_set::UniqueIndexSet::acquire_with_additional_cleanup::<'_, {closure@iceoryx2_bb_lock_free::mpmc::container::Container<TestType>::add::{closure#0}}>` at /mnt/d/repos/iceoryx2/iceoryx2-bb/lock-free/src/mpmc/unique_index_set.rs:301:18: 301:42
    = note: inside `iceoryx2_bb_lock_free::mpmc::container::Container::<TestType>::add` at /mnt/d/repos/iceoryx2/iceoryx2-bb/lock-free/src/mpmc/container.rs:281:15: 281:75
    = note: inside `iceoryx2_bb_lock_free::mpmc::container::FixedSizeContainer::<TestType, 129>::add` at /mnt/d/repos/iceoryx2/iceoryx2-bb/lock-free/src/mpmc/container.rs:510:18: 510:43
note: inside `mpmc_container::mpmc_container_add_and_remove_elements_works::<TestType>`
   --> iceoryx2-bb/lock-free/tests/mpmc_container_tests.rs:91:25
    |
91  |             let index = sut.add((i * 3 + 1).into());

interestingly while trying to feel this error out, I tried switching to RelocatablePointer<UnsafeCell<AtomicU32>>instead of RelocatablePointer<UnsafeCell<u32>> for UniqueIndexSet.data_ptr. Then I changed the helpers in UniqueIndexSet like this

fn get_next_free_index(&self, index: u32) -> u32 {
        unsafe {
            (*(*self.data_ptr.as_ptr().offset(index as isize)).get()).load(Ordering::Relaxed)
        }
    }
    #[allow(clippy::mut_from_ref)]
    fn set_next_free_index(&self, index: u32, value: u32) {
        #[deny(clippy::mut_from_ref)]
        unsafe {
            (*(*self.data_ptr.as_ptr().offset(index as isize)).get()).store(value, Ordering::Relaxed);
        }
    }

As well as replacing the helper usages where appropriate. This made the tests pass under miri but I don't have any confidence in my ability to reason about Ordering::Relaxed so I fear I might be masking something. Anyway, sorry if I'm just generating noise.

@elfenpiff
Copy link
Contributor

@hippiehunter Wow, that's awesome. I already have my full focus exactly on this lock-free thing. I have a fixed some parts of your bug-report on the branch: iox2-129-fix-missing-connections and wrote some tests that at least very reliable crash for now.

I will try to incorporate your suggestions and let's see if we can get this thing running.

@elBoberido
Copy link
Member Author

Right, UniqueIndexSet has a memory order issue which I need to fix (#120)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants