Deadlock when using nested tasks + semaphore #506
A slight variation on the above test makes it a bit tidier but still deadlocks:

TEST_CASE("Semaphore.Deadlock")
{
  using namespace std::chrono_literals;
  for (size_t i = 0; i < 10000; ++i)
  {
    tf::CriticalSection critical(1);
    auto lazy_data = [&](tf::Subflow& rt)
    {
      auto task = rt.emplace([&](tf::Subflow& rt)
      {
        // (A) If this line is commented out then the test passes
        rt.emplace([&] { });
        rt.join();
      });
      // (B) If this line is commented out then the test passes
      critical.add(task);
      rt.join();
    };
    tf::Taskflow flow;
    for (size_t k = 0; k < 16; ++k)
    {
      auto task = flow.emplace(
        [&](tf::Subflow& rt)
        {
          lazy_data(rt);
        });
    }
    std::cout << i << std::endl;
    // If the following line fails, you are experiencing deadlock
    REQUIRE(executor.run(flow).wait_for(5s) == std::future_status::ready);
  }
}

With both (A) and (B) present the test deadlocks; commenting out either one makes it pass.
The pull request is just for the failing test; no solution is provided. (taskflow/unittests/test_semaphores.cpp, lines 213 to 255 in 9395abf)
The problem happens because [...] In practice, we do not recommend using [...]
How, then, can one implement critical sections within recursive subflows? Mutexes cannot be used because of the potential to deadlock the thread pool.
I think in our particular case we can redesign our functionality to start only one initializing task instead of multiple tasks; the rest of the workers will simply sit in corun_until until initialization is complete. This protects us from deadlocks, because there is no way to run the initializing task recursively when there is only one such task. But I am still puzzled: in which sense are they not thread-safe? They are constructed in a single thread, and after that we should be able to use them from any number of workers, no? Or is there some limitation that prevents us from enqueueing tasks associated with mutexes concurrently?
Sorry for the confusion. For [...]
@tsung-wei-huang Thanks, now I understand.
@tsung-wei-huang Sorry, I looked closer into the sources and I think I don't understand where this race condition is.
Update: it seems I was wrong; we still need to use some kind of tf::CriticalSection to implement lazy initialization reliably.
Let me show the full source code and unit test for our Lazy class; @tsung-wei-huang can then see more clearly what we are trying to achieve:
// Copyright 2024 Moduleworks gmbh
#pragma once
#include <variant>
#pragma warning(push)
#pragma warning(disable : 4324 4456 5046)
#include <atomic>
#include <taskflow/taskflow.hpp>
#pragma warning(pop)
namespace mw::tf
{
enum class LazyProtocol
{
/// Evaluate and store the result lazily
Lazy,
/// Evaluate and store the result eagerly
Eager,
/// Never store the result. Recalc every time
NoCache
};
/// Provides lazy initialization of a constant under taskflow.
/// It uses taskflow semaphores to control access to the critical
/// section rather than std::mutex and std::scoped_lock which
/// may deadlock or cause poor performance for the taskflow
/// scheduler.
template <class T, LazyProtocol TProtocol = LazyProtocol::Lazy>
class Lazy
{
/// The internal implementation that all instances share
struct LazyImpl
{
/// The function used to generate the result
const std::function<T()> m_fn;
/// Critical section control
::tf::Semaphore m_semaphore;
/// The cached result
std::variant<std::monostate, T, std::exception_ptr> m_data = std::monostate{};
/// Atomic flag to declare the result is ready
std::atomic<bool> m_has_value{false};
LazyImpl(std::function<T()> f) : m_fn(f), m_semaphore(1), m_data(std::monostate{})
{
if constexpr (TProtocol == LazyProtocol::Eager)
{
Update();
CheckHeldException();
}
}
LazyImpl(LazyImpl const&) = delete;
LazyImpl(LazyImpl&&) = delete;
private:
void CheckHeldException()
{
if (std::holds_alternative<std::exception_ptr>(m_data))
std::rethrow_exception(std::get<std::exception_ptr>(m_data));
}
void Update()
{
if (!m_has_value.load(std::memory_order_acquire))
{
try
{
m_data = m_fn();
}
catch (...)
{
m_data = std::current_exception();
}
m_has_value.store(true, std::memory_order_release);
}
}
public:
T* get()
{
if constexpr (TProtocol == LazyProtocol::NoCache)
{
// Recalculate on every call; go through m_data so a stable T* can be returned
m_data = m_fn();
return &std::get<T>(m_data);
}
else
{
if constexpr (TProtocol == LazyProtocol::Lazy)
{
// If we already have a value, return it
if (!m_has_value.load(std::memory_order_acquire))
{
// Otherwise, calculate it in a task so that don't block
// the current worker thread.
::tf::Taskflow taskflow;
const TaskContextSaver tcSaver;
auto task = taskflow.emplace(
[this, &tcSaver]()
{
// Set FP and thread priority
TaskContextApplier tcApplier(tcSaver);
// Initialize value
Update();
});
task.acquire(m_semaphore);
task.release(m_semaphore);
mw::tf::Schedule(taskflow);
}
}
// At this point we hold either an exception or a value
CheckHeldException();
return &std::get<T>(m_data);
}
}
};
public:
/// Pass a nullary (factory) function to be evaluated later.
/// @param f nullary (factory) function to generate the value. Will be called only once
template <typename Function>
requires std::is_invocable_r_v<T, Function>
Lazy(Function f)
: m_impl(std::make_shared<LazyImpl>(f))
{
/// Returning a raw pointer here is bad behaviour
/// as it is not clear at all who owns the value
/// and who is responsible for deleting it.
static_assert(!std::is_pointer_v<T>, "Factory function should not return a raw pointer");
}
/// Get the value. May suspend the current task and schedule another while lazy value is already
/// being calculated from another location
T const& operator*() const { return *m_impl->get(); }
/// Get the value. May suspend the current task and schedule another while lazy value is already
/// being calculated from another location
T const* operator->() const { return m_impl->get(); }
/// Returns true if the result has been calculated
operator bool() const { return m_impl->m_has_value.load(std::memory_order_acquire); }
private:
std::shared_ptr<LazyImpl> m_impl;
};
} // namespace mw::tf

And here is the original test case showing the deadlock:

TEST(Taskflow, Lazy)
{
using namespace std::chrono_literals;
tf::Executor executor(8); // create an executor of 8 workers
/// proof that initialization only occurs once
std::atomic<int> count0 = 0;
std::atomic<int> count1 = 0;
mw::tf::Lazy<int> data(
[&]()
{
count0++;
return 99;
});
auto job = [&]()
{
EXPECT_EQ(*data, 99);
tf::Taskflow taskflow;
taskflow
.for_each_index(
0,
100,
1,
[&]([[maybe_unused]] int i)
{
EXPECT_EQ(*data, 99);
std::this_thread::sleep_for(5ms);
count1++;
},
::tf::StaticPartitioner(1))
.name("loop");
executor.corun(taskflow);
};
tf::Taskflow taskflow;
taskflow.emplace(
[&]()
{
tf::Taskflow taskflow;
std::vector<tf::Task> tasks{
taskflow.emplace(job).name("job 1"),
taskflow.emplace(job).name("job 2"),
taskflow.emplace(job).name("job 3"),
taskflow.emplace(job).name("job 4"),
taskflow.emplace(job).name("job 5")};
executor.corun(taskflow);
EXPECT_EQ(count0.load(), 1);
EXPECT_EQ(count1.load(), 500);
EXPECT_EQ(*data, 99);
});
executor.run(taskflow).wait();
}
@olologin, in your examples you will have multiple subflows that run simultaneously, while each of them has a task that acquires the semaphore/critical section. Basically, I will be happy to do a conference call if that works better for you :) feel free to reach out at [email protected]
I'm not sure of the point you are trying to make. We can assume you are correct that there is a race condition with tf::Semaphore and nested subflows. However, I don't particularly care which mechanism we use, as long as it works. The requirement is for lazily cached data where the factory function that generates the data uses nested subflows. We have found that this does not work with std::scoped_lock, because you can lock the mutex twice on the same thread. Using recursive_mutex is not the answer either, as you still end up with two tasks inside the critical section. tf::Semaphore, with a task launched to perform the factory operation, seems to have the right semantics, but as you state there is a design problem with tf::Semaphore that prevents what we are trying to do. In summary: can you suggest a way forward here?
I am thinking to redesign the current semaphore to support more dynamic and agile usage. For instance:

tf::Semaphore2 semaphore2;
taskflow.emplace([&](tf::Runtime& rt){
  // do something
  // ...
  tf.acquire(semaphore2);
  // do another thing
  // ...
});
taskflow.emplace([&](tf::Runtime& rt){
  // do something
  // ...
  tf.release(semaphore2);
  // do another thing
  // ...
});

Instead of describing which task acquires/releases a semaphore through the tf::Task handle, the two new methods tf::Runtime::acquire and tf::Runtime::release allow the current task to acquire and release any semaphore while interacting with the current scheduler immediately. Thoughts?
That seems quite nice, because it also solves the problem people have been asking about: how to suspend a task.
What happens if all tasks are waiting on a semaphore? Would it be possible to signal that semaphore from outside taskflow, i.e. from another thread outside the pool? Then taskflow could interact with IO, yes?
As I understand it, this will also make semaphores thread-safe? I guess this would resolve our current usage scenario; it would be great to have this.
@olologin yes, this will need to be thread-safe.
Suppose two tasks (say task A and task B) try to acquire the same semaphore, and A gets the semaphore first and runs for a long time. Currently, task B may occupy a thread until task A releases the semaphore (or even later). If we had a coroutine pool to replace the thread pool, that would solve this problem, right?
I would guess that tf.acquire(...) does not call a system lock and block the thread. It yields to the scheduler, which can then pick another ready-to-run task.
@bradphelan Could you please give some more details of the design you mentioned? A task that uses yield never gives up its occupation of a thread; the workload only releases the time slice, not the thread. It will also keep trying to run on the CPU and waste runtime (how often depends on the scheduler implementation, I think). The worst case is that all workers are occupied by such tasks. So I think the only way to solve it (i.e., to implement an embedded semaphore like this) is to use coroutines.
It's part of the current design of taskflow. If a task calls Executor::corun, the task yields to the scheduler, and corun doesn't return until the scheduler decides that the task is ready to run again. I'm not sure what you mean.
There is no time slicing. It is not a real-time scheduler and not a pre-emptive scheduler. Tasks run until they finish or yield. (Note: I'm using the term "yield"; I'm not sure this is the term @tsung-wei-huang would use. Please correct my terminology if this is confusing.) In some ways taskflow is a bit like using coroutines, but not quite. Each task uses the stack of the thread it runs on. This can be quite confusing in the debugger: tasks that do not seem to be recursive can appear that way, because multiple tasks can share the same stack.
Hi @bradphelan, thank you for the information. I'm talking about this possible semaphore design:

tf::Semaphore2 semaphore2;
taskflow.emplace([&](tf::Runtime& rt){
  // do something
  // ...
  tf.acquire(semaphore2);
  // do another thing
  // ...
});
taskflow.emplace([&](tf::Runtime& rt){
  // do something
  // ...
  tf.release(semaphore2);
  // do another thing
  // ...
});

In my opinion, acquire/release may stall the worker thread if we keep the current thread-pool model.
@VincentXWD tf.acquire() could be implemented the same way as corun. If the semaphore is free, acquire simply returns. If the semaphore is not free, the scheduler takes the next ready task and runs it. When that task is finished, the semaphore is checked again; if it is free, the scheduler simply returns, and if not, the process repeats.
One of our guys debugged this sample in detail, and it seems we were all wrong. It is not a data race in the semaphore vector, and it is not going to be solved by a semaphore reimplementation. What is actually happening: if we debug this, we see that C0 is properly executed, but we get into a deadlock because of the stack trace we see in one of our workers. [...] It seems we are using the semaphore/mutex incorrectly here, but there is currently no good way to reimplement lazy_init without blocking all but one thread.
Btw, this is how it looks in Intel TBB:
We will try to implement task isolation soon, because one of our developers is interested in implementing it, and we will propose a PR if we make it work :) The idea so far is to create a separate API class, TaskArena, which will own a separate TaskQueue. Pseudo C++ code of the additions to the API (this is not final):
Example of usage:
I think this is a genius design! Perhaps a TaskArena could be created from an executor with either unique or shared ownership? In that case, the TaskArena can be initialized with information available in the executor (e.g., the number of workers).
Describe the bug
With some low probability, the following unit test deadlocks instead of completing.
It seems tf somehow "forgets" to execute the most deeply nested task, which is supposed to do the actual initialization of the lazy value, and the rest of the threads are either waiting for a task or simply stuck.
To Reproduce
Just paste the following snippet into any unit-test file, build, and run it; the issue normally shows up around iteration 200-400.
Desktop (please complete the following information):
taskflow 3.6, Windows 10, VS2019, x64 build, Ryzen 3800X CPU.
Additional context
No other information available.