create_scheduling_group / scheduling_group_key_create are not safe to run in parallel #2231

Open
piodul opened this issue May 9, 2024 · 1 comment

piodul (Contributor) commented May 9, 2024

I observed that running these two functions in parallel may lead to the data of a scheduling group key being initialized twice.

Those functions work like this:

  • create_scheduling_group

    • Allocate a new SG ID
    • Preemption may happen here
    • On all shards:
      • Initialize storage for the SG
      • Read the current SG key count and store it in the num_keys variable
      • Preemption may happen here
      • Initialize SG keys for the new SG up to num_keys
  • scheduling_group_key_create

    • Allocate a new SG key ID
    • Preemption may happen here
    • On all shards:
      • Initialize the SG key config
      • Preemption may happen here
      • For all current SGs, initialize the new SG key
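
For reference, here is a minimal sketch of how a caller can hit both preemption windows at once: start both operations and only then wait for them. This is untested and only assumes the public API from <seastar/core/scheduling.hh> and <seastar/core/when_all.hh>; per_sg_data and create_sg_and_key_concurrently are hypothetical names.

#include <seastar/core/scheduling.hh>
#include <seastar/core/when_all.hh>

#include <cstdint>

// Hypothetical per-scheduling-group state attached via a scheduling group key.
struct per_sg_data {
    uint64_t counter = 0;
};

// Start both control-path operations without awaiting the first, so the
// preemption points listed above can interleave.
seastar::future<> create_sg_and_key_concurrently() {
    auto cfg = seastar::make_scheduling_group_key_config<per_sg_data>();
    return seastar::when_all_succeed(
            seastar::create_scheduling_group("new-sg", 100),  // allocates an SG ID, then may yield
            seastar::scheduling_group_key_create(cfg)         // allocates a key ID, then may yield
    ).discard_result();
}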

While I am not exactly sure what the sequence of events was in my case, looking at the implementation I think the following can happen:

  • [create_scheduling_group] allocates a new SG ID
  • [scheduling_group_key_create] allocates a new SG key ID
  • [create_scheduling_group] reads the SG key count (which already includes the newly allocated key ID) and stores it in the num_keys variable
  • [scheduling_group_key_create] initializes storage for the new SG key
  • [create_scheduling_group] initializes storage for the new SG
  • [scheduling_group_key_create] initializes the new SG key for all existing SGs
  • [create_scheduling_group] initializes the new SG and all SG keys for it up to num_keys, including the new key, which was already initialized for this SG in the previous step; that key's data therefore gets initialized twice.

There may be other races as well. I think operations like create_scheduling_group and scheduling_group_key_create should be serialized.
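
Until such serialization exists inside seastar, one possible caller-side mitigation is to funnel both operations through a single semaphore. This is only a sketch under assumptions: sg_topology_sem and the *_serialized wrappers are hypothetical names, and because seastar::semaphore is shard-local this only serializes calls issued from the shard that owns it (typically shard 0).

#include <seastar/core/scheduling.hh>
#include <seastar/core/semaphore.hh>

// Hypothetical shard-local guard: one unit covers the whole
// "allocate ID + invoke_on_all" sequence of either operation.
static thread_local seastar::semaphore sg_topology_sem{1};

seastar::future<seastar::scheduling_group>
create_scheduling_group_serialized(seastar::sstring name, float shares) {
    return seastar::with_semaphore(sg_topology_sem, 1, [name, shares] {
        return seastar::create_scheduling_group(name, shares);
    });
}

seastar::future<seastar::scheduling_group_key>
scheduling_group_key_create_serialized(seastar::scheduling_group_key_config cfg) {
    return seastar::with_semaphore(sg_topology_sem, 1, [cfg] {
        return seastar::scheduling_group_key_create(cfg);
    });
}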

Relevant code:

seastar/src/core/reactor.cc, lines 5016 to 5030 in d3657ec:

future<scheduling_group>
create_scheduling_group(sstring name, sstring shortname, float shares) noexcept {
    auto aid = allocate_scheduling_group_id();
    if (aid < 0) {
        return make_exception_future<scheduling_group>(std::runtime_error(fmt::format("Scheduling group limit exceeded while creating {}", name)));
    }
    auto id = static_cast<unsigned>(aid);
    assert(id < max_scheduling_groups());
    auto sg = scheduling_group(id);
    return smp::invoke_on_all([sg, name, shortname, shares] {
        return engine().init_scheduling_group(sg, name, shortname, shares);
    }).then([sg] {
        return make_ready_future<scheduling_group>(sg);
    });
}

seastar/src/core/reactor.cc, lines 4859 to 4926 in d3657ec:

reactor::allocate_scheduling_group_specific_data(scheduling_group sg, scheduling_group_key key) {
    auto& sg_data = _scheduling_group_specific_data;
    auto& this_sg = sg_data.per_scheduling_group_data[sg._id];
    this_sg.specific_vals.resize(std::max<size_t>(this_sg.specific_vals.size(), key.id()+1));
    this_sg.specific_vals[key.id()] =
        aligned_alloc(sg_data.scheduling_group_key_configs[key.id()].alignment,
                sg_data.scheduling_group_key_configs[key.id()].allocation_size);
    if (!this_sg.specific_vals[key.id()]) {
        std::abort();
    }
    if (sg_data.scheduling_group_key_configs[key.id()].constructor) {
        sg_data.scheduling_group_key_configs[key.id()].constructor(this_sg.specific_vals[key.id()]);
    }
}

future<>
reactor::rename_scheduling_group_specific_data(scheduling_group sg) {
    return with_scheduling_group(sg, [this, sg] {
        auto& sg_data = _scheduling_group_specific_data;
        auto& this_sg = sg_data.per_scheduling_group_data[sg._id];
        for (size_t i = 0; i < sg_data.scheduling_group_key_configs.size(); ++i) {
            auto& c = sg_data.scheduling_group_key_configs[i];
            if (c.rename) {
                (c.rename)(this_sg.specific_vals[i]);
            }
        }
    });
}

future<>
reactor::init_scheduling_group(seastar::scheduling_group sg, sstring name, sstring shortname, float shares) {
    auto& sg_data = _scheduling_group_specific_data;
    auto& this_sg = sg_data.per_scheduling_group_data[sg._id];
    this_sg.queue_is_initialized = true;
    _task_queues.resize(std::max<size_t>(_task_queues.size(), sg._id + 1));
    _task_queues[sg._id] = std::make_unique<task_queue>(sg._id, name, shortname, shares);
    unsigned long num_keys = s_next_scheduling_group_specific_key.load(std::memory_order_relaxed);
    return with_scheduling_group(sg, [this, num_keys, sg] () {
        for (unsigned long key_id = 0; key_id < num_keys; key_id++) {
            allocate_scheduling_group_specific_data(sg, scheduling_group_key(key_id));
        }
    });
}

future<>
reactor::init_new_scheduling_group_key(scheduling_group_key key, scheduling_group_key_config cfg) {
    auto& sg_data = _scheduling_group_specific_data;
    sg_data.scheduling_group_key_configs.resize(std::max<size_t>(sg_data.scheduling_group_key_configs.size(), key.id() + 1));
    sg_data.scheduling_group_key_configs[key.id()] = cfg;
    return parallel_for_each(_task_queues, [this, cfg, key] (std::unique_ptr<task_queue>& tq) {
        if (tq) {
            scheduling_group sg = scheduling_group(tq->_id);
            if (tq.get() == _at_destroy_tasks) {
                // fake the group by assuming it here
                auto curr = current_scheduling_group();
                auto cleanup = defer([curr] () noexcept { *internal::current_scheduling_group_ptr() = curr; });
                *internal::current_scheduling_group_ptr() = sg;
                allocate_scheduling_group_specific_data(sg, key);
            } else {
                return with_scheduling_group(sg, [this, key, sg] () {
                    allocate_scheduling_group_specific_data(sg, key);
                });
            }
        }
        return make_ready_future();
    });
}
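
To spell out the consequence of the interleaving: allocate_scheduling_group_specific_data() above unconditionally aligned_alloc()s a fresh buffer and runs the key's constructor, so if it runs twice for the same (SG, key) slot, the first buffer is leaked and the constructor's side effects happen twice. A hypothetical key type with an observable constructor makes that visible:

#include <atomic>

// Hypothetical key type, for illustration only: its constructor has a
// visible side effect, so double initialization of the same (SG, key)
// slot shows up as an extra increment.
struct counted_key {
    static inline std::atomic<unsigned> constructions{0};
    counted_key() {
        constructions.fetch_add(1, std::memory_order_relaxed);
    }
};

Under the race described above, constructions ends up higher than the number of (SG, key) pairs that should exist, which is consistent with the double initialization reported at the top of the issue.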

piodul added a commit to piodul/scylla that referenced this issue May 9, 2024
Due to scylladb/seastar#2231, creating a scheduling group and a
scheduling group key is not safe to do in parallel. The service level
code may attempt to create scheduling groups while
the cql_transport::cql_sg_stats scheduling group key is being created.

Until the seastar issue is fixed, move initialization of the cql sg
states before service level initialization.

Refs: scylladb/seastar#2231
piodul self-assigned this May 9, 2024
avikivity (Member) commented

Makes sense. These are not data plane operations, so they can safely be serialized.

denesb pushed a commit to scylladb/scylladb that referenced this issue May 10, 2024
Due to scylladb/seastar#2231, creating a scheduling group and a
scheduling group key is not safe to do in parallel. The service level
code may attempt to create scheduling groups while
the cql_transport::cql_sg_stats scheduling group key is being created.

Until the seastar issue is fixed, move initialization of the cql sg
states before service level initialization.

Refs: scylladb/seastar#2231

Closes #18581