create_scheduling_group / scheduling_group_key_create are not safe to run in parallel #2231

Open
piodul opened this issue May 9, 2024 · 1 comment

piodul (Contributor) commented May 9, 2024

I observed that running these two functions in parallel may lead to the data of a scheduling group key being initialized twice.

Those functions work like this:

  • create_scheduling_group

    • Allocate a new SG ID
    • Preemption may happen here
    • On all shards:
      • Initialize storage for the SG
      • Read the current SG key count and store it in the num_keys variable
      • Preemption may happen here
      • Initialize SG keys for the new SG up to num_keys
  • scheduling_group_key_create

    • Allocate a new SG key ID
    • Preemption may happen here
    • On all shards:
      • Initialize the SG key config
      • Preemption may happen here
      • For all current SGs, initialize the new SG key
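
For reference, here is a minimal sketch of how a caller can hit both preemption windows at once: start both operations and only then wait for them. This is untested and only assumes the public API from <seastar/core/scheduling.hh> and <seastar/core/when_all.hh>; per_sg_data and create_sg_and_key_concurrently are hypothetical names.

#include <seastar/core/scheduling.hh>
#include <seastar/core/when_all.hh>

#include <cstdint>

// Hypothetical per-scheduling-group state attached via a scheduling group key.
struct per_sg_data {
    uint64_t counter = 0;
};

// Start both control-path operations without awaiting the first, so the
// preemption points listed above can interleave.
seastar::future<> create_sg_and_key_concurrently() {
    auto cfg = seastar::make_scheduling_group_key_config<per_sg_data>();
    return seastar::when_all_succeed(
            seastar::create_scheduling_group("new-sg", 100),  // allocates an SG ID, then may yield
            seastar::scheduling_group_key_create(cfg)         // allocates a key ID, then may yield
    ).discard_result();
}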

While I am not exactly sure what the sequence of events was in my case, looking at the implementation I think the following can happen:

  • [create_scheduling_group] allocates a new SG ID
  • [scheduling_group_key_create] allocates a new SG key ID
  • [create_scheduling_group] reads the SG key count (which already includes the newly allocated key ID) and stores it in the num_keys variable
  • [scheduling_group_key_create] initializes storage for the new SG key
  • [create_scheduling_group] initializes storage for the new SG
  • [scheduling_group_key_create] initializes the new SG key for all existing SGs
  • [create_scheduling_group] initializes the new SG and all SG keys for it up to num_keys, including the new key, which was already initialized for this SG in the previous step; that key's data therefore gets initialized twice.

There may be other races as well. I think operations like create_scheduling_group and scheduling_group_key_create should be serialized.
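
Until such serialization exists inside seastar, one possible caller-side mitigation is to funnel both operations through a single semaphore. This is only a sketch under assumptions: sg_topology_sem and the *_serialized wrappers are hypothetical names, and because seastar::semaphore is shard-local this only serializes calls issued from the shard that owns it (typically shard 0).

#include <seastar/core/scheduling.hh>
#include <seastar/core/semaphore.hh>

// Hypothetical shard-local guard: one unit covers the whole
// "allocate ID + invoke_on_all" sequence of either operation.
static thread_local seastar::semaphore sg_topology_sem{1};

seastar::future<seastar::scheduling_group>
create_scheduling_group_serialized(seastar::sstring name, float shares) {
    return seastar::with_semaphore(sg_topology_sem, 1, [name, shares] {
        return seastar::create_scheduling_group(name, shares);
    });
}

seastar::future<seastar::scheduling_group_key>
scheduling_group_key_create_serialized(seastar::scheduling_group_key_config cfg) {
    return seastar::with_semaphore(sg_topology_sem, 1, [cfg] {
        return seastar::scheduling_group_key_create(cfg);
    });
}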

Relevant code:

seastar/src/core/reactor.cc, lines 5016 to 5030 in d3657ec:

future<scheduling_group>
create_scheduling_group(sstring name, sstring shortname, float shares) noexcept {
    auto aid = allocate_scheduling_group_id();
    if (aid < 0) {
        return make_exception_future<scheduling_group>(std::runtime_error(fmt::format("Scheduling group limit exceeded while creating {}", name)));
    }
    auto id = static_cast<unsigned>(aid);
    assert(id < max_scheduling_groups());
    auto sg = scheduling_group(id);
    return smp::invoke_on_all([sg, name, shortname, shares] {
        return engine().init_scheduling_group(sg, name, shortname, shares);
    }).then([sg] {
        return make_ready_future<scheduling_group>(sg);
    });
}

seastar/src/core/reactor.cc, lines 4859 to 4926 in d3657ec:

reactor::allocate_scheduling_group_specific_data(scheduling_group sg, scheduling_group_key key) {
    auto& sg_data = _scheduling_group_specific_data;
    auto& this_sg = sg_data.per_scheduling_group_data[sg._id];
    this_sg.specific_vals.resize(std::max<size_t>(this_sg.specific_vals.size(), key.id()+1));
    this_sg.specific_vals[key.id()] =
        aligned_alloc(sg_data.scheduling_group_key_configs[key.id()].alignment,
                sg_data.scheduling_group_key_configs[key.id()].allocation_size);
    if (!this_sg.specific_vals[key.id()]) {
        std::abort();
    }
    if (sg_data.scheduling_group_key_configs[key.id()].constructor) {
        sg_data.scheduling_group_key_configs[key.id()].constructor(this_sg.specific_vals[key.id()]);
    }
}

future<>
reactor::rename_scheduling_group_specific_data(scheduling_group sg) {
    return with_scheduling_group(sg, [this, sg] {
        auto& sg_data = _scheduling_group_specific_data;
        auto& this_sg = sg_data.per_scheduling_group_data[sg._id];
        for (size_t i = 0; i < sg_data.scheduling_group_key_configs.size(); ++i) {
            auto& c = sg_data.scheduling_group_key_configs[i];
            if (c.rename) {
                (c.rename)(this_sg.specific_vals[i]);
            }
        }
    });
}

future<>
reactor::init_scheduling_group(seastar::scheduling_group sg, sstring name, sstring shortname, float shares) {
    auto& sg_data = _scheduling_group_specific_data;
    auto& this_sg = sg_data.per_scheduling_group_data[sg._id];
    this_sg.queue_is_initialized = true;
    _task_queues.resize(std::max<size_t>(_task_queues.size(), sg._id + 1));
    _task_queues[sg._id] = std::make_unique<task_queue>(sg._id, name, shortname, shares);
    unsigned long num_keys = s_next_scheduling_group_specific_key.load(std::memory_order_relaxed);
    return with_scheduling_group(sg, [this, num_keys, sg] () {
        for (unsigned long key_id = 0; key_id < num_keys; key_id++) {
            allocate_scheduling_group_specific_data(sg, scheduling_group_key(key_id));
        }
    });
}

future<>
reactor::init_new_scheduling_group_key(scheduling_group_key key, scheduling_group_key_config cfg) {
    auto& sg_data = _scheduling_group_specific_data;
    sg_data.scheduling_group_key_configs.resize(std::max<size_t>(sg_data.scheduling_group_key_configs.size(), key.id() + 1));
    sg_data.scheduling_group_key_configs[key.id()] = cfg;
    return parallel_for_each(_task_queues, [this, cfg, key] (std::unique_ptr<task_queue>& tq) {
        if (tq) {
            scheduling_group sg = scheduling_group(tq->_id);
            if (tq.get() == _at_destroy_tasks) {
                // fake the group by assuming it here
                auto curr = current_scheduling_group();
                auto cleanup = defer([curr] () noexcept { *internal::current_scheduling_group_ptr() = curr; });
                *internal::current_scheduling_group_ptr() = sg;
                allocate_scheduling_group_specific_data(sg, key);
            } else {
                return with_scheduling_group(sg, [this, key, sg] () {
                    allocate_scheduling_group_specific_data(sg, key);
                });
            }
        }
        return make_ready_future();
    });
}
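
To spell out the consequence of the interleaving: allocate_scheduling_group_specific_data() above unconditionally aligned_alloc()s a fresh buffer and runs the key's constructor, so if it runs twice for the same (SG, key) slot, the first buffer is leaked and the constructor's side effects happen twice. A hypothetical key type with an observable constructor makes that visible:

#include <atomic>

// Hypothetical key type, for illustration only: its constructor has a
// visible side effect, so double initialization of the same (SG, key)
// slot shows up as an extra increment.
struct counted_key {
    static inline std::atomic<unsigned> constructions{0};
    counted_key() {
        constructions.fetch_add(1, std::memory_order_relaxed);
    }
};

Under the race described above, constructions ends up higher than the number of (SG, key) pairs that should exist, which is consistent with the double initialization reported at the top of the issue.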

piodul added a commit to piodul/scylla that referenced this issue May 9, 2024
Due to scylladb/seastar#2231, creating a scheduling group and a
scheduling group key is not safe to do in parallel. The service level
code may attempt to create scheduling groups while
the cql_transport::cql_sg_stats scheduling group key is being created.

Until the seastar issue is fixed, move initialization of the cql sg
states before service level initialization.

Refs: scylladb/seastar#2231
piodul self-assigned this May 9, 2024
avikivity (Member) commented

Makes sense. These are not data plane operations, so they can safely be serialized.

denesb pushed a commit to scylladb/scylladb that referenced this issue May 10, 2024
Due to scylladb/seastar#2231, creating a scheduling group and a
scheduling group key is not safe to do in parallel. The service level
code may attempt to create scheduling groups while
the cql_transport::cql_sg_stats scheduling group key is being created.

Until the seastar issue is fixed, move initialization of the cql sg
states before service level initialization.

Refs: scylladb/seastar#2231

Closes #18581