alternator: keep TTL work in the maintenance scheduling group #18729
base: master
Conversation
@nyh please review. I think this is all that is needed to make scans executed on behalf of alternator TTL keep their streaming scheduling group on the remote node. I want to add a test for this but don't know how to proceed. We can maybe use the infrastructure I introduced in #18705 but I don't even know how to trigger alternator TTL to scan a table.
🟢 CI State: SUCCESS ✅ - Build Details:
Thanks for fixing this! I can't say I fully understand the patch - the patch is trivial, but it doesn't seem to actually "do" anything; there has to be other code (which I'm not familiar with) which makes sure that if the new "tenant" exists, it is actually used in the right way. I'll try to write a test for this myself. I know how to trigger alternator TTL (we have tests for that already), but:
By the way, I wonder if this fix only affects Alternator TTL. What about repair? For example, repair with materialized views has a background process of "view building", which writes view updates to remote nodes. Which scheduling group did those remote writes use before, and now? Or maybe writes are a different story from reads anyway?
By the way, @denesb, it would have been really nice to have a document in docs/dev which explains how the whole scheduling-inheritance-via-RPC business works. In the past I started docs/dev/isolation.md, but I didn't know how to explain RPC so I left a TODO, referring to commit 8c993e0 which by now is either obsolete or only part of the story.
I'm working on this now.
scfg.statement_tenants = {
    {dbcfg.statement_scheduling_group, "$user"},
    {default_scheduling_group(), "$system"},
    {dbcfg.streaming_scheduling_group, "$maintenance"}
};
What happens on upgrade?
I expect it should just work - an old server node will not understand $maintenance and will direct it to the statement group.
Yes, an old server will just fall back to the default tenant, which is the pre-patch behaviour.
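To make the upgrade story concrete, here is a minimal, self-contained sketch of that fallback idea. Everything in it is hypothetical illustration - resolve_tenant, scheduling_group_stub and the map layout are made up for this example, not Scylla's actual messaging_service code:

#include <map>
#include <string>

// Stand-in for a scheduling group; illustration only.
struct scheduling_group_stub { std::string name; };

// The receiving node maps the tenant name carried by the RPC connection
// to one of its local scheduling groups. An old node has no entry for
// "$maintenance", so the lookup falls back to the default tenant.
scheduling_group_stub resolve_tenant(
        const std::map<std::string, scheduling_group_stub>& known_tenants,
        const std::string& requested) {
    if (auto it = known_tenants.find(requested); it != known_tenants.end()) {
        return it->second;
    }
    return known_tenants.at("$user"); // pre-patch behaviour for unknown tenants
}

int main() {
    std::map<std::string, scheduling_group_stub> old_node_tenants{
        {"$user", {"statement"}},
        {"$system", {"default"}},
        // an upgraded node would also have {"$maintenance", {"streaming"}}
    };
    // On the old node, the maintenance work lands in the statement group.
    return resolve_tenant(old_node_tenants, "$maintenance").name == "statement" ? 0 : 1;
}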
We'll want a test in test.py (relying on metrics rather than performance).
Yes, that's exactly my plan - a test in test.py (test/topology_experimental_raft - the directory organization there is a mess and there's no good place), which tries to verify that the metrics for the default group don't increase while nothing is happening in it - e.g. imagine writing 1000 rows and then a second later they expire - at that point the work of the maintenance group should increase (we don't know exactly by how much) but the work of the default group shouldn't increase at all.
This code already exists and it just has to be configured with all the different tenants that might be used by callers. Adding a maintenance tenant to the configuration is all that is required to get this to work.
Indeed, this fix will affect writes and even LWT (Paxos and Raft). This is actually good. Not preserving the scheduling group across RPC calls is surprising to say the least. Any change in the scheduling groups should be deliberate.
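To picture the mechanism described above (a sketch under assumptions: tenant_config and tenant_for are illustrative names, not the actual messaging_service API), the sending side selects a tenant name from the caller's current scheduling group and attaches it to the RPC, so the scheduling-group choice survives the hop to the remote node:

#include <string>
#include <utility>
#include <vector>

using scheduling_group_id = int;

// Mirrors the statement_tenants configuration quoted in the diff above:
// {scheduling group, tenant name} pairs, set up once at startup.
struct tenant_config {
    std::vector<std::pair<scheduling_group_id, std::string>> tenants;

    // The caller's current scheduling group selects the tenant name that
    // travels with the RPC verb; the remote node resolves that name back
    // to one of its own groups. A group without a dedicated tenant keeps
    // the pre-patch default.
    std::string tenant_for(scheduling_group_id current) const {
        for (const auto& [group, name] : tenants) {
            if (group == current) {
                return name;
            }
        }
        return "$user";
    }
};

int main() {
    enum : scheduling_group_id { statement = 0, streaming = 1 };
    tenant_config cfg{{{statement, "$user"}, {streaming, "$maintenance"}}};
    // With the patch, work running in the streaming group is tagged
    // "$maintenance" instead of silently becoming "$user" on the remote node.
    return cfg.tenant_for(streaming) == "$maintenance" ? 0 : 1;
}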
These are now unified, for around half a year (starting with 5.4/2024.1).
I will have a look.
See #18749
It only affects queries initiated by the maintenance scheduling group. If it affected post-repair (and post-streaming) view building, that's even more important than TTL; I'm surprised we haven't seen it yet. I don't see how it can affect LWT. It will affect Raft work if it's in the maintenance scheduling group, but that's low bandwidth anyway.
There is a test proposed in #18757, which confirms that this PR works. So this PR is ready to be merged.
It's better to fold the test into the fix, so if we backport it we don't forget the test.
The branch was updated from 18e15d3 to dae5aec.
New in v2:
I wouldn't be surprised if we did see it - every once in a while we do have reports of problems with the post-repair view updates. I think we just folded these issues into the general "MV sucks" sentiment and "everything will be fixed as soon as we change MV's consistency/flow-control/whatever". I already spent way too much time on the Alternator TTL test that proves that before this patch Alternator TTL work "leaked" into the statement group, but probably exactly the same approach can be used to prove (or disprove) that post-repair view build also leaked to the statement group. I wonder if this also affects hint replay on a base table with materialized views.
I'm approving, but because I wrote half of the PR (the test), I can't approve my own patch, so we'll need more approvals.
@scylladb/scylla-maint please review and consider merging. Neither @denesb nor I can do it because we each wrote half the PR...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
🔴 CI State: FAILURE ❌ - Failed Tests (1/270245) - Build Details:
🔴 CI State: FAILURE ❌ - Failed Tests (8/239705) - Build Details:
@nyh the test failed again, please have a look.
Known flaky test #18847, which I need to fix but which is not related to this PR. I'll rerun the CI.
Easier said than done - I can't seem to reach Jenkins so I couldn't restart the CI (I do the usual trick of rebasing the PR because @denesb owns it, not me). @yaronkaikov is Jenkins temporarily down? Some new security protocol I wasn't told about?
It's not down, but very slow. We are checking it.
Another known (but apparently rare) flakiness: #13642
@scylladb/scylla-maint I restarted the CI, but I had a chat with @mykaul and he raised a good point: When we run CI on an important patch (this patch fixes a P1 bug!), and it fails only on an already-known flaky test, what's the point of waiting another workday for the CI to run again? If the CI failed only on a test which we know has a pre-existing problem and is unrelated to the code being changed in the patch, we could consider the CI to have basically passed. What's the point of waiting for the flaky test to succeed, and risking additional flaky tests suddenly failing, ad nauseam? Let's find a way to get this PR committed. We've been "sitting on it" for two weeks already :-(
@yaronkaikov I tried, twice, to restart the CI, and I think it did run (but I'm not sure how to check that now), but nothing changed in the "some checks were not successful" section above, which continues to show some old error. So reviewers keep thinking that this PR is broken, when it isn't. Any idea what I can do? Did I do something wrong?
It seems it's still working https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/9400/ - currently in
Thanks. So why wasn't the first thing it did to remove the old "checks were not successful" marks and replace them with "pending" marks, as usual? I thought this is what usually happens, but maybe I'm misremembering and that "pending" thing only happens in the first run and not in subsequent runs? (If so, why?)
It is not supposed to remove it (or at least that's not what we have been doing until today :-); the status is pending for
Oh, somehow I missed the "Unit Test Custom" line. I looked at the "Unit Test" line which still shows some old failure. Is it because this test is not running now? Why is it not running now?
The flow is as follows:
Oh, I wasn't aware that the post-build steps don't even start (and don't update the test result tab) until the build is done.
🔴 CI State: FAILURE ❌ - Build Details:
This is a good way to find excuses not to address CI flakiness problems.
I'm not sure that a user who has been waiting two weeks for a fix that was already written cares so much about our test flakiness problems :-( It doesn't mean, of course, that we shouldn't fix the tests, but user bugs shouldn't be held "hostage" until unrelated tests are fixed - that also doesn't make sense.
This one is even more interesting - it is failing on a feature that does not exist yet in the version used by the user who is waiting for the fix (in this specific case, topology over Raft).
The failure is in the dtest update_cluster_layout_tests.TestUpdateClusterLayout.test_increment_decrement_counters_in_threads_nodes_restarted. Known rare flakiness for about a year: https://github.com/scylladb/scylla-dtest/issues/3686
Then the tests will never be fixed. You'll always find something more important, or someone who doesn't care so much about our test flakiness problems, and the problem will grow.
@avikivity What do you expect? The issue for the failing test was created. It was assigned to the appropriate team leader. The team leader (hopefully) did the planning, prioritized it and put it in the correct order in their team backlog. It will get addressed in order of priority (there are other issues like that). An unrelated PR getting blocked from being merged will not change a thing about that process. It will not magically spawn more developers to allow backlogs to get cleared faster, or improve the productivity of existing developers. OTOH perhaps delaying PRs from getting merged due to CI failures serves as a backpressure mechanism. The more CI failures we have, the less frequently PRs are merged, so we introduce regressions less frequently, and the backlog grows slower. I can see a point in that. BUT maybe we shouldn't do it by heating the planet (restarting CI again and again just so an unrelated flaky test passes); perhaps we should make it more explicit somehow.
A middle ground would be to allow only bug fixes to get merged (assuming the CI failures are unrelated to the fix) in spite of the CI failures. This is under the assumption that it does make the code better (perhaps fixing other CI failures, etc.). This wouldn't be allowed for improvements/features/enhancements/etc.
I expect priority to be moved from things that are impacted by CI failures to fixing the CI failures.
🔴 CI State: FAILURE ❌ - Failed Tests (6/239705) - Build Details:
The test_item_latency failure is known flakiness which I already fixed in #19080, but that fix wasn't merged yet because of some OTHER flaky test failing its CI. The irony :-( I'll restart the CI.
🟢 CI State: SUCCESS ✅ - Build Details:
From Botond Dénes:

Alternator has a custom TTL implementation. This is based on a loop which scans existing rows in the table, then decides whether each row has reached its end-of-life and deletes it if it has. This work is done in the background, and therefore it uses the maintenance (streaming) scheduling group. However, it was observed that part of this work leaks into the statement scheduling group, competing with user workloads and negatively affecting their latencies. This was found to be caused by the reads and writes done on behalf of the alternator TTL, which lose their maintenance scheduling group when they have to go to a remote node. This is because the messaging service was not configured to recognize the streaming scheduling group when statement verbs like reads or writes are invoked. The messaging service currently recognizes two statement "tenants": the user tenant (statement scheduling group) and system (default scheduling group), as we used to have only user-initiated operations and system (internal) ones. With alternator TTL, there is now a need to distinguish between two kinds of system operation: foreground and background ones. The former should use the system tenant while the latter will use the new maintenance tenant (streaming scheduling group).

This series adds a streaming tenant to the messaging service configuration and it adds a test which confirms that with this change, alternator TTL is entirely contained in the maintenance scheduling group.

Fixes: #18719

- [x] Scans executed on behalf of alternator TTL are running in the statement group, disturbing user workloads; this PR has to be backported to fix this.

Closes #18729

* github.com:scylladb/scylladb:
  alternator, scheduler: test reproducing RPC scheduling group bug
  main: add maintenance tenant to messaging_service's scheduling config