
alternator: keep TTL work in the maintenance scheduling group #18729

Open · wants to merge 2 commits into base: master

Conversation

@denesb (Contributor) commented May 17, 2024

Alternator has a custom TTL implementation. It is based on a loop which scans the existing rows in the table, decides whether each row has reached its end-of-life, and deletes it if it has. This work is done in the background, and therefore it uses the maintenance (streaming) scheduling group. However, it was observed that part of this work leaks into the statement scheduling group, competing with user workloads and negatively affecting their latencies. This was found to be caused by the reads and writes done on behalf of alternator TTL, which lose their maintenance scheduling group when they have to go to a remote node. This happens because the messaging service was not configured to recognize the streaming scheduling group when statement verbs like reads or writes are invoked.

The messaging service currently recognizes two statement "tenants": the user tenant (statement scheduling group) and the system tenant (default scheduling group), as we used to have only user-initiated operations and system (internal) ones. With alternator TTL, there is now a need to distinguish between two kinds of system operation: foreground and background. The former should keep using the system tenant, while the latter will use the new maintenance tenant (streaming scheduling group).
This series adds a streaming tenant to the messaging service configuration, and adds a test which confirms that with this change, alternator TTL is entirely contained in the maintenance scheduling group.

Fixes: #18719

  • Scans executed on behalf of alternator TTL run in the statement group, disturbing user workloads; this PR has to be backported to fix this.

@denesb requested a review from nyh May 17, 2024 12:32
@denesb self-assigned this May 17, 2024
@github-actions bot added the area/internals, backport/5.2, backport/5.4 and backport/6.0 labels May 17, 2024
@denesb (Contributor, Author) commented May 17, 2024

@nyh please review.

I think this is all that is needed to make scans executed on behalf of alternator TTL keep their streaming scheduling group on the remote node. I want to add a test for this but don't know how to proceed. We can maybe use the infrastructure I introduced in #18705 but I don't even know how to trigger alternator TTL to scan a table.
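
Alternator TTL is enabled through the standard DynamoDB UpdateTimeToLive API, and the expiration scanner then periodically deletes expired items in the background. Below is a minimal boto3 sketch of how a test might trigger a scan; the endpoint, credentials, table and attribute names are placeholders, and it assumes the node is started with a short alternator_ttl_period_in_seconds so a scan runs soon after the items expire.

    import time
    import boto3

    # Placeholder endpoint/credentials for a local Alternator node (default port 8000).
    dynamodb = boto3.resource('dynamodb', endpoint_url='http://127.0.0.1:8000',
                              region_name='us-east-1',
                              aws_access_key_id='alternator',
                              aws_secret_access_key='secret_pass')
    client = dynamodb.meta.client

    table = dynamodb.create_table(
        TableName='ttl_demo',
        KeySchema=[{'AttributeName': 'p', 'KeyType': 'HASH'}],
        AttributeDefinitions=[{'AttributeName': 'p', 'AttributeType': 'S'}],
        BillingMode='PAY_PER_REQUEST')
    table.wait_until_exists()

    # Enable TTL: the scanner deletes items whose 'expiration' attribute
    # (a Unix timestamp) lies in the past.
    client.update_time_to_live(
        TableName='ttl_demo',
        TimeToLiveSpecification={'Enabled': True, 'AttributeName': 'expiration'})

    # Write items that expire almost immediately, so the next scan has work to do.
    now = int(time.time())
    with table.batch_writer() as batch:
        for i in range(1000):
            batch.put_item(Item={'p': str(i), 'expiration': now + 1})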

@scylladb-promoter commented:

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest
✅ - dtest with topology changes
✅ - Unit Tests

Build Details:

  • Duration: 6 hr 45 min
  • Builder: i-01b4f25a667ed6a39 (r5ad.8xlarge)

@nyh (Contributor) commented May 19, 2024

@nyh please review.

I think this is all that is needed to make scans executed on behalf of alternator TTL keep their streaming scheduling group on the remote node. I want to add a test for this but don't know how to proceed. We can maybe use the infrastructure I introduced in #18705 but I don't even know how to trigger alternator TTL to scan a table.

Thanks for fixing this! I can't say I fully understand the patch - the patch is trivial, but it doesn't seem to actually "do" anything; there has to be other code (which I'm not familiar with) which makes sure that if the new "tenant" exists, it is actually used in the right way.

I'll try to write a test for this myself. I know how to trigger alternator TTL (we have tests for that already), but:

  1. It will need to be a multi-node test so it needs to be in test/topology*, not test/alternator - this issue is only triggered when sending reads to remote nodes.
  2. I'll need to somehow look at metrics or something to see which scheduling group is used or not used. I'm not sure how easy it will be :-(

By the way, I wonder if this fix only affects Alternator TTL. What about repair? For example, repair with materialized views has a background process of "view building", which writes view updates to remote nodes. Which scheduling group did those remote writes use before, and now? Or maybe writes are a different story from reads anyway?

@nyh (Contributor) commented May 19, 2024

By the way, @denesb, it would have been really nice to have a document in docs/dev which explains how the whole scheduling-inheritance-via-RPC business works. In the past I started docs/dev/isolation.md, but I didn't know how to explain RPC so I left a TODO, referring to commit 8c993e0 which by now is either obsolete or only part of the story.
I also mentioned priority classes (for IO) being separate from scheduling groups (for CPU), and I'm not sure this is still true.
I guess what I'm trying to say is that perhaps I could inspire you to take a look at docs/dev/isolation.md and update it with the things you know and I didn't, or that changed since I wrote it.

@nyh (Contributor) commented May 19, 2024

I'll try to write a test for this myself.

I'm working on this now.
By the way, if I remember correctly, @fruch believes that he saw the TTL performance problems (large latency jumps during the expiration scanning) during Alternator longevity tests with TTL - so he may be able to test this fix even without writing a new test. (although, I still want to write a new shorter test for this).

    scfg.statement_tenants = {
        {dbcfg.statement_scheduling_group, "$user"},
        {default_scheduling_group(), "$system"},
        {dbcfg.streaming_scheduling_group, "$maintenance"},
    };
Review comment (Member):

What happens on upgrade?

I expect it should just work - an old server node will not understand $maintenance and direct it to the statement group.

@denesb (Contributor, Author) replied:

Yes, an old server will just fall back to the default tenant, which is the pre-patch behaviour.

@avikivity (Member) commented:

We'll want a test in test.py (relying on metrics rather than performance).

@nyh (Contributor) commented May 19, 2024

We'll want a test in test.py (relying on metrics rather than performance).

Yes, that's exactly my plan - a test in test.py (test/topology_experimental_raft - the directory organization there is a mess and there's no good place), which checks whether the metrics for the default group increase too much while nothing is happening - e.g. imagine writing 1000 rows which expire a second later; at that point the work of the maintenance group should increase (we don't know exactly how much), but the work of the default group shouldn't increase at all.
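
A rough sketch of that metrics check, assuming Scylla's Prometheus endpoint on port 9180 and a per-scheduling-group CPU-time metric named scylla_scheduler_runtime_ms carrying a group label (the metric, label and group names are assumptions here and should be verified against what the exporter actually reports). The sketch compares the statement group from the cover letter, but the same comparison works for whichever group should stay idle.

    import urllib.request

    METRICS_URL = 'http://127.0.0.1:9180/metrics'  # assumed Prometheus endpoint of one node

    def group_runtime_ms(group):
        # Sum the assumed per-scheduling-group runtime metric across all shards.
        total = 0.0
        with urllib.request.urlopen(METRICS_URL) as resp:
            for line in resp.read().decode().splitlines():
                if line.startswith('scylla_scheduler_runtime_ms') and f'group="{group}"' in line:
                    total += float(line.rsplit(' ', 1)[1])
        return total

    # Snapshot both groups, write ~1000 rows that expire a second later, wait for the
    # expiration scan, and then require that the maintenance (streaming) group did the
    # work while the user-facing statement group stayed (almost) idle.
    before_statement = group_runtime_ms('statement')
    before_streaming = group_runtime_ms('streaming')

    # ... write the soon-to-expire rows and sleep past the next TTL scan here ...

    after_statement = group_runtime_ms('statement')
    after_streaming = group_runtime_ms('streaming')

    assert after_streaming > before_streaming            # the scan ran somewhere
    assert after_statement - before_statement < 50.0     # but not in the statement group (arbitrary threshold)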

@denesb (Contributor, Author) commented May 20, 2024

Thanks for fixing this! I can't say I fully understand the patch - the patch is trivial, but it doesn't seem to actually "do" anything, there has to be other code (which I'm not familiar with) which makes sure that if the new "tenant" exists, it is actually used in the right way.

This code already exists; it just has to be configured with all the different tenants that might be used by callers. Adding a maintenance tenant to the configuration is all that is required to get this to work.

I'll try to write a test for this myself. I know how to trigger alternator TTL (we have tests for that already), but:

1. It will need to be a multi-node test so it needs to be in test/topology*, not test/alternator - this issue is only triggered when sending reads to remote nodes.

2. I'll need to somehow look at metrics or something to see which scheduling group is used or not use. I'm not sure how easy it will be :-(

By the way, I wonder if this fix only affects Alternator TTL. What about repair? For example, repair with materialized views has a background process of "view building", which writes view updates to remote nodes. Which scheduling group did those remote writes use before, and now? Or maybe writes are a different story from reads anyway?

Indeed, this fix will affect writes and even LWT (Paxos and Raft). This is actually good. Not preserving the scheduling group across RPC calls is surprising to say the least. Any change in the scheduling groups should be deliberate.

@denesb (Contributor, Author) commented May 20, 2024

By the way, @denesb, it would have been really nice to have a document in docs/dev which explains how the whole scheduling-inheritance-via-RPC business works. In the past I started docs/dev/isolation.md, but I didn't know how to explain RPC so I left a TODO, referring to commit 8c993e0 which by now is either obsolete or only part of the story.
I also mentioned priority classes (for IO) being separate from scheduling groups (for CPU), and I'm not sure this is still true.

These have been unified for around half a year now (starting with 5.4/2024.1).

I guess what I'm trying to say that perhaps I could inspire you to take a look at docs/dev/isolation.md and update it with the things you know and I didn't, or changed since I wrote it.

I will have a look.

@denesb (Contributor, Author) commented May 20, 2024

I guess what I'm trying to say that perhaps I could inspire you to take a look at docs/dev/isolation.md and update it with the things you know and I didn't, or changed since I wrote it.

I will have a look.

See #18749

@avikivity (Member) commented:

By the way, I wonder if this fix only affects Alternator TTL. What about repair? For example, repair with materialized views has a background process of "view building", which writes view updates to remote nodes. Which scheduling group did those remote writes use before, and now? Or maybe writes are a different story from reads anyway?

Indeed, this fix will affect writes and even LWT (Paxos and Raft). This is actually good. Not preserving the scheduling group across RPC calls is surprising to say the least. Any change in the scheduling groups should be deliberate.

It only affects queries initiated by the maintenance scheduling group. If it affected post-repair (and post-streaming) view building, that's even more important than TTL; I'm surprised we haven't seen it yet.

I don't see how it can affect LWT.

It will affect Raft work if it's in the maintenance scheduling group, but that's low bandwidth anyway.

@denesb (Contributor, Author) commented May 20, 2024

There is a test proposed in #18757, which confirms that this PR works. So this PR is ready to be merged.

@avikivity (Member) commented:

It's better to fold the test into the fix, so if we backport it we don't forget the test.

@denesb force-pushed the massaging-service-streaming-tenant branch from 18e15d3 to dae5aec May 20, 2024 13:55
@denesb changed the title from "main: add maintenance tenant to messaging_service's scheduling config" to "alternator: keep TTL work in the maintenance scheduling group" May 20, 2024
@denesb (Contributor, Author) commented May 20, 2024

New in v2:

  • Include the test contributed by @nyh.
  • Update the title and cover-letter to explain what the problem with alternator TTL was and how this PR fixes it.

@nyh (Contributor) commented May 20, 2024

By the way, I wonder if this fix only affects Alternator TTL. What about repair? For example, repair with materialized views has a background process of "view building", which writes view updates to remote nodes. Which scheduling group did those remote writes use before, and now? Or maybe writes are a different story from reads anyway?

It only affects queries initiated by the maintenance scheduling group. If it affected post-repair (and post-streaming) view building that's even more important than TTL, I'm surprised we haven't seen it yet.

I wouldn't be surprised if we did see it - every once in a while we do have reports of problems with the post-repair view updates. I think we just folded these issues into the general "MV sucks" sentiment and "everything will be fixed as soon as we change MV's consistency/flow-control/whatever".

I already spent way too much time on the Alternator TTL test that proves that before this patch Alternator TTL work "leaked" into the statement group, but probably exactly the same approach can be used to prove (or disprove) that post-repair view build also leaked to the statement group.

I wonder if this also affects hint replay on a base table with materialized views.

@nyh (Contributor) left a review comment:

I'm approving, but because I wrote half of the PR (the test), I can't approve my own patch so we'll need more approvals.

@nyh (Contributor) commented May 22, 2024

@scylladb/scylla-maint please review and consider merging. Neither @denesb nor I can do it because we each wrote half the PR...

@fruch (Contributor) left a review comment:

LGTM

@nyh (Contributor) commented May 22, 2024

@scylladb/scylla-maint please review and consider merging. Neither @denesb nor I can do it because we each wrote half the PR...

@mykaul this fixes the performance problems with Alternator TTL seen by several customers; I think we should try to get it into 6.0.

@scylladb-promoter commented:

🔴 CI State: FAILURE

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 alternator
🔹 topology_experimental_raft/test_mv_tablets
✅ - Container Test
✅ - dtest
✅ - dtest with topology changes
❌ - Unit Tests

Failed Tests (1/270245):

Build Details:

  • Duration: 5 hr 46 min
  • Builder: spider8.cloudius-systems.com

@scylladb-promoter commented:

🔴 CI State: FAILURE

✅ - Build
❌ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 alternator
🔹 topology_experimental_raft/test_mv_tablets

Failed Tests (8/239705):

Build Details:

  • Duration: 9 hr 41 min
  • Builder: i-0933865d427dea4ff (r5ad.8xlarge)

@denesb (Contributor, Author) commented May 30, 2024

@nyh the test failed again, please have a look.
Maybe if we need multiple update rounds for the test, it is better to close this PR and for you to re-open one that you own.

@mykaul added the P1 Urgent label and removed the backport/5.2 label May 30, 2024
@nyh (Contributor) commented Jun 2, 2024

🔴 CI State: FAILURE

✅ - Build ❌ - Unit Tests Custom The following new/updated tests ran 100 times for each mode: 🔹 alternator 🔹 topology_experimental_raft/test_mv_tablets

Failed Tests (8/239705):

* [alternator.test_metrics.debug.9](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/9323/testReport/junit/%28root%29/non-boost%20tests/alternator_test_metrics_debug_9) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+alternator.test_metrics.debug.9)

* [alternator.test_metrics.debug.15](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/9323/testReport/junit/%28root%29/non-boost%20tests/alternator_test_metrics_debug_15) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+alternator.test_metrics.debug.15)

* [alternator.test_metrics.debug.19](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/9323/testReport/junit/%28root%29/non-boost%20tests/alternator_test_metrics_debug_19) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+alternator.test_metrics.debug.19)

* [alternator.test_metrics.debug.26](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/9323/testReport/junit/%28root%29/non-boost%20tests/alternator_test_metrics_debug_26) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+alternator.test_metrics.debug.26)

* [test_item_latency](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/9323/testReport/junit/%28root%29/test_metrics/test_item_latency) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+test_item_latency)

* [test_item_latency](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/9323/testReport/junit/%28root%29/test_metrics/test_item_latency_2) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+test_item_latency)

* [test_item_latency](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/9323/testReport/junit/%28root%29/test_metrics/test_item_latency_3) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+test_item_latency)

* [test_item_latency](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/9323/testReport/junit/%28root%29/test_metrics/test_item_latency_4) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+test_item_latency)

Build Details:

* Duration: 9 hr 41 min

* Builder: i-0933865d427dea4ff (r5ad.8xlarge)

Known flaky test #18847, which I need to fix but not related to this PR. I'll rerun the CI.

@nyh (Contributor) commented Jun 2, 2024

Known flaky test #18847, which I need to fix but not related to this PR. I'll rerun the CI.

Easier said than done - I can't seem to reach Jenkins so I couldn't restart the CI (I do the usual trick of rebasing the PR because @denesb owns it, not me). @yaronkaikov is Jenkins temporarily down? Some new security protocol I wasn't told about?

@yaronkaikov (Contributor) commented:

Known flaky test #18847, which I need to fix but not related to this PR. I'll rerun the CI.

Easier said than done - I can't seem to reach Jenkins so I couldn't restart the CI (I do the usual trick of rebasing the PR because @denesb owns it, not me). @yaronkaikov is Jenkins temporarily down? Some new security protocol I wasn't told about?

It's not down, but very slow. We are checking it.

@nyh (Contributor) commented Jun 2, 2024

🔴 CI State: FAILURE

✅ - Build ✅ - Unit Tests Custom The following new/updated tests ran 100 times for each mode: 🔹 alternator 🔹 topology_experimental_raft/test_mv_tablets ✅ - Container Test ✅ - dtest ✅ - dtest with topology changes ❌ - Unit Tests

Failed Tests (1/270245):

* [test_snapshot_cursor_is_consistent_with_merging](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/9293/testReport/junit/boost/mvcc_test/test_snapshot_cursor_is_consistent_with_merging) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+test_snapshot_cursor_is_consistent_with_merging)

Build Details:

* Duration: 5 hr 46 min

* Builder: spider8.cloudius-systems.com

Another known (but apparently rare) flakiness: #13642

@nyh (Contributor) commented Jun 2, 2024

@scylladb/scylla-maint I restarted the CI, but I had a chat with @mykaul and he raised a good point:

When we run CI on an important patch (this patch fixes a P1 bug!), and it fails only on an already-known flaky test, what's the point of waiting another workday for the CI to run again? If the CI failed only on a test which we know has a pre-existing problem and is unrelated to the code being changed in the patch, we could consider the CI as having basically passed. What's the point of waiting for the flaky test to succeed, and risking additional flaky tests suddenly failing, ad nauseam?

Let's find a way to get this PR committed. We've been "sitting on it" for two weeks already :-(

@nyh (Contributor) commented Jun 2, 2024

@yaronkaikov I tried, twice, to restart the CI, and I think it did run (but not sure how to check it now), but nothing changed in the "some checks were not successful" section above which continues to show some old error. So reviewers keep thinking that this PR is broken, when it isn't. Any idea what I can do? Did I do something wrong?

@yaronkaikov (Contributor) commented:

it seems it's still working https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/9400/ - currently in the Unit Tests Custom stage

@nyh (Contributor) commented Jun 2, 2024

it seems it still working https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/9400/ - currently in Unit Tests Custom stage

Thanks. So why wasn't the first thing it did to remove the old "checks were not successful" marks and replace them with "pending" marks, as usual? I thought this is what usually happens, but maybe I'm misremembering and that "pending" thing only happens in the first run and not subsequent runs? (if so, why?)

@yaronkaikov (Contributor) commented:

it seems it still working https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/9400/ - currently in Unit Tests Custom stage

Thanks. So why wasn't the first thing it did was to remove the old "checks were not successful" and replace them by "pending" marks, as usual? I thought this is what usually happens, but maybe I'm misremembering and that "pending" thing only happens in the first run and not subsequent runs? (if so, why?)

It is not supposed to remove it (or at least that's not what we have been doing until today :-). The status is pending for Unit Tests Custom, which means it's running now. I agree it's not very clear; we can check if there is anything we can do about it.

@nyh (Contributor) commented Jun 2, 2024

It is not supposed to remove it (or at least it's not what we are doing until today :-), the status is pending for Unit Tests Custom so it means it's running now. I agree it can be not very clear, we can check if there is anything we can do about it

Oh, somehow I missed the "Unit Test Custom" line. I looked at the "Unit Test" line which still shows some old failure. Is it because this test is not running now? Why is it not running now?

@yaronkaikov (Contributor) commented Jun 2, 2024

It is not supposed to remove it (or at least it's not what we are doing until today :-), the status is pending for Unit Tests Custom so it means it's running now. I agree it can be not very clear, we can check if there is anything we can do about it

Oh, somehow I missed the "Unit Test Custom" line. I looked at the "Unit Test" line which still shows some old failure. Is it because this test is not running now? Why is it not running now?

the flow is as follows

  1. Build
  2. Unit tests custom (optional, only if code changed for specific tests)
  3. Unit tests, dtest, and dtest with topology changes (running in parallel)

@nyh (Contributor) commented Jun 2, 2024

Oh, I wasn't aware that the post-build steps don't even start (and don't update the test result tab) until the build is done.

@scylladb-promoter commented:

🔴 CI State: FAILURE

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 alternator
🔹 topology_experimental_raft/test_mv_tablets
✅ - Container Test
✅ - dtest
❌ - dtest with topology changes
✅ - Unit Tests

Build Details:

  • Duration: 4 hr 57 min
  • Builder: spider2.cloudius-systems.com

@avikivity (Member) commented:

@scylladb/scylla-maint I restarted the CI, but I had a chat with @mykaul and he raised a good point:

When we run CI on an important patch (this patch fixes a P1 bug!), and it fails only one an already-known flaky test, what's the point of waiting another workday for the CI to run again? If the CI failed only on a test which we know has a pre-existing problem and is unrelated to the code being changed in the patch, we could consider the CI having basically passed. What's the point of waiting for the flaky test to succeed, and risk additional flaky tests suddenly failing, ad nauseum?

Let's find a way to get this PR committed. We've been "sitting on it" for two weeks already :-(

This is a good way to find excuses not to address CI flakiness problems.

@nyh (Contributor) commented Jun 3, 2024

This is a good way to find excuses not to address CI flakiness problems.

I'm not sure that a user who has been waiting two weeks for a fix that was already written cares so much about our test flakiness problems :-( It doesn't mean, of course, that we shouldn't fix the tests, but user bugs shouldn't be held "hostage" until unrelated tests are fixed - that also doesn't make sense.

@mykaul (Contributor) commented Jun 3, 2024

This one is even more interesting - it is failing on a feature that does not exist yet in the version that the user waiting for the fix has (in this specific case, topology over Raft).

@nyh (Contributor) commented Jun 3, 2024

🔴 CI State: FAILURE

✅ - Build ✅ - Unit Tests Custom The following new/updated tests ran 100 times for each mode: 🔹 alternator 🔹 topology_experimental_raft/test_mv_tablets ✅ - Container Test ✅ - dtest ❌ - dtest with topology changes ✅ - Unit Tests

Build Details:

* Duration: 4 hr 57 min

* Builder: spider2.cloudius-systems.com

The failure is in the dtest update_cluster_layout_tests.TestUpdateClusterLayout.test_increment_decrement_counters_in_threads_nodes_restarted

Known rare flakiness for about a year: https://github.com/scylladb/scylla-dtest/issues/3686

@avikivity (Member) commented:

This is a good way to find excuses not to address CI flakiness problems.

I'm not sure that a user who has been waiting two weeks for a fix that was already written cares so much about our tests flakiness problems :-( It doesn't mean, of course, that we shouldn't fix the tests, but user bugs shouldn't be held "hostage" until unrelated tests are fixed - that also doesn't make sense.

Then the tests will never be fixed. You'll always find something more important, or someone who doesn't care so much about our test flakiness problems, and the problem will grow.

@kbr-scylla (Contributor) commented:

@avikivity What do you expect?

The issue for the failing test was created. It was assigned to the appropriate team leader. The team leader (hopefully) did the planning, prioritized it and put it in the correct order in their team backlog. It will get addressed in the order of highest priority (there are other issues like that.)

An unrelated PR getting blocked from being merged will not change a thing about that process. It will not magically spawn more developers to allow backlogs to be cleared faster, or improve the productivity of existing developers.

OTOH perhaps delaying PRs from getting merged due to CI failures serves as a backpressure mechanism. The more CI failures we have, the less frequently PRs get merged, so we introduce regressions less frequently, and the backlog grows slower. I can see a point in that. BUT maybe we shouldn't do it by heating the planet (restarting CI again and again just so an unrelated flaky test passes); perhaps we should make it more explicit somehow.

@mykaul (Contributor) commented Jun 4, 2024

@avikivity What do you expect?

The issue for the failing test was created. It was assigned to the appropriate team leader. The team leader (hopefully) did the planning, prioritized it and put it in the correct order in their team backlog. It will get addressed in the order of highest priority (there are other issues like that.)

And unrelated PR getting blocked from being merged will not change a thing about that process. It will not magically spawn more developers to allow backlogs getting cleared faster. Or improve productivity of existing developers.

OTOH perhaps delaying PRs from getting merged due to CI failures serves as a backpressure mechanism. The more CI failures we have, the less frequent merging PRs is, so we introduce regressions less frequently, and the backlog grows slower. I can see a point in that. BUT maybe we shouldn't do it by heating the planet (restarting CI again and again just so unrelated flaky test passes), perhaps we should make it more explicit somehow.

A middle ground would be to allow only bug fixes to get merged in spite of CI failures (assuming the failures are unrelated to the fix). This is under the assumption that it does make the code better (perhaps fixing other CI failures, etc.). It won't be allowed for improvements/features/enhancements/etc.
It's still a slippery slope. At the end of the day, it is up to the maintainers and the team leads to work with their teams to reduce their backlog of CI issues, and I believe that's a better path than re-runs.

@avikivity (Member) commented:

@avikivity What do you expect?

The issue for the failing test was created. It was assigned to the appropriate team leader. The team leader (hopefully) did the planning, prioritized it and put it in the correct order in their team backlog. It will get addressed in the order of highest priority (there are other issues like that.)

And unrelated PR getting blocked from being merged will not change a thing about that process. It will not magically spawn more developers to allow backlogs getting cleared faster. Or improve productivity of existing developers.

OTOH perhaps delaying PRs from getting merged due to CI failures serves as a backpressure mechanism. The more CI failures we have, the less frequent merging PRs is, so we introduce regressions less frequently, and the backlog grows slower. I can see a point in that. BUT maybe we shouldn't do it by heating the planet (restarting CI again and again just so unrelated flaky test passes), perhaps we should make it more explicit somehow.

I expect priority to be moved from things that are impacted by CI failures, to fixing the CI failures.

@scylladb-promoter commented:

🔴 CI State: FAILURE

✅ - Build
❌ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 alternator
🔹 topology_experimental_raft/test_mv_tablets

Failed Tests (6/239705):

Build Details:

  • Duration: 8 hr 51 min
  • Builder: i-0781cb8cde16924e6 (m5ad.8xlarge)

@nyh (Contributor) commented Jun 9, 2024

🔴 CI State: FAILURE

✅ - Build ❌ - Unit Tests Custom The following new/updated tests ran 100 times for each mode: 🔹 alternator 🔹 topology_experimental_raft/test_mv_tablets

Failed Tests (6/239705):

* [alternator.test_metrics.debug.2](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/9471/testReport/junit/%28root%29/non-boost%20tests/alternator_test_metrics_debug_2) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+alternator.test_metrics.debug.2)

* [alternator.test_metrics.debug.4](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/9471/testReport/junit/%28root%29/non-boost%20tests/alternator_test_metrics_debug_4) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+alternator.test_metrics.debug.4)

* [alternator.test_metrics.debug.33](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/9471/testReport/junit/%28root%29/non-boost%20tests/alternator_test_metrics_debug_33) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+alternator.test_metrics.debug.33)

* [test_item_latency](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/9471/testReport/junit/%28root%29/test_metrics/test_item_latency) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+test_item_latency)

* [test_item_latency](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/9471/testReport/junit/%28root%29/test_metrics/test_item_latency_2) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+test_item_latency)

* [test_item_latency](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/9471/testReport/junit/%28root%29/test_metrics/test_item_latency_3) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+test_item_latency)

Build Details:

* Duration: 8 hr 51 min

* Builder: i-0781cb8cde16924e6 (m5ad.8xlarge)

The test_item_latency failure is known flakiness which I have already fixed in #19080, but that fix wasn't merged yet because of some OTHER flaky test failing its CI. The irony :-(

I'll restart the CI.

@scylladb-promoter commented:

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 alternator
🔹 topology_experimental_raft/test_mv_tablets
✅ - Container Test
✅ - dtest
✅ - dtest with topology changes
✅ - Unit Tests

Build Details:

  • Duration: 4 hr 44 min
  • Builder: spider8.cloudius-systems.com

avikivity added a commit that referenced this pull request Jun 9, 2024
… from Botond Dénes

Alternator has a custom TTL implementation. It is based on a loop which scans the existing rows in the table, decides whether each row has reached its end-of-life, and deletes it if it has. This work is done in the background, and therefore it uses the maintenance (streaming) scheduling group. However, it was observed that part of this work leaks into the statement scheduling group, competing with user workloads and negatively affecting their latencies. This was found to be caused by the reads and writes done on behalf of alternator TTL, which lose their maintenance scheduling group when they have to go to a remote node. This happens because the messaging service was not configured to recognize the streaming scheduling group when statement verbs like reads or writes are invoked. The messaging service currently recognizes two statement "tenants": the user tenant (statement scheduling group) and the system tenant (default scheduling group), as we used to have only user-initiated operations and system (internal) ones. With alternator TTL, there is now a need to distinguish between two kinds of system operation: foreground and background. The former should keep using the system tenant, while the latter will use the new maintenance tenant (streaming scheduling group).
This series adds a streaming tenant to the messaging service configuration, and adds a test which confirms that with this change, alternator TTL is entirely contained in the maintenance scheduling group.

Fixes: #18719

- [x] Scans executed on behalf of alternator TTL run in the statement group, disturbing user workloads; this PR has to be backported to fix this.

Closes #18729

* github.com:scylladb/scylladb:
  alternator, scheduler: test reproducing RPC scheduling group bug
  main: add maintenance tenant to messaging_service's scheduling config
Labels
area/internals · backport/5.4 · backport/6.0 · P1 Urgent
Projects
None yet
Development

Successfully merging this pull request may close these issues.

messaging_service: add streaming/maintenance tenant to statement tenants
9 participants