alternator: keep TTL work in the maintenance scheduling group #18729
base: master
Conversation
@nyh please review. I think this is all that is needed to make scans executed on behalf of alternator TTL keep their streaming scheduling group on the remote node. I want to add a test for this but don't know how to proceed. We can maybe use the infrastructure I introduced in #18705 but I don't even know how to trigger alternator TTL to scan a table.
🟢 CI State: SUCCESS ✅ - Build Details:
Thanks for fixing this! I can't say I fully understand the patch - the patch is trivial, but it doesn't seem to actually "do" anything; there has to be other code (which I'm not familiar with) which makes sure that if the new "tenant" exists, it is actually used in the right way. I'll try to write a test for this myself. I know how to trigger alternator TTL (we have tests for that already), but:
By the way, I wonder if this fix only affects Alternator TTL. What about repair? For example, repair with materialized views has a background process of "view building", which writes view updates to remote nodes. Which scheduling group did those remote writes use before, and now? Or maybe writes are a different story from reads anyway?
By the way, @denesb, it would have been really nice to have a document in docs/dev which explains how the whole scheduling-inheritance-via-RPC business works. In the past I started docs/dev/isolation.md, but I didn't know how to explain RPC so I left a TODO, referring to commit 8c993e0 which by now is either obsolete or only part of the story.
I'm working on this now.
scfg.statement_tenants = {
    {dbcfg.statement_scheduling_group, "$user"},
    {default_scheduling_group(), "$system"},
    {dbcfg.streaming_scheduling_group, "$maintenance"}
};
What happens on upgrade?
I expect it should just work - an old server node will not understand $maintenance and will direct it to the statement group.
Yes, an old server will just fall back to the default tenant, which is the pre-patch behaviour.
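To make the upgrade story concrete, here is a minimal, self-contained sketch of that fallback idea. Everything in it is hypothetical illustration - resolve_tenant, scheduling_group_stub and the map layout are made up for this example, not Scylla's actual messaging_service code:

#include <map>
#include <string>

// Stand-in for a scheduling group; illustration only.
struct scheduling_group_stub { std::string name; };

// The receiving node maps the tenant name carried by the RPC connection
// to one of its local scheduling groups. An old node has no entry for
// "$maintenance", so the lookup falls back to the default tenant.
scheduling_group_stub resolve_tenant(
        const std::map<std::string, scheduling_group_stub>& known_tenants,
        const std::string& requested) {
    if (auto it = known_tenants.find(requested); it != known_tenants.end()) {
        return it->second;
    }
    return known_tenants.at("$user"); // pre-patch behaviour for unknown tenants
}

int main() {
    std::map<std::string, scheduling_group_stub> old_node_tenants{
        {"$user", {"statement"}},
        {"$system", {"default"}},
        // an upgraded node would also have {"$maintenance", {"streaming"}}
    };
    // On the old node, the maintenance work lands in the statement group.
    return resolve_tenant(old_node_tenants, "$maintenance").name == "statement" ? 0 : 1;
}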
We'll want a test in test.py (relying on metrics rather than performance).
Yes, that's exactly my plan - a test in test.py (test/topology_experimental_raft - the directory organization there is a mess and there's no good place), which tries to verify that the metrics for the default group don't increase while nothing is happening in it - e.g. imagine writing 1000 rows and then a second later they expire - at that point the work of the maintenance group should increase (we don't know exactly by how much) but the work of the default group shouldn't increase at all.
This code already exists and it just has to be configured with all the different tenants that might be used by callers. Adding a maintenance tenant to the configuration is all that is required to get this to work.
Indeed, this fix will affect writes and even LWT (Paxos and Raft). This is actually good. Not preserving the scheduling group across RPC calls is surprising to say the least. Any change in the scheduling groups should be deliberate.
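To picture the mechanism described above (a sketch under assumptions: tenant_config and tenant_for are illustrative names, not the actual messaging_service API), the sending side selects a tenant name from the caller's current scheduling group and attaches it to the RPC, so the scheduling-group choice survives the hop to the remote node:

#include <string>
#include <utility>
#include <vector>

using scheduling_group_id = int;

// Mirrors the statement_tenants configuration quoted in the diff above:
// {scheduling group, tenant name} pairs, set up once at startup.
struct tenant_config {
    std::vector<std::pair<scheduling_group_id, std::string>> tenants;

    // The caller's current scheduling group selects the tenant name that
    // travels with the RPC verb; the remote node resolves that name back
    // to one of its own groups. A group without a dedicated tenant keeps
    // the pre-patch default.
    std::string tenant_for(scheduling_group_id current) const {
        for (const auto& [group, name] : tenants) {
            if (group == current) {
                return name;
            }
        }
        return "$user";
    }
};

int main() {
    enum : scheduling_group_id { statement = 0, streaming = 1 };
    tenant_config cfg{{{statement, "$user"}, {streaming, "$maintenance"}}};
    // With the patch, work running in the streaming group is tagged
    // "$maintenance" instead of silently becoming "$user" on the remote node.
    return cfg.tenant_for(streaming) == "$maintenance" ? 0 : 1;
}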
These are now unified, for around half a year (starting with 5.4/2024.1).
I will have a look.
See #18749
It only affects queries initiated by the maintenance scheduling group. If it affected post-repair (and post-streaming) view building, that's even more important than TTL; I'm surprised we haven't seen it yet. I don't see how it can affect LWT. It will affect Raft work if it's in the maintenance scheduling group, but that's low bandwidth anyway.
There is a test proposed in #18757, which confirms that this PR works. So this PR is ready to be merged.
It's better to fold the test into the fix, so if we backport it we don't forget the test.
The branch was updated from 18e15d3 to dae5aec.
New in v2:
I wouldn't be surprised if we did see it - every once in a while we do have reports of problems with the post-repair view updates. I think we just folded these issues into the general "MV sucks" sentiment and "everything will be fixed as soon as we change MV's consistency/flow-control/whatever". I already spent way too much time on the Alternator TTL test that proves that before this patch Alternator TTL work "leaked" into the statement group, but probably exactly the same approach can be used to prove (or disprove) that post-repair view build also leaked to the statement group. I wonder if this also affects hint replay on a base table with materialized views.
I'm approving, but because I wrote half of the PR (the test), I can't approve my own patch, so we'll need more approvals.
@scylladb/scylla-maint please review and consider merging. Neither @denesb nor I can do it because we each wrote half the PR...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
🔴 CI State: FAILURE ❌ - Failed Tests (1/270245) - Build Details:
🔴 CI State: FAILURE ❌ - Failed Tests (8/239705) - Build Details:
@nyh the test failed again, please have a look.
Known flaky test #18847, which I need to fix but which is not related to this PR. I'll rerun the CI.
Easier said than done - I can't seem to reach Jenkins so I couldn't restart the CI (I do the usual trick of rebasing the PR because @denesb owns it, not me). @yaronkaikov is Jenkins temporarily down? Some new security protocol I wasn't told about?
It's not down, but very slow. We are checking it.
Another known (but apparently rare) flakiness: #13642
@scylladb/scylla-maint I restarted the CI, but I had a chat with @mykaul and he raised a good point: When we run CI on an important patch (this patch fixes a P1 bug!), and it fails only on an already-known flaky test, what's the point of waiting another workday for the CI to run again? If the CI failed only on a test which we know has a pre-existing problem and is unrelated to the code being changed in the patch, we could consider the CI to have basically passed. What's the point of waiting for the flaky test to succeed, and risking additional flaky tests suddenly failing, ad nauseam? Let's find a way to get this PR committed. We've been "sitting on it" for two weeks already :-(
@yaronkaikov I tried, twice, to restart the CI, and I think it did run (but I'm not sure how to check that now), but nothing changed in the "some checks were not successful" section above, which continues to show some old error. So reviewers keep thinking that this PR is broken, when it isn't. Any idea what I can do? Did I do something wrong?
It seems it's still working https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/9400/ - currently in
Thanks. So why wasn't the first thing it did to remove the old "checks were not successful" marks and replace them with "pending" marks, as usual? I thought this is what usually happens, but maybe I'm misremembering and that "pending" thing only happens in the first run and not in subsequent runs? (If so, why?)
It is not supposed to remove it (or at least that's not what we have been doing until today :-); the status is pending for
Oh, somehow I missed the "Unit Test Custom" line. I looked at the "Unit Test" line which still shows some old failure. Is it because this test is not running now? Why is it not running now?
The flow is as follows:
Oh, I wasn't aware that the post-build steps don't even start (and don't update the test result tab) until the build is done.
🔴 CI State: FAILURE ❌ - Build Details:
This is a good way to find excuses not to address CI flakiness problems.
I'm not sure that a user who has been waiting two weeks for a fix that was already written cares so much about our test flakiness problems :-( It doesn't mean, of course, that we shouldn't fix the tests, but user bugs shouldn't be held "hostage" until unrelated tests are fixed - that also doesn't make sense.
This one is even more interesting - it is failing on a feature that does not exist yet in the version used by the user who is waiting for the fix (in this specific case, topology over Raft).
The failure is in the dtest update_cluster_layout_tests.TestUpdateClusterLayout.test_increment_decrement_counters_in_threads_nodes_restarted. Known rare flakiness for about a year: https://github.com/scylladb/scylla-dtest/issues/3686
Then the tests will never be fixed. You'll always find something more important, or someone who doesn't care so much about our test flakiness problems, and the problem will grow.
@avikivity What do you expect? The issue for the failing test was created. It was assigned to the appropriate team leader. The team leader (hopefully) did the planning, prioritized it and put it in the correct order in their team backlog. It will get addressed in order of priority (there are other issues like that). An unrelated PR getting blocked from being merged will not change a thing about that process. It will not magically spawn more developers to allow backlogs to get cleared faster, or improve the productivity of existing developers. OTOH perhaps delaying PRs from getting merged due to CI failures serves as a backpressure mechanism. The more CI failures we have, the less frequently PRs are merged, so we introduce regressions less frequently, and the backlog grows slower. I can see a point in that. BUT maybe we shouldn't do it by heating the planet (restarting CI again and again just so an unrelated flaky test passes); perhaps we should make it more explicit somehow.
A middle ground would be to allow only bug fixes to get merged (assuming the CI failures are unrelated to the fix) in spite of the CI failures. This is under the assumption that it does make the code better (perhaps fixing other CI failures, etc.). This wouldn't be allowed for improvements/features/enhancements/etc.
I expect priority to be moved from things that are impacted by CI failures to fixing the CI failures.
🔴 CI State: FAILURE ❌ - Failed Tests (6/239705) - Build Details:
The test_item_latency failure is known flakiness which I already fixed in #19080, but that fix wasn't merged yet because of some OTHER flaky test failing its CI. The irony :-( I'll restart the CI.
🟢 CI State: SUCCESS ✅ - Build Details:
From Botond Dénes:

Alternator has a custom TTL implementation. This is based on a loop which scans existing rows in the table, then decides whether each row has reached its end-of-life and deletes it if it has. This work is done in the background, and therefore it uses the maintenance (streaming) scheduling group. However, it was observed that part of this work leaks into the statement scheduling group, competing with user workloads and negatively affecting their latencies. This was found to be caused by the reads and writes done on behalf of the alternator TTL, which lose their maintenance scheduling group when they have to go to a remote node. This is because the messaging service was not configured to recognize the streaming scheduling group when statement verbs like reads or writes are invoked. The messaging service currently recognizes two statement "tenants": the user tenant (statement scheduling group) and system (default scheduling group), as we used to have only user-initiated operations and system (internal) ones. With alternator TTL, there is now a need to distinguish between two kinds of system operation: foreground and background ones. The former should use the system tenant while the latter will use the new maintenance tenant (streaming scheduling group).

This series adds a streaming tenant to the messaging service configuration and it adds a test which confirms that with this change, alternator TTL is entirely contained in the maintenance scheduling group.

Fixes: #18719

- [x] Scans executed on behalf of alternator TTL are running in the statement group, disturbing user workloads; this PR has to be backported to fix this.

Closes #18729

* github.com:scylladb/scylladb:
  alternator, scheduler: test reproducing RPC scheduling group bug
  main: add maintenance tenant to messaging_service's scheduling config