-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test: test_topology_ops: adapt to tablets #18707
base: master
Are you sure you want to change the base?
Conversation
the test passes locally. but it is failing on jenkins.. looking.. |
in e7d4e08, we reenabled the background writes in this test, but when running with tablets enabled, background writes are still disabled because of scylladb#17025, which was fixed last week. so we can enable background writes with tablets. in this change, * background writes are enabled with tablets. * increase the number of nodes by 1 so that we have enough nodes to fulfill the needs of tablets, which enforces that the number of replicas should always satisfy RF. * pass rf to `start_writes()` explicitly, so we have less magic numbers in the test, and make the data dependencies more obvious. Fixes scylladb#17589 Signed-off-by: Kefu Chai <[email protected]>
1c3d67d
to
23f6535
Compare
🔴 CI State: FAILURE❌ - Build Build Details:
|
Restarted CI build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good if the test passes in CI
🔴 CI State: ABORTED✅ - Build Failed Tests (212/298):
Build Details:
|
the test is failing. looking.. |
But also for the case But for -- and if so, this would mean that the test was already flaky for |
I see:
It's one of the "no ip mapping" issues or a new one. The similar issues are #18676 and #18843. This test fails with tablets disabled and no disrupted topology operations, so I guess it's a new one. |
Damn it :( |
It could be the same root cause as #18843 -- that LEFT nodes are still inserted into @gleb-cloudius explained here #18843 (comment) |
No, without tablets the code is disabled. But I do not see any evidence yet that the failure here with and without tablets are the same. Logs without fail on timeout and with on something else. I will probably take 10 more years for me to download logs, so I cannot check better. |
@tchaikov you aborted twice: once to cancel test run, second time in the middle of artifacts collection phase, so we're missing a lot of logs unfortunately |
`test_topology_ops` is flaky, which has been uncovered by gating in scylladb#18707. However, debugging it is harder than it should be because write workers can flood the logs. They may send a lot of failed writes before the test fails. Then, the log file can become huge, even up to 20 GB. Fix this issue by stopping a write worker after the first error.
`test_topology_ops` is flaky, which has been uncovered by gating in scylladb#18707. However, debugging it is harder than it should be because write workers can flood the logs. They may send a lot of failed writes before the test fails. Then, the log file can become huge, even up to 20 GB. Fix this issue by stopping a write worker after the first error. This test is important for 6.0, so we can backport this change.
@@ -24,7 +24,13 @@ | |||
@pytest.mark.parametrize("tablets_enabled", ["true", "false"]) | |||
async def test_topology_ops(request, manager: ManagerClient, tablets_enabled: bool): | |||
"""Test basic topology operations using the topology coordinator.""" | |||
cfg = {'experimental_features' : ['tablets']} if tablets_enabled else {} | |||
rf = 3 | |||
if tablets_enabled: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't tablets_enabled
is a non empty string here which is always True
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and we have multiple (5 if I'm not wrong) tests broken in the same way. The first bug served as a bad example that caused the other bugs. We need to fix these tests and check if they pass.
The bigger problem with this particular test is that before this PR we didn't send writes if tablets were enabled. But they were always enabled, so we never sent writes. The test was basically disabled. This is very bad for such a basic and important test.
Hopefully, the situation is not that bad. IIRC, I verified that this test was not flaky before incorrectly parameterizing it with tablets around 3 month ago. So, I hope that "missing IP mapping" is the only regression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and we have multiple (5 if I'm not wrong) tests broken in the same way. The first bug served as a bad example that caused the other bugs. We need to fix these tests and check if they pass.
Only one test was broken -- test_default_tombstone_gc
. I opened #18888.
I'll be investigating failures in this test after re-enabling writes. This PR will have to wait until all detected bugs are fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The failures are because the test is always running with tablets (#18707 (comment)) and it is a known problem with tablets #18843
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then, I'll only check if it passes without tablets and write workers re-enabled. We were running the test with write workers always disabled (because tablets were always enabled) for 3 months.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but my observation is quite the opposite , it fails with write load enabled and with tablets disabled. see #18932 . EDIT, please wait. i just rebased it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to move the test from topology_experimental_raft since this suit enables tablets by default. See #18707 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed it in #18900 by changing the config.
The test is in |
`test_topology_ops` is flaky, which has been uncovered by gating in #18707. However, debugging it is harder than it should be because write workers can flood the logs. They may send a lot of failed writes before the test fails. Then, the log file can become huge, even up to 20 GB. Fix this issue by stopping a write worker after the first error. This test is important for 6.0, so we can backport this change. Closes #18851
`test_topology_ops` is flaky, which has been uncovered by gating in #18707. However, debugging it is harder than it should be because write workers can flood the logs. They may send a lot of failed writes before the test fails. Then, the log file can become huge, even up to 20 GB. Fix this issue by stopping a write worker after the first error. This test is important for 6.0, so we can backport this change. (cherry picked from commit 4dd2b7a)
if i didn't abort the test, the test node would be full of log files and running out of space. i spotted this the other time. |
#18851 fixed the huge log file issue. Anyway, we now know that the test is flaky only with tablets enabled because of #18843. This PR is blocked by that issue. |
@patjed41 thanks Patryk. so at least we can bring the write load back. i just created #18932 to implement it. but it fails locally with the same symptom of #18843
|
@tchaikov #18900 should solve all problems in this test. So, if I'm not wrong, this PR is ready and is only waiting for a fix for #18843. Then, we can run CI again and see if it passes. If it does, we can merge this PR. You are probably confused because I touched this test even though you are assigned to this issue. But because of the blunders fixed by #18900, we were running this test only with tablets and with write workers disabled for 3 months. We wanted to verify raft-topology ASAP. |
@patjed41 so, may i assign this ticket back to you? since you've done the heavy lifting. EDIT, oh, wait. so it's just a step away. i am fine either way. |
This issue is about ensuring that this test passes with tablets and write workers enabled. If the test fails again after fixing #18843, it should be investigated by someone with tablets knowledge (I don't really have it). Maybe we shouldn't change the assignment. |
`test_topology_ops` is flaky, which has been uncovered by gating in #18707. However, debugging it is harder than it should be because write workers can flood the logs. They may send a lot of failed writes before the test fails. Then, the log file can become huge, even up to 20 GB. Fix this issue by stopping a write worker after the first error. This test is important for 6.0, so we can backport this change. (cherry picked from commit 7c1e6ba) Closes #18914
in e7d4e08, we reenabled the background writes in this test, but when running with tablets enabled, background writes are still disabled because of #17025, which was fixed last week. so we can enable background writes with tablets.
in this change,
start_writes()
explicitly, so we have less magic numbers in the test, and make the data dependencies more obvious.Fixes #17589
Signed-off-by: Kefu Chai [email protected]