[aarch64, debug] topology_experimental_raft/test_tablets failed with ReadTimeout #18718

Closed
scylladb-promoter opened this issue May 17, 2024 · 2 comments
Labels: area/tablets, symptom/ci stability, tests/flaky, triage/master

Comments

@scylladb-promoter
Contributor

https://jenkins.scylladb.com/job/scylla-master/job/next/7683/ failed with the following error:


=================================== FAILURES ===================================
______________________________ test_tablet_split _______________________________

manager = <test.pylib.manager_client.ManagerClient object at 0xffff486ddb10>

    @pytest.mark.asyncio
    @skip_mode('release', 'error injections are not supported in release mode')
    async def test_tablet_split(manager: ManagerClient):
        logger.info("Bootstrapping cluster")
        cmdline = [
            '--logger-log-level', 'storage_service=debug',
            '--logger-log-level', 'table=debug',
            '--target-tablet-size-in-bytes', '1024',
        ]
        servers = [await manager.server_add(cmdline=cmdline)]
    
        await manager.api.disable_tablet_balancing(servers[0].ip_addr)
    
        cql = manager.get_cql()
        await cql.run_async("CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 1};")
        await cql.run_async("CREATE TABLE test.test (pk int PRIMARY KEY, c int);")
    
        # enough to trigger multiple splits with max size of 1024 bytes.
        keys = range(256)
        await asyncio.gather(*[cql.run_async(f"INSERT INTO test.test (pk, c) VALUES ({k}, {k});") for k in keys])
    
        async def check():
            logger.info("Checking table")
            cql = manager.get_cql()
            rows = await cql.run_async("SELECT * FROM test.test;")
            assert len(rows) == len(keys)
            for r in rows:
                assert r.c == r.pk
    
        await check()
    
        await manager.api.flush_keyspace(servers[0].ip_addr, "test")
    
        tablet_count = await get_tablet_count(manager, servers[0], 'test', 'test')
        assert tablet_count == 1
    
        logger.info("Adding new server")
        servers.append(await manager.server_add(cmdline=cmdline))
    
        # Increases the chance of tablet migration concurrent with split
        await inject_error_one_shot_on(manager, "tablet_allocator_shuffle", servers)
        await inject_error_on(manager, "tablet_load_stats_refresh_before_rebalancing", servers)
    
        s1_log = await manager.server_open_log(servers[0].server_id)
        s1_mark = await s1_log.mark()
    
        # Now there's a split and migration need, so they'll potentially run concurrently.
        await manager.api.enable_tablet_balancing(servers[0].ip_addr)
    
        await check()
        time.sleep(5) # Give load balancer some time to do work
    
        await s1_log.wait_for('Detected tablet split for table', from_mark=s1_mark)
    
>       await check()

test/topology_experimental_raft/test_tablets.py:723: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    async def check():
        logger.info("Checking table")
        cql = manager.get_cql()
>       rows = await cql.run_async("SELECT * FROM test.test;")
E       cassandra.ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out for test.test - received only 0 responses from 1 CL=LOCAL_QUORUM." info={'consistency': 'LOCAL_QUORUM', 'required_responses': 1, 'received_responses': 0}

test/topology_experimental_raft/test_tablets.py:693: ReadTimeout
------------------------------ Captured log setup ------------------------------
@scylladb-promoter added the symptom/ci stability, tests/flaky, and triage/master labels May 17, 2024
@bhalevy
Member

bhalevy commented May 19, 2024

@denesb / @raphaelsc I don't know if and how the failure is related to tablet splitting, but the test name suggests they might be related.
Can you please look into this?

@bhalevy assigned denesb and raphaelsc and unassigned bhalevy May 19, 2024
@raphaelsc
Member

dup of #18085, closing.
