
replicaof no one hangs on replica when connectivity to master is broken #2907

Closed
safa-topal opened this issue Apr 15, 2024 · 20 comments · Fixed by #2920
Assignees
Labels
bug Something isn't working

Comments

@safa-topal

Describe the bug
In a failover scenario to the replica instance, when a replica can no longer connect to its master, the replicaof no one command fails to return OK and instead hangs for a relatively long time. Observed in another shell that the replica seems to have accepted the config change and reports itself as master, so the command actually works. A simple workaround would be to handle the timeout and ignore it; however, that might later cause false negatives when other timeouts happen for more genuine reasons.

To Reproduce
Steps to reproduce the behavior:

  1. Set up a master and a replica DF instance.
  2. Break the network connectivity between the master and replica processes.
  3. Issue replicaof no one on the replica node.
  4. See that the command hangs for a relatively long time and the client times out.
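Assuming both processes run locally (ports 6379 for the master and 6380 for the replica are illustrative values, not from the report), the steps above might be sketched as:

```shell
# 1. Start a master and a replica locally (binary name and ports are assumptions)
dragonfly --port 6379 &
dragonfly --port 6380 &
redis-cli -p 6380 replicaof localhost 6379

# 2. Break connectivity toward the master only (requires root)
iptables -A INPUT -p tcp --dport 6379 -j DROP

# 3./4. On the replica, this call hangs instead of returning OK immediately
redis-cli -p 6380 replicaof no one
```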

Expected behavior
The server should return an OK response to the client.

Environment:

  • Dragonfly Version: 1.16.1

Additional context
1.14.5 doesn't have the same issue.

@safa-topal safa-topal added the bug Something isn't working label Apr 15, 2024
@kostasrim
Contributor

Hi @safa-topal, thank you for reporting this. I will take a look and come back :)

@kostasrim kostasrim self-assigned this Apr 15, 2024
@kostasrim
Contributor

kostasrim commented Apr 15, 2024

Hi @safa-topal, it doesn't seem to reproduce on my side so maybe I am doing something wrong?

So:

  1. I started DF master and replica
  2. I debug populated master via redis-cli
  3. I killed the master process which made replica try to reconnect
  4. I ran replicaof no one and it returned OK immediately

@safa-topal
Author

Hi @kostasrim,

I can reproduce it with roughly the same flow (except that on the 3rd step I cut connectivity with an iptables config instead of killing the master process).

Curious, are there any logs indicating that the replica is trying to reconnect to the master? I don't see any such logs during my tests.

@kostasrim
Contributor

Hi @safa-topal, I can retry with an iptables config instead, but I am quite packed for the rest of the day.

I used --alsologtostderr and it should print that the replica is trying to re-establish the connection with the master.

p.s. I doubt that the iptables config is the problem, but I should verify

@safa-topal
Author

I also don't think iptables makes a difference in this case.

I have --alsologtostderr enabled, and it made no difference to the logs after the network failure; nothing is logged about retrying the connection with the master.

@kostasrim
Contributor

kostasrim commented Apr 16, 2024

Your logs should be filled with entries like:

E20240416 16:10:54.929634 28736 replica.cc:196] Error connecting to localhost:6379 system:111
I20240416 16:10:55.429829 28736 protocol_client.cc:180] Resolved endpoint localhost:6379 to 127.0.0.1:6379
E20240416 16:10:55.429966 28736 protocol_client.cc:221] Error while calling sock_->Connect(server_context_.endpoint): Connection refused
E20240416 16:10:55.430018 28736 replica.cc:196] Error connecting to localhost:6379 system:111
I20240416 16:10:55.930435 28736 protocol_client.cc:180] Resolved endpoint localhost:6379 to 127.0.0.1:6379

It is suspicious that you don't get them. I will try with iptables and ping back.

@safa-topal
Author

Yes, it looks suspicious. There's no activity in the replica's logs until my code runs replicaof no one after sensing that the master is having connectivity issues:

Apr 16 13:14:48 dragonfly-24c3a611-7 taskset[2005]: I20240416 13:14:48.620872  2006 protocol_client.cc:180] Resolved endpoint dragonfly-24c3a611-6.testproject-dw2f.aiven.local:28849 to fda7:a938:5bfe:5fa6:0:4a9:374:f236:28849
Apr 16 13:14:48 dragonfly-24c3a611-7 taskset[2005]: I20240416 13:14:48.628690  2006 replica.cc:536] Started full sync with dragonfly-24c3a611-6.testproject-dw2f.aiven.local:28849
Apr 16 13:14:48 dragonfly-24c3a611-7 taskset[2005]: I20240416 13:14:48.629586  2006 replica.cc:556] full sync finished in 3 ms
Apr 16 13:14:48 dragonfly-24c3a611-7 taskset[2005]: I20240416 13:14:48.629644  2006 replica.cc:627] Transitioned into stable sync
Apr 16 13:16:27 dragonfly-24c3a611-7 taskset[2005]: I20240416 13:16:27.574677  2006 server_family.cc:2378] Replicating NO ONE

@kostasrim
Contributor

I am pretty sure there is a logical explanation about this. Let me get back to you :)

@romange
Collaborator

romange commented Apr 16, 2024

@safa-topal what happens if you kill/stop the master without the iptables?

@safa-topal
Author

@romange in that case, what happens is identical to what @kostasrim experiences:

Apr 16 14:29:09 dragonfly-24c3a611-8 taskset[68]: I20240416 14:29:09.430425    70 replica.cc:657] Exit stable sync
Apr 16 14:29:09 dragonfly-24c3a611-8 taskset[68]: W20240416 14:29:09.430497    70 replica.cc:244] Error stable sync with dragonfly.local:28849 system:103 Software caused connection abort
Apr 16 14:29:09 dragonfly-24c3a611-8 taskset[68]: I20240416 14:29:09.930770    70 protocol_client.cc:180] Resolved endpoint dragonfly.local:28849 to fda7:a938:5bfe:5fa6:0:4a9:afc:ba6e:28849
Apr 16 14:29:09 dragonfly-24c3a611-8 taskset[68]: E20240416 14:29:09.931649    70 replica.cc:196] Error connecting to dragonfly.local:28849 system:111

@romange
Collaborator

romange commented Apr 16, 2024

When you use iptables, the replica does not know that the master is dead; it relies on TCP keepalive settings to recognize a closed socket. And I just checked: we do not actually configure TCP keepalive on our replication connections on the replica; instead we have the following comment https://github.com/dragonflydb/dragonfly/blob/main/src/server/protocol_client.cc#L225
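As background: with stock Linux settings, even a socket that does have keepalive enabled idles for roughly two hours before the kernel starts probing the peer, so a silently-dropped connection goes unnoticed for a long time. The kernel defaults can be inspected with (Linux-specific sketch; values vary per distribution):

```shell
# Linux defaults for TCP keepalive; typical stock values noted in comments
cat /proc/sys/net/ipv4/tcp_keepalive_time    # idle seconds before the first probe (commonly 7200)
cat /proc/sys/net/ipv4/tcp_keepalive_intvl   # seconds between probes (commonly 75)
cat /proc/sys/net/ipv4/tcp_keepalive_probes  # unanswered probes before the connection is declared dead (commonly 9)
```

Note these kernel settings only take effect if SO_KEEPALIVE is actually set on the socket, which is exactly what the linked comment says is not done.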

@kostasrim
Contributor

@romange I remember Roy experimented with those... I can take a look tomorrow

@romange
Collaborator

romange commented Apr 16, 2024

@kostasrim I do not think they will help with the original issue of replicaof no one being stuck. I would actually check at which step it gets stuck when performing this command. My guess is the socket->Shutdown call, but I am not sure.
@safa-topal could you please tell us how exactly you block traffic with iptables?

@romange
Collaborator

romange commented Apr 16, 2024

When you use iptables, the replica does not know that the master is dead; it relies on TCP keepalive settings to recognize a closed socket. And I just checked: we do not actually configure TCP keepalive on our replication connections on the replica; instead we have the following comment https://github.com/dragonflydb/dragonfly/blob/main/src/server/protocol_client.cc#L225

This explains why we do not see the "reconnect" logs, but it does not explain why replicaof no one hangs.

@kostasrim
Contributor

After some digging with iptables I managed to reproduce it :) I will debug and patch this.

@kostasrim
Contributor

Hi @safa-topal, just to be 100% sure, you dropped the connection via:

sudo iptables -A INPUT -s 127.0.0.1 -p tcp --dport 6379 -j DROP or something similar?

I found the bug and patched it but I wanted to double check.

@safa-topal
Author

@kostasrim, yes, that looks similar. These are the iptables directives I've used:

-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
-A OUTPUT -p tcp --sport 22 -m state --state ESTABLISHED -j ACCEPT
-A INPUT -j DROP
-A OUTPUT -j DROP

@kostasrim
Contributor

@safa-topal hmm, I wonder if we see the same issue now, because:

  1. -A INPUT -j DROP drops all packets destined for the host computer.
  2. -A OUTPUT -j DROP drops all packets originating from the host computer.

Both of them have side effects in this context:

For (1), redis-cli will stop working, so issuing replicaof no one will never reach the replica because the packets are dropped when received.
For (2), redis-cli will stop working, because the packets are dropped when they are sent.

Also, both of them will make the system unstable (on my Ubuntu it crashes a few things), as they drop all kinds of packets. So in this context I think these two rules are improper (meaning that my patch won't fix them, since the issue is with how iptables is used).

Also, neither of them works with version 1.14.5.

Now for:

-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
-A OUTPUT -p tcp --sport 22 -m state --state ESTABLISHED -j ACCEPT

These don't actually have an effect here, because they only exempt SSH traffic before everything else is DROPped; taken standalone they have no impact. Whereas if you use:

sudo iptables -A INPUT -s 127.0.0.1 -p tcp --dport 6379 -j DROP

it would fail (now it's patched) because it blocks all incoming traffic to the master.
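Since iptables evaluates rules in order and stops at the first match, a narrower drop scoped to the service port alone would break replication connectivity without destabilizing the rest of the host. A sketch (6379 standing in for the master's actual port):

```shell
# On the master: drop only traffic on the service port (requires root)
iptables -A INPUT -p tcp --dport 6379 -j DROP
iptables -A OUTPUT -p tcp --sport 6379 -j DROP

# Remove the same rules afterwards to restore connectivity
iptables -D INPUT -p tcp --dport 6379 -j DROP
iptables -D OUTPUT -p tcp --sport 6379 -j DROP
```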

Let me know if you see something different. I just want to be 100% sure we are on the same page.

@safa-topal
Author

Hi @kostasrim

Yes, there's something different in my scenario compared to what you explain above. I run these iptables rules on the master node only, so the replica is not affected at all. Therefore the statements below are not true for requests made against the replica; they would only be true if my code were making requests to the master node, which is not the case:

For (1), redis-cli will stop working, so issuing replicaof no one will never reach the replica because the packets are dropped when received.
For (2), redis-cli will stop working, because the packets are dropped when they are sent.

Once the master gets this iptables configuration, my failover code executes and the replica gets promoted to master. There are no issues with connections to the replica at this stage.

@kostasrim
Contributor

Oh I see, if it's specific to the master instance then it has similar semantics to my config. I expect my patch to work :)
