Describe the bug
After the 2 worker nodes on which 2 Redis primaries were running were rebooted using the reboot command, the replicas printed the log "Error condition on socket for SYNC: Connection refused" and then kept printing "Connection with Node b48b97dc848d1b6c730261c1990e8334d5d3faed at 10.244.174.199:16379 failed: Connection refused" for almost 20 seconds; only after that was an election triggered. cluster-node-timeout is 15 seconds.
We have 5 primaries with 2 replicas each.
Why did the Redis cluster replica pods keep printing the log "Connection with Node 216dd660bda9f137882e24405f748f449937ea10 at 10.244.174.175:16379 failed: Connection refused" instead of triggering an election?
To reproduce
This may not always be reproducible, but we would like to minimize such occurrences.
1. Reboot 2 worker nodes on which 2 Redis primaries are running.
2. Use the reboot command to reboot the worker nodes.
Expected behavior
- The Redis cluster should have triggered an election immediately, and a replica should have begun the election process.
Additional information
Worker nodes 8tgsd and zf4kq were rebooted.
+6tlhx on worker 8tgsd, -ncvcm, -chc9n on worker zf4kq
+8cd22 on worker zf4kq -ghmsv
6tlhx: (10.244.164.132) worker 8tgsd on Zone 3
I0208 17:19:30.079887 1 log.go:16] 22:signal-handler (1707412770) Received SIGTERM scheduling shutdown...
I0208 17:19:30.079910 1 main.go:50] signal received: terminated, stop the process
Receive Stop Signal... (2024-02-08T17:19:30.080048648Z)
ncvcm (replica of 6tlhx)
22:S 08 Feb 2024 17:19:27.484 - Connection with Node b48b97dc848d1b6c730261c1990e8334d5d3faed at 10.244.174.199:16379 failed: Connection refused --> this is the other primary, 8cd22, which was down
22:S 08 Feb 2024 17:19:27.785 - Connection with Node 216dd660bda9f137882e24405f748f449937ea10 at 10.244.174.175:16379 failed: Connection refused --> replica (chc9n) of 6tlhx
22:S 08 Feb 2024 17:19:30.450 * Reconnecting to MASTER 10.244.164.132:6379
22:S 08 Feb 2024 17:19:30.450 # Error condition on socket for SYNC: Connection refused
22:S 08 Feb 2024 17:19:32.612 * Connecting to MASTER 10.244.164.132:6379
22:S 08 Feb 2024 17:19:47.277 - Connection with Node b48b97dc848d1b6c730261c1990e8334d5d3faed at 10.244.174.199:16379 failed: Connection refused --> 338 hits of this log up to this timestamp; not sure why so many retries to the other primary node
22:S 08 Feb 2024 17:19:47.277 - Connection with Node 216dd660bda9f137882e24405f748f449937ea10 at 10.244.174.175:16379 failed: Connection refused --> there are 333 hits
22:S 08 Feb 2024 17:19:52.696 # Start of election delayed for 709 milliseconds (rank #0, offset 5579473755)
22:S 08 Feb 2024 17:19:53.501 # Failover election won: I'm the new master
chc9n: (replica of 6tlhx) worker zf4kq on Zone 2
22:S 08 Feb 2024 17:19:21.532 - DB 0: 56907 keys (0 volatile) in 65536 slots --> last log; looks like it was shut down abruptly
8cd22: worker zf4kq on Zone 2
18:M 08 Feb 2024 17:19:24.214 - Client closed connection id=21523 addr=127.0.0.1:48806 laddr=127.0.0.1:6379 --> last log; no SIGTERM log found, looks like it was shut down abruptly
ghmsv (replica of 8cd22)
17:S 08 Feb 2024 17:19:27.400 * Reconnecting to MASTER 10.244.174.199:6379
17:S 08 Feb 2024 17:19:27.400 * MASTER <-> REPLICA sync started
17:S 08 Feb 2024 17:19:27.400 # Error condition on socket for SYNC: Connection refused
17:S 08 Feb 2024 17:19:27.400 - Connection with Node b48b97dc848d1b6c730261c1990e8334d5d3faed at 10.244.174.199:16379 failed: Connection refused --> first "Connection refused" with its own primary
17:S 08 Feb 2024 17:19:27.902 - Connection with Node 216dd660bda9f137882e24405f748f449937ea10 at 10.244.174.175:16379 failed: Connection refused --> the other replica of 6tlhx
17:S 08 Feb 2024 17:19:47.291 - Connection with Node b48b97dc848d1b6c730261c1990e8334d5d3faed at 10.244.174.199:16379 failed: Connection refused --> last such log, with 199 hits
17:S 08 Feb 2024 17:19:47.291 - Connection with Node 216dd660bda9f137882e24405f748f449937ea10 at 10.244.174.175:16379 failed: Connection refused --> last such log, with 195 hits
17:S 08 Feb 2024 17:19:48.696 # Start of election delayed for 678 milliseconds (rank #0, offset 5626765517).
17:S 08 Feb 2024 17:19:49.400 # Failover election won: I'm the new master.
Comment From: madolson
22:S 08 Feb 2024 17:19:30.450 * Reconnecting to MASTER 10.244.164.132:6379
22:S 08 Feb 2024 17:19:30.450 # Error condition on socket for SYNC: Connection refused
This is fine; it is just a statement that the replica was trying to resume replication with its master. This is a bit of a red herring.
Comment From: madolson
b48b97dc848d1b6c730261c1990e8334d5d3faed at 10.244.174.199:16379 failed: Connection refused for almost 20 seconds and only after that was an election triggered. cluster-node-timeout is 15 seconds.
So, to make sure I understand this correctly: everything you expected to happen did happen, it just happened more slowly than you expected. The cluster-node-timeout is the minimum amount of time for a failover, so a ~20 second failure with a 15 second timeout isn't outrageous, just a bit slow.
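For reference, this is the setting in question; a minimal redis.conf fragment matching the values described in this thread (illustrative, not a recommendation):

```
# redis.conf -- Redis Cluster failure-detection window, in milliseconds.
# A node unreachable for longer than this is suspected failed (PFAIL);
# since a failover can only begin after this window elapses, it is also
# a floor on the total failover time.
cluster-node-timeout 15000
```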
Comment From: madolson
Do you have logs from the other masters? They contain more information, such as when the other masters concluded this master was dead, which would be helpful in understanding the situation.
I would also like to ask generally, why did you issue reboot on the masters?
Comment From: Anuradha-26
Thanks @madolson, posting a few things again here for clarity:
Redis version: Redis 7.0.12
We are testing some resiliency scenarios, as part of which 2 worker nodes were rebooted. Both of these workers hosted 2 primary pods. cluster-node-timeout: 15 seconds.
Pasting logs from the other primaries:
rp6v6: (another primary)
22:M 08 Feb 2024 17:19:27.448 - Connection with Node b48b97dc848d1b6c730261c1990e8334d5d3faed at 10.244.174.199:16379 failed: Connection refused
22:M 08 Feb 2024 17:19:48.637 * FAIL message received from 21d8621cee545002fb82d746de1cf50aaff1db40 about b48b97dc848d1b6c730261c1990e8334d5d3faed --> this node receives the first FAIL message only at this time, which is strange
22:M 08 Feb 2024 17:19:51.094 * Marking node 40a7c5e8ff0d962d79086c4aaca28335e248447a as failing (quorum reached).
2vhhk: (another primary)
22:M 08 Feb 2024 17:19:27.406 - Connection with Node b48b97dc848d1b6c730261c1990e8334d5d3faed at 10.244.174.199:16379 failed: Connection refused
22:M 08 Feb 2024 17:19:48.637 - Node 21d8621cee545002fb82d746de1cf50aaff1db40 reported node b48b97dc848d1b6c730261c1990e8334d5d3faed as not reachable.
22:M 08 Feb 2024 17:19:48.637 * Marking node b48b97dc848d1b6c730261c1990e8334d5d3faed as failing (quorum reached).
6xlkk:(another primary)
19:M 08 Feb 2024 17:19:27.488 - Connection with Node b48b97dc848d1b6c730261c1990e8334d5d3faed at 10.244.174.199:16379 failed: Connection refused
19:M 08 Feb 2024 17:19:48.637 - Node 1f29c79918b23aa8deaf212753ef835ccffa1fd3 reported node b48b97dc848d1b6c730261c1990e8334d5d3faed as not reachable.
19:M 08 Feb 2024 17:19:48.637 * Marking node b48b97dc848d1b6c730261c1990e8334d5d3faed as failing (quorum reached).
If you notice, at 17:19:27 itself "Connection refused" was observed for the primary b48b97dc848d1b6c730261c1990e8334d5d3faed,
but only at 17:19:48 do the other primaries print a quorum-reached message (21 seconds later).
1) What I know about cluster-node-timeout is that it is the maximum amount of time a node can remain unreachable or in an error state. In this case, as it's 15 seconds, I expected the cluster to trigger an election immediately after 15 seconds. Why were an additional 6 seconds taken, and for what? What other Redis configurations can we fine-tune?
2) Does a replica start an election only after the other primaries conclude that a primary is down (by printing quorum reached)?
Comment From: madolson
In this case, as it's 15 seconds, I expected the cluster to trigger an election immediately after 15 seconds. Why were an additional 6 seconds taken, and for what? What other Redis configurations can we fine-tune?
I agree, the extra 6 seconds is weird, and I don't immediately have an answer based on the logs (all the primaries seem to agree the node was dead). I won't get to this today, but I'll do some local testing to see what the variability of failover time is. The architecture itself is pretty limited in how fast it can trigger failovers; we're trying to revamp it. In the meantime, you might try reducing cluster-node-timeout to see if elections are triggered fast enough for your needs.
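As a sketch of that experiment (assuming direct redis-cli access to each node; cluster-node-timeout is a runtime-modifiable parameter via CONFIG SET, so no restart is needed):

```shell
# Lower the failure-detection window to 5 seconds on one node, then
# persist the change to redis.conf. Repeat against every node in the
# cluster so all nodes agree on the timeout.
redis-cli -h 10.244.164.132 -p 6379 config set cluster-node-timeout 5000
redis-cli -h 10.244.164.132 -p 6379 config rewrite
```

The host shown is just one of the pod IPs from the logs above; the same commands would need to be run against all nodes.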
Does a replica start an election only after the other primaries conclude that a primary is down (by printing quorum reached)?
Yes. Replicas only nominate themselves once a quorum of primaries believes that the replica's primary is dead. In this case, the promotion after quorum was reached seems pretty fast.
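The "Start of election delayed for 709 milliseconds (rank #0, ...)" lines in the logs above match the delay formula described in the Redis Cluster specification: a fixed 500 ms, plus a random 0–500 ms of jitter, plus 1000 ms per unit of replica rank. A minimal sketch of that computation (my own illustration of the spec's formula, not Redis source code):

```python
import random

def election_delay_ms(rank: int) -> int:
    """Delay before a replica starts a failover election, per the
    Redis Cluster spec: a fixed 500 ms (to let the FAIL state
    propagate), plus random 0-500 ms jitter (to spread out competing
    replicas), plus 1000 ms per rank unit (replicas with older
    replication offsets wait longer, so the most up-to-date wins)."""
    return 500 + random.randrange(500) + rank * 1000

# A rank-0 replica (most up-to-date) always starts within 500-999 ms,
# consistent with the 709 ms and 678 ms delays seen in the logs above.
delays = [election_delay_ms(0) for _ in range(1000)]
print(min(delays) >= 500 and max(delays) < 1000)  # → True
```

This delay is counted from when the replica learns its primary is in FAIL state, which is why the quorum-reached timestamp, not the first connection error, is the starting point for the election.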
If you could also re-run the test with loglevel = debug, we'll also be able to see when each individual node marks the master as failed.
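For reference, that is the standard redis.conf directive (it can also be applied at runtime with `redis-cli config set loglevel debug`):

```
# redis.conf -- most verbose log level; at debug, each node logs
# its own failure-detection decisions in detail.
loglevel debug
```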
Comment From: Anuradha-26
Thanks @madolson, we set cluster-node-timeout to 6 seconds, and this time the election was triggered after ~6 seconds.