Describe the bug

We experienced strange behavior during the initial deployment of a 3-node Redis cluster (deployed through Helm). The sentinel of the first pod (redis-node-0) identifies one master (10.244.2.77). A few seconds later, still during the initial deployment, the sentinel of the second pod (redis-node-1) does not find the master identified by node-0's sentinel, so it elects another instance as master (10.244.1.125) and shortly afterwards marks node-0's master (x.x.x.77) for conversion to slave. The third pod (redis-node-2) picks up the master elected by node-1's sentinel and configures itself accordingly. A few seconds later, node-0 can no longer reach its own master and marks it as down.

Within roughly the first minute of startup, the cluster ends up in the following state:

- node-0's sentinel still has x.x.x.77 marked as master, in a down state.
- node-1's sentinel has x.x.x.125 marked as master.
- node-2's sentinel has x.x.x.125 marked as master.
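The divergent views can be confirmed by querying each pod's sentinel directly (rough sketch; the namespace and container name are taken from our deployment, and `$REDIS_PASSWORD` is assumed to hold the sentinel password, which the chart scripts pass as `REDISCLI_AUTH=$PASS`):

```
# Ask each pod's sentinel which address it currently believes is the master.
for pod in redis-node-0 redis-node-1 redis-node-2; do
  echo "--- $pod ---"
  kubectl -n app exec "$pod" -c sentinel -- \
    env REDISCLI_AUTH="$REDIS_PASSWORD" \
    redis-cli -p 26379 sentinel get-master-addr-by-name mymaster
done
```

In the broken state, node-0 answers with 10.244.2.77 while node-1 and node-2 answer with 10.244.1.125.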

As a result, write requests routed through node-0's sentinel end up failing with a message like "Cannot write against read only replica", while write requests routed through node-1 and node-2 work fine.
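This matches what a Sentinel-aware client does: it resolves the master through a sentinel and then sends its writes to that address. Reproducing that by hand against node-0's stale sentinel shows the failure (rough sketch; the `redis` container name and the `$REDIS_PASSWORD` variable are assumptions):

```
# Resolve the master through node-0's (stale) sentinel ...
MASTER_IP=$(kubectl -n app exec redis-node-0 -c sentinel -- \
  env REDISCLI_AUTH="$REDIS_PASSWORD" \
  redis-cli -p 26379 sentinel get-master-addr-by-name mymaster | head -n 1)

# ... and attempt a write against it from inside the cluster. In the broken
# state MASTER_IP is 10.244.2.77, which has already been converted to a
# replica, so the SET is rejected with a read-only replica error.
kubectl -n app exec redis-node-0 -c redis -- \
  env REDISCLI_AUTH="$REDIS_PASSWORD" \
  redis-cli -h "$MASTER_IP" -p 6379 set probe-key probe-value
```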

Below are the sentinel container logs of these 3 Redis pods.

```
[app@app-node1 ]$ kubectl -n app -c sentinel logs redis-node-0
14:37:55.40 INFO ==> about to run the command: REDISCLI_AUTH=$PASS timeout 40 redis-cli -h redis.app.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
Could not connect to Redis at redis.app.svc.cluster.local:26379: Name or service not known
Could not connect to Redis at redis.app.svc.cluster.local:26379: Name or service not known
1:X 30 May 2024 14:38:05.972 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 30 May 2024 14:38:05.972 * Redis version=7.2.4, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 30 May 2024 14:38:05.972 * Configuration loaded
1:X 30 May 2024 14:38:05.973 * monotonic clock: POSIX clock_gettime
1:X 30 May 2024 14:38:05.974 * Running mode=sentinel, port=26379.
1:X 30 May 2024 14:38:05.974 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:X 30 May 2024 14:38:05.975 * Sentinel ID is 2a09ba7abbb41ee71e79087310d75f9809c3c815
1:X 30 May 2024 14:38:05.975 # +monitor master mymaster 10.244.2.77 6379 quorum 2
1:X 30 May 2024 14:38:26.407 * +sentinel sentinel 33535e4e17bf8f9f9ff9ce8f9ddf609e558ff4f2 10.244.1.125 26379 @ mymaster 10.244.2.77 6379
1:X 30 May 2024 14:38:26.409 * Sentinel new configuration saved on disk
1:X 30 May 2024 14:38:42.070 * +sentinel sentinel 9fe32540b27937ed9f341b0f610a0d8df405bb63 10.244.0.61 26379 @ mymaster 10.244.2.77 6379
1:X 30 May 2024 14:38:42.074 * Sentinel new configuration saved on disk
1:X 30 May 2024 14:39:16.082 # +sdown master mymaster 10.244.2.77 6379
```

```
[app@app-node1 ]$ kubectl -n app -c sentinel logs redis-node-1
14:38:13.64 INFO ==> about to run the command: REDISCLI_AUTH=$PASS timeout 40 redis-cli -h redis.app.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
Could not connect to Redis at redis.app.svc.cluster.local:26379: Name or service not known
Could not connect to Redis at redis.app.svc.cluster.local:26379: Name or service not known
1:X 30 May 2024 14:38:24.352 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 30 May 2024 14:38:24.352 * Redis version=7.2.4, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 30 May 2024 14:38:24.352 * Configuration loaded
1:X 30 May 2024 14:38:24.352 * monotonic clock: POSIX clock_gettime
1:X 30 May 2024 14:38:24.353 * Running mode=sentinel, port=26379.
1:X 30 May 2024 14:38:24.426 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:X 30 May 2024 14:38:24.427 * Sentinel ID is 33535e4e17bf8f9f9ff9ce8f9ddf609e558ff4f2
1:X 30 May 2024 14:38:24.427 # +monitor master mymaster 10.244.1.125 6379 quorum 2
1:X 30 May 2024 14:38:34.482 * +convert-to-slave slave 10.244.2.77:6379 10.244.2.77 6379 @ mymaster 10.244.1.125 6379
1:X 30 May 2024 14:38:42.035 * +sentinel sentinel 9fe32540b27937ed9f341b0f610a0d8df405bb63 10.244.0.61 26379 @ mymaster 10.244.1.125 6379
1:X 30 May 2024 14:38:42.037 * Sentinel new configuration saved on disk
1:X 30 May 2024 14:38:54.561 * +slave slave 10.244.0.61:6379 10.244.0.61 6379 @ mymaster 10.244.1.125 6379
1:X 30 May 2024 14:38:54.565 * Sentinel new configuration saved on disk
```

```
[app@app-node1 ]$ kubectl -n app -c sentinel logs redis-node-2
14:38:34.19 INFO ==> about to run the command: REDISCLI_AUTH=$PASS timeout 40 redis-cli -h redis.app.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
14:38:34.29 INFO ==> printing REDIS_SENTINEL_INFO=(10.244.1.125,6379)
1:X 30 May 2024 14:38:40.001 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 30 May 2024 14:38:40.001 * Redis version=7.2.4, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 30 May 2024 14:38:40.001 * Configuration loaded
1:X 30 May 2024 14:38:40.002 * monotonic clock: POSIX clock_gettime
1:X 30 May 2024 14:38:40.002 * Running mode=sentinel, port=26379.
1:X 30 May 2024 14:38:40.002 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:X 30 May 2024 14:38:40.003 * Sentinel ID is 9fe32540b27937ed9f341b0f610a0d8df405bb63
1:X 30 May 2024 14:38:40.003 # +monitor master mymaster 10.244.1.125 6379 quorum 2
1:X 30 May 2024 14:38:50.042 * +convert-to-slave slave 10.244.0.61:6379 10.244.0.61 6379 @ mymaster 10.244.1.125 6379
```

To reproduce

We don't have steps to reproduce this issue. It happens randomly.

Expected behavior

We understand that, with the quorum set to 2, a master switch takes place only when at least two sentinels mark the master as down. But in this case, where two sentinels agree on a common master and the third has marked its own master as down, shouldn't that third sentinel be notified in some way so that it can also consider the Redis instance at x.x.x.125 as the master?
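For what it's worth, when we hit this state we can also ask the odd sentinel out whether it still believes it could authorize a failover for the master it is tracking, using SENTINEL CKQUORUM (same assumed namespace, container name, and password variable as the sketches above):

```
# Check quorum/majority from node-0's sentinel for the master it still tracks
# (10.244.2.77). The reply shows how many sentinels it considers usable for
# authorizing a failover of that master.
kubectl -n app exec redis-node-0 -c sentinel -- \
  env REDISCLI_AUTH="$REDIS_PASSWORD" \
  redis-cli -p 26379 sentinel ckquorum mymaster
```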

Additional information

This is a critical issue for us because it requires redeploying the Redis services in order to move forward. Does anyone know of any Redis/Sentinel configuration that can work around this until it gets fixed? We would also like to understand what could cause the Redis master to go down during the initial deployment itself.
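One manual workaround we are considering, instead of a full redeploy, is to re-point the stale sentinel at the master the other two sentinels agree on. This is an untested sketch: it assumes 10.244.1.125 is the agreed master, that the master name is mymaster with quorum 2 (as in the logs above), and that $REDIS_PASSWORD holds the password used for both Redis and Sentinel.

```
# Untested sketch: drop node-0's stale master definition, monitor the address
# reported by the healthy sentinels, and restore the auth password for the
# newly monitored master.
kubectl -n app exec redis-node-0 -c sentinel -- \
  env REDISCLI_AUTH="$REDIS_PASSWORD" sh -c '
    redis-cli -p 26379 sentinel remove mymaster
    redis-cli -p 26379 sentinel monitor mymaster 10.244.1.125 6379 2
    redis-cli -p 26379 sentinel set mymaster auth-pass "$REDISCLI_AUTH"
  '
```

Confirmation on whether this is safe to run against a live deployment would be appreciated.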