Describe the bug
If sentinels are started with different monitored masters in their sentinel.conf files, they do not detect or report the inconsistency and instead silently remain in a split-brain state: each sentinel believes its own configured master is the true master. As a result, clients asking a sentinel for the master address get a wrong answer 2 out of 3 times and end up trying to write to a read-only replica.
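For illustration (a sketch only, not a capture from my cluster: it reuses the service names from the reproduction below, assumes test-redis-node-0 happens to hold the actual writable dataset, and omits authentication), a client that asks one of the other sentinels for the master and then writes to the returned address fails with Redis' standard read-only error:
# redis-cli -h test-redis-node-1.test-redis-headless.default.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
1) "test-redis-node-1.test-redis-headless.default.svc.cluster.local"
2) "6379"
# redis-cli -h test-redis-node-1.test-redis-headless.default.svc.cluster.local -p 6379 set somekey somevalue
(error) READONLY You can't write against a read only replica.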
To reproduce
- Prepare a 3-node k8s cluster (a single-node cluster should work as well)
- Install redis from bitnami chart:
helm install --set sentinel.enabled=true,replica.persistence.enabled=false,replica.podManagementPolicy=Parallel test-redis oci://registry-1.docker.io/bitnamicharts/redis
- Due to the Parallel podManagementPolicy, this creates a 3-node setup where each sentinel is configured with a different master in its sentinel.conf:
# kubectl exec -it test-redis-node-0 -c sentinel -- cat /opt/bitnami/redis-sentinel/etc/sentinel.conf
sentinel monitor mymaster test-redis-node-0.test-redis-headless.default.svc.cluster.local 6379 2
...
# kubectl exec -it test-redis-node-1 -c sentinel -- cat /opt/bitnami/redis-sentinel/etc/sentinel.conf
sentinel monitor mymaster test-redis-node-1.test-redis-headless.default.svc.cluster.local 6379 2
...
# kubectl exec -it test-redis-node-2 -c sentinel -- cat /opt/bitnami/redis-sentinel/etc/sentinel.conf
sentinel monitor mymaster test-redis-node-2.test-redis-headless.default.svc.cluster.local 6379 2
...
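As a quick check (commands are a sketch; authentication flags omitted for brevity), querying each pod's sentinel for the master address returns whatever its own conf says, i.e. three different answers:
# kubectl exec -it test-redis-node-0 -c sentinel -- redis-cli -p 26379 sentinel get-master-addr-by-name mymaster
# kubectl exec -it test-redis-node-1 -c sentinel -- redis-cli -p 26379 sentinel get-master-addr-by-name mymaster
# kubectl exec -it test-redis-node-2 -c sentinel -- redis-cli -p 26379 sentinel get-master-addr-by-name mymaster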
Expected behavior
Sentinels should detect the inconsistency and either trigger a failover so they converge on a single master, or at least flag that something is wrong. Instead, the split-brain persists indefinitely.
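A thought, not something I have verified in this exact state: the CKQUORUM check asks a sentinel whether it can currently reach the quorum and majority required to authorize a failover, so it might be one existing hook for surfacing this kind of misconfiguration; it has to be invoked manually, though, and nothing reports the problem on its own:
# kubectl exec -it test-redis-node-0 -c sentinel -- redis-cli -p 26379 sentinel ckquorum mymaster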
Additional information
- While I hit this using the bitnami chart, I think it really boils down to sentinel itself. I have not put together a minimal setup without helm/k8s, but I believe the same issue will occur with plain redis/sentinel (see the untested sketch after the logs below).
- Chart uses redis 7.0.11
- There appears to be nothing in the sentinel logs that would hint that the sentinels have detected a problem:
# kubectl logs test-redis-node-0 -c sentinel
14:20:00.73 INFO ==> about to run the command: REDISCLI_AUTH=$REDIS_PASSWORD timeout 220 redis-cli -h test-redis.default.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
Could not connect to Redis at test-redis.default.svc.cluster.local:26379: Temporary failure in name resolution
1:X 06 Jun 2023 14:20:30.941 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 06 Jun 2023 14:20:30.941 # Redis version=7.0.11, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 06 Jun 2023 14:20:30.941 # Configuration loaded
1:X 06 Jun 2023 14:20:30.941 * monotonic clock: POSIX clock_gettime
1:X 06 Jun 2023 14:20:30.945 * Running mode=sentinel, port=26379.
1:X 06 Jun 2023 14:20:30.945 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:X 06 Jun 2023 14:20:30.946 # Sentinel ID is b4b48aa14e5e759cb7492b30af4f5c2a992e6bf2
1:X 06 Jun 2023 14:20:30.946 # +monitor master mymaster test-redis-node-0.test-redis-headless.default.svc.cluster.local 6379 quorum 2
# kubectl logs test-redis-node-1 -c sentinel
14:19:50.00 INFO ==> about to run the command: REDISCLI_AUTH=$REDIS_PASSWORD timeout 220 redis-cli -h test-redis.default.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
Could not connect to Redis at test-redis.default.svc.cluster.local:26379: Temporary failure in name resolution
1:X 06 Jun 2023 14:20:20.277 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 06 Jun 2023 14:20:20.278 # Redis version=7.0.11, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 06 Jun 2023 14:20:20.278 # Configuration loaded
1:X 06 Jun 2023 14:20:20.278 * monotonic clock: POSIX clock_gettime
1:X 06 Jun 2023 14:20:20.282 * Running mode=sentinel, port=26379.
1:X 06 Jun 2023 14:20:20.282 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:X 06 Jun 2023 14:20:20.283 # Sentinel ID is 93d594182506a64e9c0fb3e893ec67dbd7d3255d
1:X 06 Jun 2023 14:20:20.283 # +monitor master mymaster test-redis-node-1.test-redis-headless.default.svc.cluster.local 6379 quorum 2
# kubectl logs test-redis-node-2 -c sentinel
14:19:39.13 INFO ==> about to run the command: REDISCLI_AUTH=$REDIS_PASSWORD timeout 220 redis-cli -h test-redis.default.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
Could not connect to Redis at test-redis.default.svc.cluster.local:26379: Temporary failure in name resolution
1:X 06 Jun 2023 14:20:09.427 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 06 Jun 2023 14:20:09.427 # Redis version=7.0.11, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 06 Jun 2023 14:20:09.427 # Configuration loaded
1:X 06 Jun 2023 14:20:09.428 * monotonic clock: POSIX clock_gettime
1:X 06 Jun 2023 14:20:09.432 * Running mode=sentinel, port=26379.
1:X 06 Jun 2023 14:20:09.432 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:X 06 Jun 2023 14:20:09.437 # Sentinel ID is 362c939b89efbabc09ba1d11a50146bccd5614d9
1:X 06 Jun 2023 14:20:09.437 # +monitor master mymaster test-redis-node-2.test-redis-headless.default.svc.cluster.local 6379 quorum 2
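For reference, here is an untested sketch of what I believe would be an equivalent minimal reproduction without helm/k8s (all ports, file names and the replication layout are made up for illustration): start one master with two replicas locally, then start three sentinels whose sentinel.conf each names a different node as the master, mirroring the per-pod configs above.
# one actual master on 6379, two replicas on 6380/6381
redis-server --port 6379 --daemonize yes
redis-server --port 6380 --replicaof 127.0.0.1 6379 --daemonize yes
redis-server --port 6381 --replicaof 127.0.0.1 6379 --daemonize yes
# sentinel-0.conf
port 26379
sentinel monitor mymaster 127.0.0.1 6379 2
# sentinel-1.conf
port 26380
sentinel monitor mymaster 127.0.0.1 6380 2
# sentinel-2.conf
port 26381
sentinel monitor mymaster 127.0.0.1 6381 2
# start each sentinel (in separate terminals, or add "daemonize yes" to the conf files)
redis-sentinel sentinel-0.conf
redis-sentinel sentinel-1.conf
redis-sentinel sentinel-2.conf
If this reproduces the same behavior, each sentinel's "sentinel get-master-addr-by-name mymaster" should keep returning the node from its own conf, just like in the k8s setup.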
Comment From: shadjiiski
Related issue that I logged against the chart for providing the inconsistent configuration: https://github.com/bitnami/charts/issues/17047. However, I do believe there is room for improvement in sentinel itself.
Comment From: palfrey
I've just hit this without k8s, with a plain server/sentinel setup.