Redis version: 4.0.1

We are running 1 master with 1 slave and 4 sentinels on Kubernetes.

Our problem is that sometimes one of the sentinels randomly starts considering the Redis instance down even though it is not: the instance is fully responsive, and the other sentinels still see it and are able to perform failovers.

The output from the sentinel that considers the machine down contains the same IP/port combination as the output from a healthy sentinel. The machine that appears down is reachable from the pod/node where the affected sentinel runs. Starting a new sentinel with the original configuration on the same pod/node works correctly.
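
This is how I verify reachability from the affected pod (a sketch; the pod name redis-sentinel-2 is a placeholder for whichever pod hosts the affected sentinel):

    # placeholder pod name; slave IP/port taken from the sentinel output below
    kubectl exec -it redis-sentinel-2 -- redis-cli -h 10.4.20.88 -p 16379 PING
    # replies PONG, so plain Redis connectivity from that pod is fine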

Example output of SENTINEL slaves from the affected sentinel:

1)  1) "name"
    2) "10.4.20.88:16379"
    3) "ip"
    4) "10.4.20.88"
    5) "port"
    6) "16379"
    7) "runid"
    8) "8d284ec83d737a6455a3d2b4dd3ab22ab8c88462"
    9) "flags"
   10) "slave"
   11) "link-pending-commands"
   12) "192"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "69416299"
   19) "last-ping-reply"
   20) "69416299"
   21) "down-after-milliseconds"
   22) "3000"
   23) "info-refresh"
   24) "66714517"
   25) "role-reported"
   26) "slave"
   27) "role-reported-time"
   28) "70330640"
   29) "master-link-down-time"
   30) "1505763519000"
   31) "master-link-status"
   32) "err"
   33) "master-host"
   34) "10.4.19.63"
   35) "master-port"
   36) "16379"
   37) "slave-priority"
   38) "100"
   39) "slave-repl-offset"
   40) "1"

The output from another sentinel for the same cluster:

1)  1) "name"
    2) "10.4.20.88:16379"
    3) "ip"
    4) "10.4.20.88"
    5) "port"
    6) "16379"
    7) "runid"
    8) "8d284ec83d737a6455a3d2b4dd3ab22ab8c88462"
    9) "flags"
   10) "slave"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "626"
   19) "last-ping-reply"
   20) "626"
   21) "down-after-milliseconds"
   22) "3000"
   23) "info-refresh"
   24) "1393"
   25) "role-reported"
   26) "slave"
   27) "role-reported-time"
   28) "70353077"
   29) "master-link-down-time"
   30) "0"
   31) "master-link-status"
   32) "ok"
   33) "master-host"
   34) "10.4.19.63"
   35) "master-port"
   36) "16379"
   37) "slave-priority"
   38) "100"
   39) "slave-repl-offset"
   40) "3544531620430"

As you can see, both the slave IP and the master IP match on both sentinels. The telling difference is the link state: the affected sentinel shows link-pending-commands of 192 (versus 0), a last-ok-ping-reply of 69416299 ms (roughly 19 hours, versus 626 ms), and an info-refresh that is similarly hours old, so its connection to the instance appears stuck rather than the instance actually being down.
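
A quick way to compare these fields across all sentinels (a sketch; the pod names and the master name mymaster are assumptions, adjust to your setup):

    # assumed names: master "mymaster", sentinel pods redis-sentinel-{0..3}, port 26379
    for pod in redis-sentinel-0 redis-sentinel-1 redis-sentinel-2 redis-sentinel-3; do
      echo "== $pod =="
      kubectl exec "$pod" -- redis-cli -p 26379 SENTINEL slaves mymaster \
        | grep -A1 -E 'link-pending-commands|last-ok-ping-reply|master-link-status'
    done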

The sentinel log does not contain any relevant information. The only fix I have found is restarting the affected sentinel, which is not really a solution.
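
I have not yet verified whether making the sentinel drop and rediscover its state, instead of fully restarting it, clears the stuck link (mymaster is again a placeholder for the monitored master name):

    # ask the affected sentinel to forget and re-discover everything for this master
    redis-cli -p 26379 SENTINEL RESET mymaster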

There is no pattern: this happens completely randomly, and only one sentinel is affected at a time. Generally this doesn't cause any issues, but if we wait long enough, 2-3 sentinels become affected, which causes random failovers.

Any idea what could be the problem or any pointers on how to debug it?

My guess would be that the issue is somehow related to the containerized environment: sentinel is caching a connection in a way that makes it think the instance is down even though it isn't, and it is not retrying to reconnect.
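
To catch the moment a sentinel flips, I am going to poll each one for the s_down flag, along these lines (pod names and master name are placeholders again):

    # log whenever any sentinel reports a slave as subjectively down
    while true; do
      for pod in redis-sentinel-0 redis-sentinel-1 redis-sentinel-2 redis-sentinel-3; do
        if kubectl exec "$pod" -- redis-cli -p 26379 SENTINEL slaves mymaster \
             | grep -q s_down; then
          echo "$(date) $pod reports a slave as s_down"
        fi
      done
      sleep 5
    done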