Describe the bug
Sentinel failover does not happen when a node goes down.
Environment details: OCP (OpenShift) Kubernetes multi-node environment.
To reproduce
1) Deploy Redis with Sentinel in High Availability mode (cluster size = 3).
2) Shut down the master node.
3) When the node goes down, failover should start, but instead I always get "45:X 15 May 2023 11:41:47.224 # Failed to resolve hostname".
Expected behavior
Failover should start within the configured time and the replica (slave) nodes should be able to serve requests.
Additional information
1) When the node goes down, failover does not happen and I always get the resolve-hostname error.
2) When the same node comes back up, the other two instances are able to resolve the hostname and failover happens.
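For reference, here is a minimal sentinel.conf sketch with the hostname-related directives that govern this scenario; the master name, hostname, port and timeouts are illustrative placeholders, not the configuration actually attached to this issue:

```conf
# Illustrative sketch only; names, ports and timeouts are placeholders.
port 26379

# Allow Sentinel to accept and announce DNS names instead of raw IPs
# (needed when pods are addressed by Kubernetes service hostnames).
sentinel resolve-hostnames yes
sentinel announce-hostnames yes

# Monitor the master by hostname; quorum of 2 for a 3-node deployment.
sentinel monitor mymaster redis-master.example.svc.cluster.local 6379 2

# Consider the master down after 10s without a valid reply, then attempt failover.
sentinel down-after-milliseconds mymaster 10000
sentinel failover-timeout mymaster 60000
```

With `sentinel resolve-hostnames yes`, Sentinel has to resolve the configured peer hostnames through DNS, which is presumably where the "Failed to resolve hostname" message comes from when DNS cannot answer for a peer.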
Comment From: moticless
Hi @rimverma, I am sorry, but a lot of details are missing. What are the configuration files? What is the scenario? We need recorded logs of the problem. Which version are you using? Did you try to isolate the problematic version? Are you using any other repo to deploy to k8s? If so, maybe try to get help there first. Thank you.
Comment From: rimverma
Hi @moticless, sorry for the delayed response.
The Redis version I am using is 6.2.12, but I also tested the scenario with version 7.0.11 and the issue persists.
Scenario: whenever the master node goes down, the other replica (slave) nodes are unable to resolve its hostname and cannot serve requests; failover also does not start when the node goes down.
I am attaching the configuration files and the log for your reference.
statefulset-ricplt-dbaas-cluster-0-server-1_sentinel_shutdownserver2.log
sentinel.conf:-
redis.conf:-
Comment From: rimverma
Hi team, gentle reminder: I have attached the logs and config files, kindly check. Regards, Rimjhim
Comment From: moticless
Hi,
It looks like a basic networking issue. Please verify that the sentinel that fails to resolve the hostname can reach statefulset-ricplt-dbaas-cluster-0-server-2.service-ricplt-dbaas-tcp-cluster-0.ricinfra.svc.cluster.local and that DNS resolves it to the corresponding IP.
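A minimal sketch of how that check could look, run against the pod hosting the failing sentinel; the pod name, namespace and port are assumptions based on the names appearing in this thread, and the tools used may not be present in a minimal image:

```sh
# Assumed names; adjust to your deployment.
FQDN=statefulset-ricplt-dbaas-cluster-0-server-2.service-ricplt-dbaas-tcp-cluster-0.ricinfra.svc.cluster.local
POD=statefulset-ricplt-dbaas-cluster-0-server-1

# Does cluster DNS resolve the name to an IP from inside the sentinel's pod?
kubectl exec -n ricinfra "$POD" -- getent hosts "$FQDN"

# Is the Redis instance behind that name actually reachable?
kubectl exec -n ricinfra "$POD" -- redis-cli -h "$FQDN" -p 6379 ping
```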
Comment From: rimverma
Hi, this pod (server-2) is the one that went down. A few seconds after a pod goes down, failover should start, but in my case failover does not start; the sentinel keeps checking the hostname of the pod (statefulset-ricplt-dbaas-cluster-0-server-2.service-ricplt-dbaas-tcp-cluster-0.ricinfra.svc.cluster.local), which is already down. Please check this.
Comment From: moticless
According to your logs, the "resolve hostname" error happens around 20 seconds after the sentinel starts, before any switchover.
The master is dbaasmaster-cluster-0 (statefulset-ricplt-dbaas-cluster-0-server-0.service-ricplt-dbaas-tcp-cluster-0.ricinfra.svc.cluster.local) and is responsive at start. But from the start, the replica statefulset-ricplt-dbaas-cluster-0-server-2.service-ricplt-dbaas-tcp-cluster-0.ricinfra.svc.cluster.local is not responsive and you get the "resolve hostname" error. Fix that before you carry on.
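One way to confirm this from the sentinel's own point of view is to query it directly; the master name is taken from the comment above, while the Sentinel port 26379 is the default and may differ in this deployment:

```sh
# Ask the running sentinel what it currently knows about the monitored master
# and its replicas.
redis-cli -p 26379 sentinel master dbaasmaster-cluster-0
redis-cli -p 26379 sentinel replicas dbaasmaster-cluster-0

# An "s_down" or "disconnected" flag on the server-2 replica entry would confirm
# that Sentinel already considers it unreachable before any failover is attempted.
```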
Comment From: rimverma
Hi, yes, the resolve-hostname error is there for server-2, but the log shows that the services were fine at that time. When the master (server-0) pod went down, we started getting the error below for server-0: "45:X 03 May 2023 12:12:20.989 # Failed to resolve hostname 'statefulset-ricplt-dbaas-cluster-0-server-0.service-ricplt-dbaas-tcp-cluster-0.ricinfra.svc.cluster.local'". This error continues until we start the pod again, but normally the failover scenario should start a few seconds after the pod goes down.
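A short sketch of checks that would show whether the surviving sentinels ever agree the master is down and attempt a failover; the master name and port are the same assumptions as above:

```sh
# On one of the surviving sentinels: which address is currently reported as
# the master? If it never changes after server-0 goes down, no failover completed.
redis-cli -p 26379 sentinel get-master-addr-by-name dbaasmaster-cluster-0

# Check whether enough sentinels are reachable to authorize a failover at all.
redis-cli -p 26379 sentinel ckquorum dbaasmaster-cluster-0
```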
Comment From: rimverma
Hi @moticless, you can reproduce this in a local environment as well; below are the steps to reproduce.
Environment details: OCP (OpenShift) Kubernetes multi-node environment.
To reproduce:
1) Deploy Redis with Sentinel in High Availability mode (cluster size = 3).
2) Shut down the cluster node where the dbaas master is present (one way to simulate this is sketched below).
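A sketch of how the node shutdown in step 2 could be simulated on a Kubernetes/OCP cluster; the node name is a placeholder:

```sh
# Placeholder: pick the worker node that hosts the current Redis master pod.
NODE=worker-node-2

# Stop scheduling onto the node and evict its pods, simulating a node outage.
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data

# Then watch the sentinel logs of the surviving replicas for the expected
# "+sdown" and "+switch-master" events (or the "Failed to resolve hostname" error).
```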
Comment From: rimverma
Hi team, can you please give an update on this? Regards, Rimjhim
Comment From: moticless
The failover scenario that you are describing is a rather basic one and is regularly tested. It is most probably due to the integration with Kubernetes, which involves network configuration and pod persistence, and that is beyond the scope of this repo.
Please put in an effort to further isolate the issue, and if you verify that it is a Redis issue alone, please open a new issue with the details unrelated to Kubernetes.
Comment From: zhaozhiguang
I have also encountered a similar problem. I am using the Docker bridge network mode, and the error occurred when I stopped the Docker container running the master.
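For what it's worth, a minimal sketch of how the same scenario can be exercised with Docker; the container names are placeholders for a 1-master / 2-replica / 3-sentinel setup on the default bridge network:

```sh
# Placeholder container names.
docker stop redis-master

# Watch one sentinel's log for the same "Failed to resolve hostname" error
# and for the "+sdown" / "+switch-master" failover events.
docker logs -f sentinel-1
```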