Hi,
We are facing an issue where one of our Redis nodes marks itself as failed, specifically with this message: "Cluster state changed: fail", and immediately afterwards an automatic failover is performed.
Setup where this happened:
- Redis version: 6.0.5
- Topology: 3 masters and 2 replicas
- Redis config: bind 0.0.0.0
- Environment: Azure, Linux

We have the following network interfaces:
1. Physical one (bonded transparently to 2)
2. Synthetic one
3. Loopback interface
My question: whenever we remove one of the network interfaces (the physical one) for maintenance, we observe the node marking itself as failed and a failover happening.
Could it be that, since we are binding to all interfaces (0.0.0.0), the Redis internals keep some state about the NICs, so that when we remove one, Redis thinks it has lost network connectivity and marks itself as failed?
I was also looking into the code to understand under which circumstances the cluster marks itself as failed, specifically when clusterUpdateState gets to this point: https://github.com/redis/redis/blob/380f6048e0bbc762f12fa50b57b73cf29049f967/src/cluster.c#L4001
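In case it helps frame the question, here is how I currently read that code path, as a simplified sketch (my own names and types, not the actual Redis structures, so please correct me if I've misread it):

```c
/* Simplified paraphrase of the conditions under which clusterUpdateState()
 * switches this node's view of the cluster to FAIL. This is a sketch for
 * discussion only; names and types are made up, not Redis code. */
#include <stdbool.h>
#include <stddef.h>

#define CLUSTER_SLOTS 16384

struct node_view {
    bool is_master;
    bool flagged_fail;   /* node currently flagged FAIL/PFAIL */
    int  slots_served;   /* number of hash slots it owns */
};

/* Returns true if this node should consider the cluster failed. */
bool cluster_should_be_fail(const struct node_view *nodes, size_t n,
                            bool require_full_coverage,
                            int covered_slots)
{
    size_t reachable_masters = 0, total_masters = 0;

    /* 1. With cluster-require-full-coverage enabled, any uncovered slot
     *    (e.g. one owned by a node flagged FAIL) fails the cluster. */
    if (require_full_coverage && covered_slots < CLUSTER_SLOTS)
        return true;

    /* 2. The node must still see a majority of slot-serving masters;
     *    otherwise it may be on the minority side of a partition. */
    for (size_t i = 0; i < n; i++) {
        if (!nodes[i].is_master || nodes[i].slots_served == 0) continue;
        total_masters++;
        if (!nodes[i].flagged_fail) reachable_masters++;
    }
    return reachable_masters < (total_masters / 2) + 1;
}
```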
Really appreciate any clarification :)
Cheers, Laura
Comment From: oranagra
@laurauzcategui you mean that you have two physical NICs bonded into one, right?
By "bonded transparently to 2" you didn't mean it is bonded together with the synthetic one?
To the best of my knowledge Redis doesn't listen to any networking configuration changes; it just uses the POSIX socket API, so I'm guessing that somehow removing a NIC from the bonding broke the connection.
I suggest you look into the Linux configuration and its logs to figure it out; this doesn't seem to be a Redis-related issue. Cluster nodes are marked as failed when they really can't communicate.
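To illustrate what "just uses the POSIX socket API" means here, a minimal sketch (not Redis code): from user space a connected socket is only a file descriptor with no tie to a particular NIC, so a vanished interface only becomes visible as an error or timeout on a later read/write.

```c
/* Minimal sketch (not Redis code): if the underlying NIC disappears, the
 * application notices nothing until a later write()/read() fails or times
 * out with an errno such as ETIMEDOUT, ECONNRESET, ENETUNREACH or
 * EHOSTUNREACH. send_ping() is a hypothetical helper for illustration. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int send_ping(int fd) {
    const char ping[] = "PING";
    ssize_t n = write(fd, ping, sizeof(ping));
    if (n == -1) {
        /* This is the only point where the lost NIC becomes visible. */
        fprintf(stderr, "write failed: %s (errno=%d)\n",
                strerror(errno), errno);
        return -1;
    }
    return 0;
}
```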
Comment From: laurauzcategui
Hi @oranagra ,
Thanks for replying.
To answer your questions: we are bonding one of the NICs to the other one, for example eth1 to eth0. This bonding is transparent, and applications should see them as a single interface.
My suspicion is along the lines of what you just mentioned: our Redis connection is being dropped, and we might need to dig into the Linux configuration to solve this issue.
The thing is, even though the NIC is removed for roughly 2 minutes, we still have connectivity on the host, and we were trying to figure out whether this was visible at the Redis level.
I'll follow up here once we have a solution in place, or if I have any other questions.
Comment From: oranagra
So when the bonding is broken but one NIC is still there, the interface visible to user-space applications is still up, and connectivity seems to be working?
Maybe there's a short (fraction of a second) period in which the interface is completely down, and in this case Redis receives some errno value it doesn't expect, causing a failure that isn't normally visible in other cases.
I assume you didn't find anything interesting in the log file; maybe strace will help here?
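As an illustration (adjust the PID and output path for your setup), something like `strace -f -tt -e trace=network -p <redis-server-pid> -o /tmp/redis-cluster-bus.trace` should capture the network syscalls and the errno values they return around the time of the failure.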
Comment From: laurauzcategui
Re: So when the bonding is broken but one NIC is still there, the interface visible to user-space applications is still up, and connectivity seems to be working? - Yes, connectivity is still up. I could see packets still flowing through using bwm-ng.
Re: Maybe in this case Redis receives some errno value it doesn't expect, causing a failure that isn't normally visible in other cases. - Got it. It would be interesting to see if I can somehow capture one of those errno values, which would help validate the hypothesis that the connection is being dropped.
Let me also check with strace and see what comes up.
Comment From: laurauzcategui
I can close this issue, as I found out the disconnections weren't due to the bind config: Hyper-V restarts take around 30 seconds, and in the meantime the cluster node timeout kicks in, since it's set to 5 seconds.
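For anyone hitting the same symptom: the relevant knob is cluster-node-timeout in redis.conf. A sketch of the kind of adjustment that would ride out a ~30-second Hyper-V pause (the value below is illustrative, not necessarily what we ended up deploying):

```
# redis.conf (illustrative value): keep a ~30 s host pause from triggering
# a failover; trade-off is that genuine failures are detected more slowly.
cluster-node-timeout 35000
```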
Thanks for the help