Redis: one node is failing but the cluster still reports healthy

We hit a problem recently. When one node runs into trouble and fails to serve requests, the Redis cluster still reports that node as healthy. Both "CLUSTER INFO" and "CLUSTER NODES" show the cluster and all of its nodes as healthy. As a result, our service calls failed when setting or getting values from Redis. A node failure may be expected, but I would expect the node to be reported as unhealthy and its follower node to be promoted. Is there a known issue about this?

More detail: the Redis cluster is deployed in Kubernetes. We mitigated the issue by restarting the pod.

We have C# and Go clients. For the C# client, we use the "StackExchange.Redis" SDK. The error we get during set or get looks like "StackExchange.Redis.RedisServerException: MOVED 7197 10.0.209.32:6379 ...".

For the Go client, we use radix and see the error "cluster action redirected too many times" when getting or setting a value.

Comment From: madolson

Both of those client errors indicate that the topology being used to route requests was inconsistent with the actual topology. Since it happened on multiple clients, it's unlikely to be a client issue; more likely the cluster was not in a correct state. Do you have any additional information about the cluster state at that time, such as logs? Without those it will be hard to root-cause the failure mode here.

Comment From: Just4Ease

I experienced the same issue, both with an online hosted Redis and with a self-managed cluster. (Screenshot attached.)

Comment From: PingXie

I wonder if this is related to incomplete slot migration? @huanwu, @Just4Ease, were there any cluster management operations performed prior to the failure/error? As @madolson mentioned, if you could share your redis.conf and the Redis logs from a few nodes, it would allow us to better understand/diagnose the issues.

PS: the cluster state reported by CLUSTER INFO and CLUSTER NODES is the view from the reporting node's perspective. Given the eventually consistent nature of the Redis cluster protocol, you can get different answers from different nodes if cluster state changes have not yet propagated to the entire cluster.
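
To illustrate the per-node perspective, here is a minimal Python sketch that compares CLUSTER NODES output collected from more than one node and lists the node IDs that the reporters disagree about. The two sample outputs, the short node IDs (`aaa`, `bbb`), the `10.0.0.x` addresses, and the helper names `parse_cluster_nodes` and `disagreements` are all made up for illustration; real node IDs are 40-character hex strings, and you would paste in the text returned by running CLUSTER NODES against each node.

```python
def parse_cluster_nodes(text):
    """Parse CLUSTER NODES text into {node_id: {"addr", "failing", "slots"}}."""
    nodes = {}
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # not a well-formed node line
        node_id, addr = fields[0], fields[1]
        flags = set(fields[2].split(","))
        slots = set()
        # Slot tokens start after the link-state field (index 7).
        for token in fields[8:]:
            if token.startswith("["):
                continue  # slot in a migrating/importing state, skipped here
            start, _, end = token.partition("-")
            slots.update(range(int(start), int(end or start) + 1))
        nodes[node_id] = {
            "addr": addr.split("@")[0],
            # "fail?" means suspected by this reporter, "fail" means confirmed.
            "failing": bool(flags & {"fail", "fail?"}),
            "slots": frozenset(slots),
        }
    return nodes


def disagreements(views):
    """Return node IDs whose failure state or slots differ between reporters."""
    seen = {}
    for text in views.values():
        for node_id, info in parse_cluster_nodes(text).items():
            seen.setdefault(node_id, set()).add((info["failing"], info["slots"]))
    return sorted(node_id for node_id, states in seen.items() if len(states) > 1)


# Hypothetical output captured from two different nodes of the same cluster.
VIEWS = {
    "node-a": """
        aaa 10.0.0.1:6379@16379 myself,master - 0 0 1 connected 0-8191
        bbb 10.0.0.2:6379@16379 master - 0 1658000000000 2 connected 8192-16383
    """,
    "node-b": """
        aaa 10.0.0.1:6379@16379 master,fail? - 1657999990000 1657999980000 1 connected 0-8191
        bbb 10.0.0.2:6379@16379 myself,master - 0 0 2 connected 8192-16383
    """,
}

if __name__ == "__main__":
    print(disagreements(VIEWS))  # prints ['aaa']
```

In this made-up sample, node `aaa` considers itself healthy while node `bbb` already suspects it of failing, which is the kind of divergence that a single CLUSTER NODES call against one node would not reveal.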

Comment From: huanwu

After some investigation, this may be related to our k8s deployment. We found that the Redis cluster Service IP is the same as the IP of one of the pods (the lead node). When a request was sent to this pod, it actually went to the Service, which may route it to any of the pods. The pod that received it did not hold the key, so it replied with a MOVED response pointing at the target pod's IP. The client followed the MOVED and sent the request to the problematic pod's IP again. This repeated in a loop until it reached the client-side redirect limit, at which point the radix library throws "cluster action redirected too many times".
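
The loop described above can be reproduced with a small simulation. This is only a sketch of the suspected mechanism, not of how radix or Kubernetes are implemented: the pod names, the `10.0.1.x` addresses, the `max_redirects` default, and the `pick_backend` callback that stands in for the Service's load balancing are all assumptions made for illustration. Only the `10.0.209.32` address is taken from the MOVED error earlier in the thread.

```python
import itertools

# Hypothetical layout: pod-0 owns the slot, and its IP collides with the Service IP.
SERVICE_IP = "10.0.209.32"
POD_IP = {"pod-0": "10.0.209.32", "pod-1": "10.0.1.11", "pod-2": "10.0.1.12"}
SLOT_OWNER = "pod-0"


class TooManyRedirects(Exception):
    """Raised when the simulated client exhausts its redirect budget."""


def deliver(ip, pick_backend):
    """Return the pod that actually receives traffic sent to `ip`."""
    if ip == SERVICE_IP:
        # Traffic to the colliding IP is handled by the Service,
        # which may forward it to any pod.
        return pick_backend()
    return next(pod for pod, pod_ip in POD_IP.items() if pod_ip == ip)


def request(start_ip, pick_backend, max_redirects=5):
    """Follow MOVED replies until the slot owner answers or the limit is hit."""
    ip = start_ip
    for hops in range(max_redirects + 1):
        pod = deliver(ip, pick_backend)
        if pod == SLOT_OWNER:
            return pod, hops
        # Any other pod replies "MOVED <slot> <owner ip>", and the client retries there.
        ip = POD_IP[SLOT_OWNER]
    raise TooManyRedirects("cluster action redirected too many times")


if __name__ == "__main__":
    # Worst case: the Service never happens to pick the real owner.
    unlucky = itertools.cycle(["pod-1", "pod-2"]).__next__
    try:
        request(POD_IP["pod-1"], unlucky)
    except TooManyRedirects as exc:
        print("failed:", exc)
```

Under these assumptions, a request succeeds only when the Service happens to forward it to the real slot owner, which would explain why the failures looked like a node problem while every node still reported the cluster as healthy.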