We recently had an incident where we were unable to connect to multiple masters in our cluster after they became unresponsive.
Connections from our code base timed out because no new connection could be established. We were also unable to SSH into the affected boxes during this period, essentially locking us out.
This has happened on multiple occasions, and each time CPU usage was around 20% and memory usage was also around 20%. The number of connected clients varied between events, ranging from 7k to 12k, well below what we would consider an alarming level. Commands per second were between 15k and 20k.
Connections that were already established continued to function normally. Among those existing connections were our metrics exporters, so they were still able to collect metrics on connections, CPU, etc.
Network in/out would slowly decline as existing connections died off; however, new clients could not connect at all, as if they were being refused by the server.
Also, because connections still existed between the masters and replicas, the cluster state showed as OK: the nodes could still communicate with each other even though no new connections could be made.
We ended up failing over to the replicas and rebooting the old masters so they rejoined as replicas. This solved the problem temporarily, until the next time the issue arose, which did not seem directly correlated with the number of connections or ops/sec, as those varied from event to event.
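For reference, the manual recovery was along these lines (a rough sketch; the host/port are placeholders and the exact invocation depends on the setup):

```
# On each replica whose master had stopped accepting new connections,
# promote the replica; CLUSTER FAILOVER is issued against the replica itself.
redis-cli -h <replica-host> -p <replica-port> CLUSTER FAILOVER

# The old master was then rebooted and rejoined the cluster as a replica.
```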
We are currently using Redis version 6.2.6.
The cluster nodes are running on AWS x2gd.medium instance types.
We also adjusted kernel settings such as SOMAXCONN, and the issue still occurs.
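For context, the backlog-related settings we looked at were roughly the following (a minimal sketch assuming Linux sysctl names and the stock redis.conf directive; the numbers are illustrative, not our production values):

```
# Kernel cap on the TCP accept queue; new connection attempts can stall or be
# dropped once a listener's accept queue fills up to this limit.
sysctl net.core.somaxconn
sudo sysctl -w net.core.somaxconn=65535   # persist via /etc/sysctl.d/ if kept

# redis.conf: Redis's own listen backlog. Redis logs a startup warning if this
# value exceeds net.core.somaxconn, since the kernel silently clamps it.
# tcp-backlog 65535
```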
Does anyone have any thoughts/recommendations on possible causes?
Comment From: oranagra
Seems like a system issue, not a Redis issue (considering that you also can't establish new SSH connections, and the existing Redis connections keep working). Personally, I don't know what this could be, but you should probably direct such questions elsewhere; I'm not sure where, though.
Comment From: markmcdowell
@Tbone542 we have what looks like the same or a similar issue. Do you have persistence on? We don't seem to see the issue on our non-persisting clusters.
Comment From: karock
@markmcdowell We (I work with Tbone542) use persistent connections via the phpredis driver/client. Years ago we didn't use persistent connections and had other problems and poor latency until we enabled them; persistent connections are the only way to go for us.
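To illustrate what we mean by persistent connections, here is a minimal phpredis sketch; the host, port, and timeout are placeholders, and our real setup goes through cluster-aware clients, so the exact calls differ:

```php
<?php
// Persistent connection: the underlying TCP socket is kept open and reused
// across requests handled by the same PHP worker process.
$redis = new Redis();
$redis->pconnect('redis-node.internal', 6379, 1.5);

// Non-persistent alternative: a fresh TCP connection per request, which is
// what we used years ago before switching.
// $redis->connect('redis-node.internal', 6379, 1.5);
```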
Comment From: karock
Some other info we noticed since posting this: our API autoscaling group (EC2 instances on AWS) was using autospotting to convert on-demand boxes to spot instances, and it seems that some of the extra startups/shutdowns from that were likely part of the root cause for us. We had to disable it for other reasons over the holidays, and the connection issues haven't happened again since. The slower instance scale-out/in driven by the 24-hour load cycle apparently isn't enough to cause connection issues.