Describe the bug
Suppose we have a cluster-mode-enabled Redis cluster with 2 shards, where each shard has just one node, the primary. Let's label the primary of shard 1 as P1 and the primary of shard 2 as P2.
Suppose there is one-way 100% packet loss from P1 to P2, essentially an asymmetric network partition. This could be caused by an issue in the underlying network, a bug in the Redis software, etc.
Now, from the perspective of P1, it has two TCP connections with P2 in steady state: one outbound connection it established toward P2, and one inbound connection it accepted from P2.
In the face of the aforementioned asymmetric network partition:
1. The outbound connection is associated with a send buffer that can grow without bound if data (e.g. PubSub messages) keeps being added but can't be sent. And because this memory consumption is not exempt from the maxmemory threshold, it will eventually result in a total write outage for clients in the form of -OOM errors.
2. Neither the outbound nor the inbound connection will be freed, because Redis has existing logic where it only frees the outbound connection if both the inbound and outbound connections cannot receive any data. That isn't the case under an asymmetric network partition: data can still be received from the inbound connection.
3. P1 won't mark P2 as fail?, because Redis has existing logic where it only marks a peer as fail? if both the inbound and outbound connections cannot receive any data from that peer.
This situation can only be resolved if TCP Keepalive eventually kicks in. Redis' default TCP Keepalive timeout is set to 300 seconds/5 mins (recently changed to 2*server.cluster_node_timeout, but only for inbound links), which means TCP Keepalive will kick in after 5 mins of radio silence on a TCP connection. TCP Keepalive will then probe 3 more times before it eventually closes the connection, which collectively takes another 5 mins. In total, a radio-silent cluster bus connection won't be closed for at least 10 mins (it could be longer if users set the TCP Keepalive timeout to a larger value). Moreover, TCP Keepalive won't kick in at all if the packet loss isn't 100%, or if the network partition is formed at the Redis software level (imagine a code bug where P1 doesn't correctly register read/write handlers for its outbound cluster bus link toward P2 even though the underlying TCP connection is perfectly healthy).
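To make that arithmetic concrete, below is a rough Linux-only sketch of how these keepalive knobs are typically configured on a socket (in the spirit of what Redis does for its connections, but not the actual Redis code); the idle time, probe interval, and probe count are illustrative assumptions:

```c
/* Illustrative sketch (not Redis source): how TCP keepalive parameters
 * translate into worst-case dead-peer detection time on Linux.
 * With idle = 300s, 3 probes, and a 100s probe interval, a silent
 * connection is only torn down after roughly 300 + 3*100 = 600s (~10 min). */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int set_keepalive(int fd, int idle_secs) {
    int yes = 1;
    int intvl = idle_secs / 3;   /* seconds between probes once idle expires */
    int cnt = 3;                 /* probes sent before the connection is closed */

    if (intvl == 0) intvl = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &yes, sizeof(yes)) == -1) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs, sizeof(idle_secs)) == -1) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) == -1) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) == -1) return -1;
    return 0;
}
```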
We've seen cases where a single cluster link's send buffer grew to 16GB within 10 mins under a steady stream of PubSub messages of size 1KB. The memory growth is not linear, because jemalloc does pre-allocation: as the currently used memory gets larger, the size of each pre-allocation becomes larger as well. Below is a graph that shows the memory growth curve of a single cluster link's send buffer:
(X axis is the number of jemalloc allocations over 10 mins, Y axis is the total memory size of a cluster bus link send buffer in bytes)
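As a toy illustration of why the allocated size jumps in bigger and bigger steps, here is a simplified doubling-growth model; this is not jemalloc's actual size-class behavior or Redis' real buffer code, just an assumption-laden sketch of the shape of the curve:

```c
/* Toy illustration only: append 1KB messages to a buffer whose capacity
 * doubles whenever it fills up. Each reallocation step is larger than the
 * last, so the *allocated* size jumps in growing increments even though the
 * payload arrives at a constant rate. */
#include <stdio.h>

int main(void) {
    size_t used = 0, cap = 4096;
    while (used < ((size_t)1 << 34)) {   /* stop at 16GB for illustration */
        used += 1024;                    /* one 1KB PubSub message queued */
        if (used > cap) {
            cap *= 2;                    /* pre-allocate a larger block */
            printf("realloc: used=%zu bytes, allocated=%zu bytes\n", used, cap);
        }
    }
    return 0;
}
```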
Expected behavior
There are a couple of ways to mitigate the aforementioned issues:
1. As a short-term fix, IMO we should detect the liveness of the inbound and outbound connections separately. If we haven't received anything on the outbound connection, we should free it, which will in turn release its send buffer. We shouldn't keep it around just because its inbound counterpart is alive. (See the sketch after this list.)
2. In the longer term, I propose we introduce a memory management mechanism for cluster bus link buffers, similar to how we manage the memory usage of clients:
a. Introduce cluster-link-buffer-limit, similar to how client-output-buffer-limit works for clients. Disconnect cluster bus links if they overrun this limit. (A rough sketch follows further below.)
b. Exclude memory usage of cluster links from maxmemory and manage it separately, similar to how we manage client output buffers.
c. Display memory consumption of cluster links in the INFO Memory section.
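For item 1, here is a minimal sketch of what a per-direction liveness check could look like. It uses simplified stand-in types; the field and function names (last_recv_time, free_cluster_link) and the 2 * cluster_node_timeout threshold are assumptions for illustration, not existing Redis code:

```c
/* Hypothetical sketch of proposal #1: track when each direction of the bus
 * connection last received data, and free a silent outbound link on its own,
 * even if the inbound link from the same peer is still healthy. */
#include <stddef.h>

typedef long long mstime_t;

typedef struct clusterLink {
    mstime_t last_recv_time;   /* hypothetical: last time data arrived on this link */
    size_t sndbuf_bytes;       /* bytes currently queued in this link's send buffer */
} clusterLink;

typedef struct clusterNode {
    clusterLink *outbound;     /* link we established toward the peer */
    clusterLink *inbound;      /* link the peer established toward us */
} clusterNode;

extern mstime_t cluster_node_timeout;
void free_cluster_link(clusterLink *link);   /* drops the link and its send buffer */

/* Called periodically, e.g. from the cluster cron. */
void check_outbound_link_liveness(clusterNode *node, mstime_t now) {
    clusterLink *out = node->outbound;
    if (out == NULL) return;

    /* Judge the outbound direction on its own, instead of requiring BOTH
     * directions to be silent as the current logic does. */
    if (now - out->last_recv_time > 2 * cluster_node_timeout) {
        free_cluster_link(out);      /* releases the possibly huge send buffer */
        node->outbound = NULL;       /* the cron will re-establish the connection */
    }
}
```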
If we can reach agreement on these proposals, I can create separate PRs for #1 and #2. (#2 will probably be a bigger one and generate more discussion, so I don't want to bundle them together.)
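For item 2a, a similar hypothetical sketch of enforcing a per-link buffer limit; the config name cluster-link-buffer-limit and the helper functions below are assumptions taken from the proposal, not an existing Redis API:

```c
/* Hypothetical sketch of proposal #2a: enforce a per-link send buffer limit,
 * in the spirit of client-output-buffer-limit for clients. */
#include <stddef.h>

typedef struct clusterLink clusterLink;

extern size_t cluster_link_buffer_limit;              /* 0 = unlimited */
size_t cluster_link_sndbuf_bytes(clusterLink *link);  /* bytes queued on the link */
void free_cluster_link(clusterLink *link);            /* drops link and its buffer */

/* Called after appending a message to a link's send buffer. */
void enforce_cluster_link_buffer_limit(clusterLink *link) {
    if (cluster_link_buffer_limit == 0) return;
    if (cluster_link_sndbuf_bytes(link) > cluster_link_buffer_limit) {
        /* Drop the link instead of letting the buffer grow toward maxmemory;
         * the cluster cron will reconnect and the peers will resync state. */
        free_cluster_link(link);
    }
}
```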
Comment From: abhaagar
To further add to (1), there should be some kind of heartbeat exchange between every pair of nodes every y secs. If a node does not receive heartbeats continuously for x secs/mins, the peer node should be marked as DOWN, at which point the outbound connection should be closed.
Comment From: yossigo
@ny0312 Thank you for the detailed analysis. I tend to agree about the short-term and long-term goals here, but I'm wondering about the best way to achieve [1] - perhaps all we need is a more aggressive TCP keepalive? I agree that in theory doing an explicit liveness check will cover more cases, but in practice I'm not sure there will be a real difference.
@madolson any other thoughts about this?
Comment From: ny0312
@yossigo Thank you for your reply. I fully agree with your assessment on the pros and cons of TCP keepalive:
- It doesn't cover connection "leaks" due to application-level (i.e. Redis) bugs. Imagine a code bug where P1 doesn't correctly register read/write handlers for its outbound cluster bus link toward P2 even though the underlying TCP connection is perfectly healthy.
- It doesn't directly close the connection. It generates an IO event and relies on the application (i.e. Redis) to catch and properly handle that event to close the connection. This doesn't work if that second part of the deal doesn't happen due to a software bug.
For these reasons, I still lean toward an application-level liveness check. Thoughts? Curious what others think too.
Comment From: ny0312
@kuriousengineer You are essentially proposing that Redis should mark a peer as fail? or fail if one of the inbound/outbound connections with that peer is unhealthy.
It's an interesting question and I'm curious about what the core members think.
Personally, I didn't want to propose that, because I feel it makes sense not to mark a peer fail?/fail as long as one of the connections is healthy and continuously has traffic; that proves the peer node itself is alive and responsive. The unhealthy connection is most likely due to a network issue or the like, which is not what the fail?/fail flags are for.
Comment From: ny0312
Created a PR to introduce memory management on cluster link buffers (item #2 in the original post): https://github.com/ny0312/redis/pull/1
Comment From: madolson
@ny0312 Can you re-submit the PR against the main redis repo instead of your fork?
Comment From: abhaagar
@ny0312 I am proposing to mark a peer as fail if heartbeats are missed for a period.