To establish a connection, there need to be at least two cycles of the event loop: the first to accept the connection and the second to read the first query from the client. If Redis is running very slowly, the time between these two interactions might exceed the client's timeout, causing it to tear down the connection and try to connect again. This can cause a significant amount of churn and, in some cases, a connection storm that dramatically impacts the performance of the Redis process.
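
For concreteness, here is a simplified sketch of that per-iteration accept pattern. The constant mirrors the accept count mentioned below, but the handler and `register_read_handler()` are illustrative assumptions, not the actual Redis source:

```c
#include <sys/socket.h>
#include <stddef.h>

#define MAX_ACCEPTS_PER_CALL 1000

void register_read_handler(int fd);  /* hypothetical: arms a read event for the new client */

/* Runs when the listening socket is readable (event loop cycle 1): accept
 * pending connections and register read handlers. The client's first query
 * is only read when its read event fires on a later event loop cycle. */
void accept_tcp_handler(int listen_fd) {
    int max = MAX_ACCEPTS_PER_CALL;
    while (max--) {
        int client_fd = accept(listen_fd, NULL, NULL);
        if (client_fd == -1) return;  /* backlog drained (or transient error) */
        register_read_handler(client_fd);
    }
}
```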

For TLS this is even worse, since the TLS connection needs multiple round trips to establish (and these round trips are CPU-expensive because of the public/private key cryptography involved). We should prefer to fully establish some connections and let the remaining ones retry.

My suggestion is that we "limit" the number of in-flight connections at a given time to some value, let's say 1000 (since that is the current accept count), to bound the impact of this issue. This implicitly prioritizes some clients so that they are established more quickly and can start serving traffic. @yossigo would appreciate your thoughts on this.
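
As a rough illustration of the proposed cap (the counter, constant, and hook names here are hypothetical, not existing Redis code):

```c
#define MAX_INFLIGHT_CONNS 1000

static int inflight_conns = 0;  /* accepted, but establishment still pending */

/* Checked before draining the listen backlog: once too many connections are
 * mid-establishment (first query not yet read / TLS handshake incomplete),
 * leave the rest queued in the kernel so the already-accepted clients can
 * finish establishing and start serving traffic. */
int should_accept_more(void) {
    return inflight_conns < MAX_INFLIGHT_CONNS;
}

void on_connection_accepted(void)    { inflight_conns++; }
void on_connection_established(void) { inflight_conns--; }
```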

Comment From: JimB123

I would go a step further and say that the "accept count" of 1000 is a wildly inappropriate value.

This is indeed a "prioritization" issue, not a limiting or throttling issue. When the event loop is behaving properly, it is spinning very fast. The event loop only slows down when CPU is at 100%. Once the CPU reaches 100%, some prioritization of work is needed. Specifically, handling existing connections is generally more important than accepting new connections.

Reducing the "accept count" gives greater priority to existing connections without starving new connections. During periods of "normal" traffic (when CPU is < 100%), the event loop will spin very fast, allowing lots of connections to be accepted. Once the CPU reaches 100% and the event loop begins to slow, the rate of new connection accepts should slow correspondingly, allowing priority to be given to existing connections.

Lowering the "accept count" all the way to 1 achieves the goal of prioritizing existing connections (including in-flight connection establishment) above new connections, without imposing artificial limits on the new-connection rate or the number of in-flight connection establishments.
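
For comparison, a minimal sketch of what an accept count of 1 looks like per iteration (`register_read_handler()` is the same hypothetical helper as in the earlier sketch; this is illustrative only):

```c
#include <sys/socket.h>
#include <stddef.h>

void register_read_handler(int fd);  /* hypothetical helper, as above */

/* Accept at most one connection per event loop iteration. If the backlog
 * still has pending connections, the listening socket remains readable and
 * the next iteration accepts the next one, so events for existing clients
 * are serviced in between accepts. */
void accept_one_per_iteration(int listen_fd) {
    int client_fd = accept(listen_fd, NULL, NULL);
    if (client_fd == -1) return;       /* nothing pending (or transient error) */
    register_read_handler(client_fd);  /* first query handled on a later cycle */
}
```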

Comment From: madolson

I don't think the solution you outlined achieves anything meaningfully different, and it introduces more aggressive epoll calls in between accepts. The target I want to hit is getting some clients to the accepted state more quickly.

Lowering the "accept count" all the way to 1 achieves the goal of prioritizing existing connections (including in-flight connection establishment) above new connections, without imposing artificial limits on the new-connection rate or the number of in-flight connection establishments.

You are just proposing a different artificial limit, no?

Comment From: JimB123

You are just proposing a different artificial limit, no?

No, this isn't a limit. Unencumbered, the event loop spins very fast - much faster than any incoming connection rate. Lowering the accept count is a prioritization mechanism, not a limiting mechanism.

Consider that if the CPU is < 100%, there is idle time in the event loop. During this idle time, the loop can spin fast enough to accept lots of connections. However, if the CPU is at 100% (no idle time), then it's important to give greater priority to existing work (preventing activities from timing out).

Contrast this with a LIMIT on in-flight connection establishments. If the client side is slow in responding to TLS negotiation messages, the engine could be idle even with a large number of new connections in flight. Placing limits on in-flight connection establishments in this type of scenario would artificially slow connection establishment. Conversely, consider that given any fixed limit on in-flight connection establishment, there's no way to ensure that existing work will be prioritized sufficiently to complete without timing out.

As long as the cost of epoll is much smaller than the cost of connection establishment, the cost of the extra spins is unimportant. (Of course this should be measured/verified.)

Comment From: madolson

Ok, it sounds like there is some testable work we can do to evaluate the two approaches. Both approaches have some assumptions we should investigate.

Comment From: madolson

Consider that if the CPU is < 100%, there is idle time in the event loop. During this idle time, the loop can spin fast enough to accept lots of connections. However, if the CPU is at 100% (no idle time), then it's important to give greater priority to existing work (preventing activities from timing out).

Thinking about CPU time is a poor model for Redis generally, though. Redis workloads can be instantaneously spiky while averaging out to relatively low CPU, and the times when Redis is fully utilized are the interesting bit. Your approach gives us no ability to "burst" into spare capacity, on the assumption that epoll is free. Perhaps the optimal solution is a mix of both: if we aren't doing much, we attempt to accept more connections, and when there is a lot of other work going on, we try to accept fewer.
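
Something like the following sketch, where the load signal, constants, and thresholds are purely hypothetical rather than an existing Redis option:

```c
#define ACCEPTS_WHEN_IDLE 1000
#define ACCEPTS_WHEN_BUSY 16

/* Scale the per-iteration accept budget by how busy the previous event loop
 * iteration was: a short iteration means headroom, so burst new accepts; a
 * long iteration means lots of existing work, so accept only a few and let
 * pending connections wait in the kernel backlog. */
int accept_budget(long long last_iteration_us) {
    if (last_iteration_us < 1000)   /* under ~1ms: effectively idle */
        return ACCEPTS_WHEN_IDLE;
    return ACCEPTS_WHEN_BUSY;
}
```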

Comment From: yossigo

@madolson Is there some real-world context you can share that led you to come up with this idea? I also think of this as a way to increase fairness and avoid starvation, not an artificial limit, although an extreme setting (e.g. max accept count = 1 and a small TCP backlog setting) will probably end up acting as a hard limit.

In the context of TLS, I think anything we can come up with will probably have limited value and what we really need is some form of I/O threads (done right).

Comment From: madolson

@yossigo Yeah, we've seen a lot of connection storm cases, where clients establish too many connections, bog down Redis, and none of them finish. @JimB123 is the expert here and can provide some more detailed examples. TLS is really the problematic case, where it's possible to commit to a lot of future TLS work and have it take forever in Redis.

I/O threads, yeah that needs to happen.