Context
Today Redis provides partial server-side latency information on command execution, in the form of the total usec spent in a given command and the average usec_per_call, which basically track the call() time for a given command.
These stats report the actual CPU time a command execution took, which is only a portion of the total time it took Redis to process the command; the total also includes the time the command spent waiting due to Redis multiplexing. For example, say two commands (e.g., GET and EVAL) are fully received in the socket buffer, and Redis then executes aeProcessEvents. First Redis processes the GET (reads the socket, parses, then executes the command), which takes 5us, and places the response in the client output buffer. Then Redis executes the EVAL, which takes 1000us, and places the response in the client output buffer. In beforeSleep, Redis sends the responses to the clients. At this point, the latency of both the GET and the EVAL is approximately 5 + 1000us (assuming little overhead from other functionality in beforeSleep and an instantaneous writeToClient). The reason is that before flushing the GET response from the client output buffer, Redis had to process the EVAL. Similarly, before processing the EVAL, Redis had to process the GET.
EVAL was used to illustrate that a slowdown in one of the tasks during processSingleFileEvent may slow down the rest of the commands that are queued up for processing during the current iteration of the event loop. That can be particularly interesting for commands (e.g., GET) that have to wait for other commands that are performing an fsync because of AOF. While it may be trivial to understand why a SET command is slow when AOF is enabled, from the client side it does not make sense to perceive additional latency on a GET as a side effect of that.
Proposal
A server-side, per-command latency metric that tries to approach as much as possible the time from receiving the last byte of a command until the first byte of its response is sent to the client. The per-command latency can be reported using histograms (e.g., LATENCY HISTOGRAM) to give insights about spikes, as opposed to the current commandstats reporting, which only displays an aggregate.
The per-command latency could incorporate various components, including multiplexing overhead, processing, module blocking, AOF, etc. We could consider blocking-command latency as well, but that does not appear to be actual server-side induced latency. Eventually, an enhanced version of slowlog could be implemented to incorporate all of these components.
While the best way to measure latency is to track end-to-end latency from the client side, a server-side metric like this is very useful to triage issues and identify which component is contributing to the latency (i.e., is it Redis or the network link).
Comment From: oranagra
it would be nice if we could have that feature, but i currently don't see how to do that.
we could easily mark the time when we read the last argument of a command from the query buffer into the argv array, but that's anyway nearly the same time when we start processing the command (after possibly being held back by other clients or even other commands of the same client). so what you want is to know when we read these bytes into the query buffer (where it's hard to map which command they belong to), or even better, when these bytes were ready to be read (since even reading into the query buffer is delayed by processing other clients).
the other side of that issue is recognizing when the last byte of the reply was written to the socket, but same here, a single node in the output buffer list can represent more than one command, so we'll need to keep some mapping between offsets and commands.. too complicated IMHO.
Comment From: YacineTALEB
I agree with you that implementing a precise server-side latency would add lots of complexity. We would need to know when the command was actually received in the socket buffer, which may require tinkering with the kernel. However, we can implement a fairly representative metric without much complexity.
We could start a timer every time the event loop wakes up. At that time we are going to read the socket buffers. If a command's last byte is in the socket buffer, then we would have gotten an event notification from epoll and will process it as part of the current iteration of aeProcessEvents. Then how do we differentiate between partially received commands and fully received ones? We can simply assign the current event loop timestamp in processCommandAndResetClient, before calling processCommand.
When to close a timer? Your remark is spot on: we would need to keep track of individual response offsets if we wanted to be precise, but we don't have to. We can just declare that a command has been fully processed once we try to flush the client output buffer.
My argument is that we do not want to capture client-side side effects. For example, a client may be very slow at reading its socket, which eventually will impact our computation of server-side latency (you may argue that server-side network issues can impact this as well). But we do not want that to be part of server-side latency. We fully processed a command and put its reply into the client output buffer; the boundary should be the moment we reach writeToClient, and that is when we end the timer.
To summarize:
- Start timer: when the event loop wakes up, for fully received and ready-to-be-processed commands
- End timer: when trying to flush the client output buffer
There are edge-cases when it comes to calling the event loop recursively, blocking commands, and pipelined commands, but those can be handled pretty easily.
Comment From: oranagra
so you're actually saying that we should measure the entire event loop cycle time, and attribute that time to all the commands that were processed in that cycle. i.e. when the cycle started, the last byte of each of the commands was already in the server socket, and when the cycle ended we at least attempted to write the reply (we don't attempt any writes before completing all the commands of that cycle). so the coding challenge here is just to keep track of all the commands that were executed in each cycle.
i'm not certain how useful this will be for users, arguably, adding one global metric representing the time (or some histogram) of the event loop cycle duration, can serve some of these purposes too.
keep in mind that tracking the commands and updating their stats is gonna slow down redis, so i'm not sure the overhead is worth the value we get from this feature.
@filipecosta90 FYI maybe you have something to contribute here.
Comment From: YacineTALEB
so you're actually saying that we should measure the entire event loop cycle time, and attribute that time to all the commands that were processed in that cycle.
Correct, that is the actual server-side latency, one command's execution time could impact all of the remaining pipeline.
so the coding challenge here is just to keep track of all the commands that were executed in each cycle.
It can be done by keeping a timer for each client. That timer is 0 if the client has no command being processed. When Redis starts processing a command from that client, it stamps the timer. Then the timer is captured and reset to 0 when the client output buffer is flushed. It works for most use cases, except pipelining, which may require some additional information, as a client may start getting processed in a certain event loop iteration, get blocked, then get reprocessed in a subsequent one.
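A minimal sketch of what that per-client bookkeeping could look like (standalone C; the `cmd_start_us` field and function names here are hypothetical, not existing Redis code paths):

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical per-client bookkeeping; field names are illustrative,
 * not the actual client struct. */
typedef struct client {
    uint64_t cmd_start_us;   /* 0 = no command currently being processed */
} client;

static uint64_t now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000ULL + ts.tv_nsec / 1000;
}

/* Called when we start processing a fully received command for this
 * client (e.g. from processCommandAndResetClient), stamping it with the
 * timestamp taken when the event loop woke up. The stamp is taken only
 * once, so pipelined commands handled in the same cycle share it. */
void latencyStampCommandStart(client *c, uint64_t event_loop_start_us) {
    if (c->cmd_start_us == 0)
        c->cmd_start_us = event_loop_start_us;
}

/* Called when we flush the client output buffer (e.g. right before
 * writeToClient): close the timer and return the sample, which the
 * caller would record in a histogram. */
uint64_t latencyCloseCommandTimer(client *c) {
    if (c->cmd_start_us == 0) return 0;
    uint64_t duration = now_us() - c->cmd_start_us;
    c->cmd_start_us = 0;     /* ready for the next command */
    return duration;
}
```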
i'm not certain how useful this will be for users, arguably, adding one global metric representing the time (or some histogram) of the event loop cycle duration, can serve some of these purposes too.
This can become useful to understand why certain commands, e.g., GET, get higher latencies. You can correlate, over a certain duration, that certain command categories' latencies jumped and then impacted others. For instance, if a user enables aof and their reads start getting higher latencies, they may not even notice. Having a metric that showcases the added latency on both writes and reads during that time frame can be super useful and is currently missing from Redis.
keep in mind that tracking the commands and updating their stats is gonna slow down redis, so i'm not sure the overhead is worth the value we get from this feature.
The HdrHistogram library used by Redis provides O(1) access and I believe it was benchmarked by @filipecosta90 already?
My only concern would be keeping histograms per individual command, which may incur additional memory usage and may not be super useful IMHO. Thoughts?
Comment From: filipecosta90
The HdrHistogram library used by Redis provides O(1) access and I believe it was benchmarked by @filipecosta90 already?
indeed it's O(1) but the overhead is around 0.8-1.2% depending on how heavy the command logic is. Thus another 1% of extra CPU cycles needs to be discussed. If we add the time-tracking overhead on top, I believe we might be talking about an extra 2-3% impact on CPU cycles. Agree @oranagra ?
As mentioned, the memory overhead will also depend on how granular the info needs to be.
WRT the info this type of data will provide, I guess we will be able to deduce saturation metrics out of it. But won't the CPU usage (which is already provided) + the internal latency simply be enough -- assuming we're not willing to pay the price of the extra overhead?
Comment From: YacineTALEB
indeed it's O(1) but the overhead is around 0.8-1.2% depending on how heavy the command logic is. Thus another 1% of extra CPU cycles needs to be discussed. If we add the time-tracking overhead on top, I believe we might be talking about an extra 2-3% impact on CPU cycles. Agree @oranagra ?
Is your assumption based on the fact that we log the latency after every command? In the approach I propose we only log the latency in beforeSleep when flushing the client output buffer, so I expect the overhead to be lower.
WRT the info this type of data will provide, I guess we will be able to deduce saturation metrics out of it. But won't the CPU usage (which is already provided) + the internal latency simply be enough -- assuming we're not willing to pay the price of the extra overhead?
Which internal latency are you referring to?
Comment From: oranagra
Is your assumption based on the fact that we log the latency after every command? In the approach I propose we only log the latency in beforeSleep when flushing the client output buffer, so I expect the overhead to be lower.
but the histogram is collected per command. i.e. you track the entire event cycle time, and then add that to all the commands executed in this cycle. right?
Comment From: YacineTALEB
but the histogram is collected per command. i.e. you track the entire event cycle time, and then add that to all the commands executed in this cycle.
That is correct. We'd only save some cycles for pipelined commands that would have the same latency value (same event loop execution).
I suppose this feature could be hidden behind a feature flag as well?
Comment From: madolson
Sorry for chiming in so late. This has been a common ask at AWS. The current "latency" metric isn't really latency at all, but rather CPU resource consumption. Knowing the CPU consumption is good for understanding which commands are causing Redis to behave slowly, but not necessarily for understanding the overall performance of the system. In reality, the total latency is composed of networking, CPU time, IO wait time (AOF), and congestion (other commands executing before ours is handled).
The idea here is useful in that it's trying to capture some of those other pieces, namely congestion and IO wait time, and I think that information is worth collecting.
One suggestion that I think is particularly good is to just collect this information for read and write commands (maybe other commands as well?). This removes a lot of the extra memory required for the histograms. Maybe this could be enabled by default, and the full per-command overhead could be enabled by opting in.
The design outlined seems pretty simple to me. We just need to record the time at the start of the event loop, attach that to all of the clients, and record that information when we write the command results out to clients. It's just some extra bookkeeping during command execution.
Comment From: filipecosta90
One suggestion that I think is particularly good is to just collect this information for read and write commands (maybe other commands as well?). This removes a lot of the extra memory required for the histograms. Maybe this could be enabled by default, and the full per-command overhead could be enabled by opting in.
@madolson what about using the command group instead of just the default write/read aggregation? In this manner the overhead is kept minimal and the breakdown will further help in understanding which group of commands is causing congestion on the system.
Comment From: madolson
@filipecosta90 I think read/write/other and fast/slow are probably the two important groupings. I don't see any real issue with adding the remaining categories though.
Comment From: madolson
@yossigo @soloestoy @itamarhaber Do you guys have any input on whether or not this is a good feature?
Comment From: soloestoy
it would be nice if we could have that feature, but i currently don't see how to do that.
I don't know either....
This has been a common ask at AWS. The current "latency" metric isn't really latency at all, but rather CPU resource consumption. Knowing the CPU consumption is good for understanding which commands are causing Redis to behave slowly, but not necessarily for understanding the overall performance of the system. In reality, the total latency is composed of networking, CPU time, IO wait time (AOF), and congestion (other commands executing before ours is handled).
we receive lots of complaints about the "latency" too...
what we did is add a new latency event, "eventloop", to record the whole event loop cycle, including io/process/aof/module/replication/eviction/expire...
```diff
diff --git a/src/server.c b/src/server.c
--- a/src/server.c
+++ b/src/server.c
@@ -2155,6 +2155,8 @@ void beforeSleep(struct aeEventLoop *eventLoop) {
      * releasing the GIL. Redis main thread will not touch anything at this
      * time. */
     if (moduleCount()) moduleReleaseGIL();
+
+    long long duration = (server.el_start == 0) ? 0 : (ustime() - server.el_start);
+    latencyAddSampleIfNeeded("eventloop", duration/1000);
 }
@@ -2162,6 +2164,7 @@ void beforeSleep(struct aeEventLoop *eventLoop) {
 /* This function is called immadiately after the event loop multiplexing
  * the different events callbacks. */
 void afterSleep(struct aeEventLoop *eventLoop) {
     UNUSED(eventLoop);
+    server.el_start = ustime();
     if (moduleCount()) moduleAcquireGIL();
 }
```
Comment From: uvletter
So do we... We have a similar solution to @soloestoy's, with the difference that we add it to `slowlog` rather than `latency`. Besides, since there's plenty of work to do in `beforeSleep`, we count the time `beforeSleep` consumes, and add many optional tracepoints to `beforeSleep` to find out which function slows down the reaction.
```c
SLOWLOG_EVENT();
if (server.aof_state == AOF_ON || server.aof_state == AOF_WAIT_REWRITE)
    flushAppendOnlyFile(0);

SLOWLOG_EVENT();
/* Handle writes with pending output buffers. */
handleClientsWithPendingWritesUsingThreads();

SLOWLOG_EVENT();
/* Close clients that need to be closed asynchronous */
freeClientsInAsyncFreeQueue();
```
Usually `handleClientsWithPendingWritesUsingThreads` would cause a latency spike under heavy load.
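For context, `SLOWLOG_EVENT` appears to be a tracepoint macro from that fork; a rough, purely illustrative sketch of what such a macro might look like (printing to stderr rather than feeding slowlog, and not an actual Redis API):

```c
#include <stdio.h>
#include <time.h>

static long long tracepoint_prev_us;

static long long tracepoint_now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}

/* Each invocation reports how long the code since the previous
 * tracepoint took, tagged with file/line so the slow step can be
 * identified. A real implementation would add the sample to
 * slowlog/latency instead of printing it. */
#define SLOWLOG_EVENT() do {                                          \
    long long now = tracepoint_now_us();                              \
    if (tracepoint_prev_us)                                           \
        fprintf(stderr, "tracepoint %s:%d took %lld us\n",            \
                __FILE__, __LINE__, now - tracepoint_prev_us);        \
    tracepoint_prev_us = now;                                         \
} while (0)

/* Example: instrument a few steps the way described above. */
void beforeSleepInstrumented(void) {
    SLOWLOG_EVENT();
    /* flushAppendOnlyFile(0); */
    SLOWLOG_EVENT();
    /* handleClientsWithPendingWritesUsingThreads(); */
    SLOWLOG_EVENT();
}
```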
Comment From: YacineTALEB
The approach used by @uvletter and @soloestoy makes a lot of sense. It is simple, low overhead, yet provides good insights.
However, as pointed out by @uvletter, some components can take time, like flushing output buffers. Example: if a single event loop iteration takes 10ms to execute 10 commands, and the individual command latencies look like cmd1 - 1ms, cmd2 - 2ms, ..., cmd10 - 10ms, then we'd be losing useful information by just reporting 10ms as the total latency.
The second problem I see is that blocking (e.g., in modules) could span multiple event loops. Today we have the option of reporting blocking time in slowlog; however, it still doesn't account for multiplexing/aof/etc. In order to get the true total server-side latency for those commands, we would somehow need to track past event loop execution times and then reconstruct the total latency for those commands.
Finally, I think it's important (after solving both problems) to segregate between command categories. When reported as histograms, I expect these metrics would explain most of the issues a user would see and allow them to confirm whether or not Redis is actually responsible for any added latency they observe from the client layer.
Ideally, these server-side metrics should be reported in a slowlog with a breakdown of the main latency components to be most useful. But I believe this is the first step towards that goal.
Comment From: oranagra
i see value in some "event loop" latency metric. i'd rather it be an INFO field that can be plotted in a monitoring tool (e.g. Grafana), maybe both an "instantaneous" metric (easy to consume from redis-cli) and a raw cumulative latency (better for monitoring software), and possibly some metric for the max recent latency, in some way that is able to detect spikes. this would help monitor and realize what redis is doing at any given time.
maybe it'll be useful to also have something similar per client (in client list), i.e. the cumulative and recent / max latency of all event loops that this client was involved in.
however, i'm not certain about trying to attribute that to commands or command categories. i guess we don't need that in order to find the offensive command (e.g. a big script delaying other commands of the same event loop); for that we already have the slowlog and command stats. so i guess you want that in order to find the affected commands, but i'm not sure what that's useful for. arguably, if you see the global event loop latency metric rise or spike, you know you have a problem, so do you really need to know which commands were (or are) affected, or only to find the offensive one?
i.e. there could be two interesting scenarios, one is a constant long event loop cycle (server under constant load), and the other is a single latency spike due to some long command or long operation (e.g. fsync, expire cycle). in the first case (constant load), it doesn't really matter which commands or clients are affected, they all are. in the second case, i'd argue that maybe we wanna add a list of commands or clients to each slowlog entry (listing all the commands that got processed in the same event loop cycle as the slow command, since they're all affected due to the fact that we flush output buffers only at the end of the event loop). but i don't see how it'll help to keep a per command or category stat for that.
btw, the total event loop cycle duration doesn't always reflect the latency the command experienced. other than what was already said about not including any network or OS buffer latency (being an under estimation), it can also be an over estimation in some cases. imagine a case where redis wakes up with two sockets readable. it starts processing the commands from the first one, and let's say it contains a script that takes 100ms to finish. then it proceeds to read from the other socket; some bytes there have already been waiting for 100ms, but it could be that other bytes only arrived right before the socket read, so attributing 100ms of latency to all the commands we process in that event loop may not always be right!
Comment From: yossigo
I agree that SLOWLOG is often not enough, and Redis needs to provide more insight into latency, but it needs to give accurately measured metrics. Using heuristics and extrapolations to expose data that users perceive as accurate can lead to confusion when troubleshooting performance issues.
The problem we're trying to solve here is to help users answer two main questions:
1. What's the command latency my application perceives? I don't think there's an alternative to tracking this on the client.
2. What's contributing to latency on the server side? We need better observability to answer this, but I don't see why it needs to be done synthetically on a per-command basis.
As @soloestoy and @uvletter pointed out, there's much more to gain by tracking the event loop globally. Beyond fine-grained sampling of different loop stages, we can also count the number of commands per loop, clients per loop, etc.
Comment From: oranagra
so what's the next step here?
anyone wanna propose a design and a PR for some info fields to track the event loop?
how about:
* eventloop_cycles - ever increasing counter
* eventloop_duration_sum - cumulative duration so an external tool can compute deltas and average (using the cycles metric)
* eventloop_duration_max - should this be the peak since CONFIG RESETSTAT, or have some kind of decay?
* instantaneous_eventloop_cycles - cycles per second in the last 1.6 seconds
* instantaneous_eventloop_duration_avg - average duration in the last 1.6 seconds
* instantaneous_eventloop_duration_max - max duration in the last 1.6 seconds
regarding eventloop_duration_max: like the other two metrics above it, i'd like it to enable an external monitoring tool that samples INFO at some interval (e.g. once a minute) to still detect peaks (unlike the "instantaneous" metrics, which fade out quickly and would hide a spike).
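A rough sketch of the bookkeeping behind the cumulative fields, assuming the cycle duration is measured between afterSleep and beforeSleep as in the diff above (the struct and function names are illustrative; the instantaneous_* variants would presumably reuse the same sliding-window mechanism behind instantaneous_ops_per_sec):

```c
#include <stdint.h>

/* Illustrative global counters backing the proposed INFO fields. */
struct eventloop_stats {
    uint64_t cycles;        /* eventloop_cycles */
    uint64_t duration_sum;  /* eventloop_duration_sum, in usec */
    uint64_t duration_max;  /* eventloop_duration_max, cleared by CONFIG RESETSTAT */
};

static struct eventloop_stats el_stats;

/* Called once per cycle, e.g. at the end of beforeSleep, with the
 * duration measured since afterSleep. */
void eventLoopRecordCycle(uint64_t duration_us) {
    el_stats.cycles++;
    el_stats.duration_sum += duration_us;
    if (duration_us > el_stats.duration_max)
        el_stats.duration_max = duration_us;
    /* the instantaneous_* metrics would feed duration_us into the same
     * kind of decaying sliding-window sampler used for
     * instantaneous_ops_per_sec. */
}
```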
Comment From: madolson
Personally I would really prefer we give some type of last byte in to last byte out metric, since that is the real measure of latency. Eventloop duration can both overestimate and underestimate the actual time it took to process a command. We've talked about refactoring how IO works, multithreading or some such, and in that world eventloop_cycles may no longer be a relevant metric. I feel like we're pushing the eventloop related metrics because we don't actually want to commit to a more accurate metric, while at the same time being afraid to advertise a metric as more accurate. I don't feel all that strongly though, and would rather we publish something rather than nothing.
regarding eventloop_duration_max: like the other two metrics above it, i'd like it to enable an external monitoring tool that samples INFO at some interval (e.g. once a minute) to still detect peaks (unlike the "instantaneous" metrics, which fade out quickly and would hide a spike).
I think we should at least consider a histogram of eventloop_duration time. One common issue we've heard at AWS is the lack of introspective metrics into the distribution of what is going on in Redis. (I suppose that was the motivation for opening this issue.) Just exposing that the event loop "has recently been congested" is only somewhat helpful, since it doesn't really explain whether there was a P90, P99, or P100 issue. Generally Redis is bad at amortizing instantaneous spikes in utilization, e.g. suddenly sending 10k commands to Redis, since it needs to sequentially process all of those commands, which causes latency spikes from the client perspective. I think we need to give a clear indication to identify these types of issues, i.e. whether it was very localized or it's a frequent event.
should this be the peak since CONFIG RESETSTAT, or have some kind of decay?
I think reset stats is sufficient if we implement a histogram with scalar values. Most collectors of data I have seen compute the deltas within their application and generate metrics from it.
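A sketch of what recording event loop durations into a histogram could look like, using the HdrHistogram C library that Redis already vendors for the per-command latency histograms (assuming its standard hdr_init/hdr_record_value/hdr_value_at_percentile API; the integration point and bounds are illustrative):

```c
#include <stdio.h>
#include "hdr_histogram.h"  /* vendored by Redis under deps/hdr_histogram */

int main(void) {
    struct hdr_histogram *el_hist;

    /* Track 1us .. ~1h with 3 significant digits. */
    if (hdr_init(1, 3600000000LL, 3, &el_hist) != 0) return 1;

    /* In Redis this would be called once per event loop cycle with the
     * measured duration in usec; here we just record a fake sample. */
    hdr_record_value(el_hist, 1500);

    printf("p50=%lld p99=%lld p100=%lld (usec)\n",
           (long long)hdr_value_at_percentile(el_hist, 50.0),
           (long long)hdr_value_at_percentile(el_hist, 99.0),
           (long long)hdr_value_at_percentile(el_hist, 100.0));

    hdr_close(el_hist);
    return 0;
}
```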
Comment From: uvletter
Personally I would really prefer we give some type of last byte in to last byte out metric, since that is the real measure of latency.
I guess last byte in to last byte out may still not be a real measure of latency, since beforeSleep, afterSleep or cron tasks may take some time, but redis runs serially, so requests arriving at that time have to wait in the OS buffer until redis is ready to read them.
I also like the idea of a histogram; it exposes more detailed information for observing the performance.
Comment From: oranagra
anyone wanna try to make a PR? at least for the simple things mentioned here
Comment From: soloestoy
We'd love to, @CharlesChen888 let's finish it.
Comment From: oranagra
btw, maybe in addition to that list, we can also try to break down the event loop composition by having something like:
* eventloop_duration_cmd_sum - total execution time of all commands (measured in call)
* eventloop_duration_io_write_sum - total time doing writes to sockets
* eventloop_duration_io_read_sum - total time doing reads from sockets
* eventloop_duration_aof_sum - total time writing or fsyncing to aof
* eventloop_duration_cron_sum - total time spent in serverCron, and beforeSleep (excluding writes and aof)
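A rough sketch of how such per-stage sums could be accumulated by wrapping each stage in a simple stopwatch (the struct and field names are illustrative, not existing Redis internals):

```c
#include <stdint.h>
#include <time.h>

/* Illustrative per-stage duration sums backing the proposed INFO fields. */
struct eventloop_breakdown {
    uint64_t cmd_sum;       /* eventloop_duration_cmd_sum */
    uint64_t io_write_sum;  /* eventloop_duration_io_write_sum */
    uint64_t io_read_sum;   /* eventloop_duration_io_read_sum */
    uint64_t aof_sum;       /* eventloop_duration_aof_sum */
    uint64_t cron_sum;      /* eventloop_duration_cron_sum */
};

static uint64_t stopwatch_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000ULL + ts.tv_nsec / 1000;
}

/* Example: wrap the AOF flush stage in beforeSleep. The cmd_sum field
 * could simply reuse the existing call() time measurement instead of
 * being timed again. */
void recordAofStage(struct eventloop_breakdown *b) {
    uint64_t start = stopwatch_us();
    /* flushAppendOnlyFile(0);  -- the actual stage being timed */
    b->aof_sum += stopwatch_us() - start;
}
```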
Comment From: oranagra
@CharlesChen888 any news or plans?
Comment From: CharlesChen888
@CharlesChen888 any news or plans?
@oranagra I will raise a PR within this week.