One challenge faced by Redis users is measuring the end-to-end (e2e) latency of Redis commands. This measurement is crucial for identifying performance issues, optimizing Redis performance, and improving the user experience. However, a pure client-only solution is far from ideal, for a couple of reasons:

  1. Achieving consistency across all Redis clients is impossible. Client A may report latency in a completely different format, and through a different measurement pipeline, than client B.
  2. There's no feedback loop for Redis server administrators, leaving them unaware of how their service is perceived by clients.

To address this challenge, we propose a new set of Redis commands that allow clients to record and view latency and failure data for Redis commands. Here's a quick overview of the proposed solution:

  1. Clients keep track of latency histograms for commands.
  2. Periodically, clients send the latency histograms back to the server.
  3. The server maintains latency histograms for each client and aggregates them at the server level.
  4. The server provides commands for both clients and admins to query the latency data.

Now, let's dive into the details of the design:

STAT RECORD LATENCY

The STAT RECORD LATENCY command records latency data for Redis commands with custom latency buckets. The first report determines the histogram schema (the bucket boundaries), and subsequent reports must use the same schema; otherwise, the command fails. The command format is as follows:

STAT RECORD LATENCY <latencies> [CLIENT_ID <client_id>] [CMD <cmd_1> <counts_1>] [CMD <cmd_2> <counts_2>] ... [CMD <cmd_n> <counts_n>]

  • latencies: A list of latency bucket boundaries in milliseconds. If the list has n elements, it defines n+1 latency buckets (so each counts list must have n+1 elements). The first element is the upper bound of the first latency bucket, and the last element is the upper bound of the second-to-last bucket; latencies at or above the last element fall into an implicit overflow bucket.
  • CLIENT_ID: The ID of the client for which the latency data is recorded. If this token is not provided, the server uses the client ID associated with the connection over which the command is sent.
  • CMD: The name of the Redis command that was executed (e.g., SET, GET, INCR).
  • counts: A list of counts of Redis command executions for each corresponding latency bucket. The list must have one more element than the latencies list (one count per bucket, including the implicit overflow bucket).

For example, the following STAT RECORD LATENCY command records the latency data for two Redis commands with custom latency buckets for client 1234:

STAT RECORD LATENCY 1 2 5 10 CLIENT_ID 1234 CMD SET 10 20 30 40 50 CMD GET 5 10 15 20 30

This command records latency data for a SET command and a GET command using the bucket boundaries 1, 2, 5, and 10 ms, which define the following buckets:

  • 1: [0 ms, 1 ms)
  • 2: [1 ms, 2 ms)
  • 5: [2 ms, 5 ms)
  • 10: [5 ms, 10 ms)
  • Implicit: [10 ms, ∞)

The corresponding counts for each latency bucket for SET are 10, 20, 30, 40, and 50, respectively.
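
To make the bucket semantics concrete, here is a minimal Python sketch of how an observed latency maps to a bucket index (the helper name bucket_index is hypothetical, not part of the proposal):

def bucket_index(latency_ms, boundaries):
    # boundaries is the <latencies> list, e.g. [1, 2, 5, 10]; latencies
    # at or above the last boundary land in the implicit overflow bucket
    for i, upper_bound in enumerate(boundaries):
        if latency_ms < upper_bound:
            return i
    return len(boundaries)

# bucket_index(0.4, [1, 2, 5, 10]) -> 0, i.e., [0 ms, 1 ms)
# bucket_index(7.0, [1, 2, 5, 10]) -> 3, i.e., [5 ms, 10 ms)
# bucket_index(42.0, [1, 2, 5, 10]) -> 4, i.e., the overflow bucket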

STAT VIEW LATENCY

The STAT VIEW LATENCY command allows the user to view the recorded latency data for Redis commands. The command format is as follows:

STAT VIEW LATENCY [CMD <cmd>] [CLIENT_ID <client_id>]

  • CMD: The name of the Redis command for which to view the recorded latency data. If this token is not provided, the command returns data for all Redis commands.
  • CLIENT_ID: The ID of the client for which to view the recorded latency data. If this token is not provided, the command returns data aggregated across all clients.

For example, the following STAT VIEW LATENCY command returns the latency records aggregated across all clients (in the output, CLIENT_ID -1 denotes the all-client aggregate):

STAT VIEW LATENCY

Output:
CLIENT_ID: -1, TIMESTAMP: 2022-05-15 18:32:12, CMD: GET
HISTOGRAM:
0-1ms: 1512444
1-2ms: 712423
2-5ms: 33454
5-10ms: 1123
>10ms: 12
CLIENT_ID: -1, TIMESTAMP: 2022-05-15 18:32:12, CMD: SET
HISTOGRAM:
0-1ms: 15124
1-2ms: 12423
2-5ms: 3454
5-10ms: 123
>10ms: 19
...

The following STAT VIEW LATENCY command returns the latency record for all Redis commands for a specific client with client ID 1234:

STAT VIEW LATENCY CLIENT_ID 1234

Output:
CLIENT_ID: 1234, TIMESTAMP: 2022-05-15 18:32:12, CMD: GET
HISTOGRAM:
0-1ms: 12444
1-2ms: 2423
2-5ms: 454
5-10ms: 23
>10ms: 2
CLIENT_ID: 1234, TIMESTAMP: 2022-05-15 18:32:12, CMD: SET
HISTOGRAM:
0-1ms: 1124
1-2ms: 1423
2-5ms: 354
5-10ms: 13
>10ms: 9
...
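
For completeness, here is a sketch of how a client might issue these queries using redis-py's generic execute_command (the STAT VIEW LATENCY command itself is the proposal's; it does not exist in Redis today):

import redis

client = redis.Redis(host="localhost", port=6379)

# Aggregated histograms across all clients, for all commands
print(client.execute_command("STAT VIEW LATENCY"))

# Histograms for all commands recorded by client 1234
print(client.execute_command("STAT VIEW LATENCY", "CLIENT_ID", "1234"))

# Histogram for a single command, aggregated across all clients
print(client.execute_command("STAT VIEW LATENCY", "CMD", "GET"))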

Server Data Structures

To store the latency reports, we can use a combination of hash maps and lists. There are two global hash maps:

The first hash map tracks per-client latency stats, mapping each unique client ID to a secondary hash map. The secondary hash map has the command name as the key and, as the value, a list of integers representing the histogram bucket counts for that command. The list has one entry per latency bucket, including the implicit overflow bucket.

When a client sends a STAT RECORD LATENCY command, the server receives a list of latency histograms, each representing the bucket counts for a specific command. The server then loops over each histogram and updates the corresponding entry in that client's hash map. If the hash map for the client or the command does not exist, the server creates it and inserts it into the global hash maps.

In addition to the client-specific hash maps, we also maintain a global hash map containing the sum of all clients' latency reports. This hash map has the command name as the key and a list of integers representing the histogram bucket counts as the value. The server adds the counts from each client's report to the corresponding key in the global hash map.

Here is an illustration of the hashmaps:

[Hash Map]
client_latency_stats: { client_id_1 -> { command_1 -> [10, 20, 30, 40, 50],
                                         command_2 -> [5, 10, 15, 20, 25],
                                         command_3 -> [0, 5, 10, 15, 20],
                                         ... },
                        client_id_2 -> { command_1 -> [15, 25, 35, 45, 55],
                                         command_2 -> [7, 12, 17, 22, 27],
                                         command_3 -> [2, 7, 12, 17, 22],
                                         ... },
                        ... }

[Hash Map]
global_latency_stats: { command_1 -> [25, 45, 65, 85, 105],
                        command_2 -> [12, 22, 32, 42, 52],
                        command_3 -> [2, 12, 22, 32, 42],
                        ... }
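
To illustrate the update logic described above, here is a minimal Python sketch of the server-side bookkeeping (illustrative only, with hypothetical names; an actual implementation would live in the server's C code):

from collections import defaultdict

# client_latency_stats: client_id -> {command -> per-bucket counts}
client_latency_stats = defaultdict(dict)
# global_latency_stats: command -> per-bucket counts summed over all clients
global_latency_stats = {}

def record_latency(client_id, reports):
    # reports maps a command name to its bucket counts, e.g.
    # {"SET": [10, 20, 30, 40, 50], "GET": [5, 10, 15, 20, 30]}
    for command, counts in reports.items():
        per_client = client_latency_stats[client_id].setdefault(
            command, [0] * len(counts))
        aggregate = global_latency_stats.setdefault(
            command, [0] * len(counts))
        for i, count in enumerate(counts):
            per_client[i] += count
            aggregate[i] += count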

Client-Side Implementation

To incorporate the STAT RECORD LATENCY command into open-source Redis client libraries, we can add a new method or function that allows clients to record latency data for Redis commands. This method would accumulate a latency histogram for each Redis command the client executes and then use the STAT RECORD LATENCY command to report the histogram data to the server.

In addition to this client-side implementation, we can also develop a proof-of-concept prototype in Python that records command latency and periodically sends the cached latency statistics to the server. This prototype would use the STAT RECORD LATENCY command to report latency data for Redis commands and would provide a way to test the performance and functionality of the proposed commands in a real-world setting.

[Sample Python client code]

import random
import time
import redis

# Create a Redis client
client = redis.Redis(host="localhost", port=6379, password="", db=0)

# Define the latency bucket boundaries in seconds
latency_buckets = [0.001, 0.002, 0.005, 0.01]

# Track one histogram per command; each histogram has one count per
# bucket plus one for the implicit overflow bucket
bucket_counts = {
    cmd: [0] * (len(latency_buckets) + 1) for cmd in ("SET", "GET", "INCR")
}

# Loop through a few Redis commands
for _ in range(10):
    # Choose a Redis command at random
    cmd = random.choice(["SET", "GET", "INCR"])

    # Choose a random key
    key = str(random.randint(0, 99))

    # Execute the Redis command and record its latency
    start = time.time()
    if cmd == "SET":
        client.set(key, random.randint(0, 9999))
    elif cmd == "GET":
        client.get(key)
    elif cmd == "INCR":
        client.incr(key)
    elapsed = time.time() - start

    # Find the appropriate latency bucket index; latencies at or above
    # the last boundary fall into the implicit overflow bucket
    bucket_index = len(latency_buckets)
    for i, bucket in enumerate(latency_buckets):
        if elapsed < bucket:
            bucket_index = i
            break

    # Increment the hit count for this command's corresponding bucket
    bucket_counts[cmd][bucket_index] += 1

# Report all recorded histograms to the server in a single STAT RECORD
# LATENCY command, converting the bucket boundaries to milliseconds
arguments = [str(int(latency * 1000)) for latency in latency_buckets]
for cmd, counts in bucket_counts.items():
    arguments.extend(["CMD", cmd])
    arguments.extend(str(count) for count in counts)
client.execute_command("STAT RECORD LATENCY", *arguments)

Comment From: madolson

I like the idea of us trying to better instrument the latency around Redis, but curious why we think the right approach is to put the metrics into Redis as opposed to a 3P telemetry tool like OpenTelemetry? I am aware it doesn't directly answer the "administrator" problem, but you could work with users to publish to a single repository.

Comment From: QuChen88

I agree with you that keeping track of the latency profile from the client POV is useful for Redis end users. However, I don't think it makes sense to store the client-perceived latency profile on the Redis server side.

Just to share some of my recent experiences - one of our users implemented their caching application in NodeJS. The way latency tracking is instrumented in the client application is something like the following:

start_time = mstime();
client.set('key', 'value');
duration = mstime() - start_time;

However, after running the application for a few days while collecting the latency profile, we noticed that the P50 latency on the client side was abnormally high compared to the server-side command processing latency. Further analysis showed that most of the latency was coming from the client side: the TCP connection socket didn't have TCP_NODELAY enabled, which added significant delay to each command, and NodeJS has a single-threaded event loop, which delayed sending commands to the server depending on what other processing the client application was doing. The P50 latency was reduced by ~10x after fixing those issues.

Essentially, the latency perceived by the client application can be influenced by many factors that are outside the control of the Redis server. Redis doesn't track which type of client or which programming language the client used, so storing this information on the server side doesn't seem like a good idea. For example, if the Redis server is being used by multiple different clients (say, a NodeJS client and a Java client), and Java doesn't have the same performance issue as NodeJS, then the latency tracked by one can pollute the latency tracked by the other, and the administrator might not get an accurate picture of where the performance issue that requires fixing lies. Typically, people instrument latency statistics on the client side, tuned specifically to their own application's needs. I agree the approach of a 3rd-party library to standardize latency stats makes more sense.

Comment From: PingXie

@QuChen88 I read three questions/callouts in your comment.

  1. What performance metrics to collect on the client side?
  2. Where to store the client performance profile?
  3. How to collect these metrics?

These are all orthogonal questions IMO.

For the What question, I agree a simple e2e latency histogram doesn't tell the whole story. We do need the client config/load information, at a minimum, as you indicated above.

For the Where question, I also agree there will be multiple options. Storing them on the server side, though, has the benefit of being a "self-contained" solution, especially from a self-managed user's perspective. That said, I can think of some downsides as well, and one of them could very well be interference with the main workload.

For the How question, I think GLIDE would be a great place to collect all these rich client-side performance metrics :-)

Comment From: QuChen88

Agree, GLIDE is a good place to collect client-side metrics.