Describe the bug

When running redis-benchmark, I found that the mset benchmark on the latest unstable version (#cf83803) is noticeably slower than on v7.4.1, with roughly a 5~10% performance degradation.

To reproduce

Enter the Redis installation directory and run the following script to benchmark mset for 100 rounds:

#!/bin/bash
# Run the mset benchmark TEST_ROUNDS times, pinning redis-server and
# redis-benchmark to separate CPU cores and saving each round's CSV output.
TEST_ROUNDS=100
for i in $(seq 1 $TEST_ROUNDS)
do
    rm -f dump.rdb
    taskset -c 7 ./redis-server --save "" --daemonize yes
    sleep 0.5
    taskset -c 11 ./redis-benchmark -t mset --csv > ${i}.csv
    taskset -c 11 ./redis-cli shutdown
    sleep 0.5
done

Then run the following Python script to integrate the 100 test results into one table:

import pandas as pd

round_number = 100
output_file = 'output.csv'
data = {}

# Read file data
for i in range(1, round_number + 1):
    file_name = f'{i}.csv'
    df = pd.read_csv(file_name, index_col=0)

    for test_name in df.index:
        for metric_name in df.columns:
            if metric_name == 'avg_latency_ms' or metric_name == 'p99_latency_ms':
                key = f'{test_name}@{metric_name}'
                if key not in data:
                    data[key] = [''] * round_number
                data[key][i-1] = df.at[test_name, metric_name]

# Generate a new DataFrame
keys = sorted(data.keys())
output_data = {str(i): [data[key][i-1] for key in keys] for i in range(1, round_number + 1)}
output_df = pd.DataFrame(output_data, index=keys)

# Save as a CSV file
output_df.to_csv(output_file)

I ran the scripts on v7.4.1 and the latest unstable version, obtained 100 rounds of test data for each, and calculated the averages of MSET (10 keys)@avg_latency_ms and MSET (10 keys)@p99_latency_ms:

|        | MSET (10 keys)@avg_latency_ms | MSET (10 keys)@p99_latency_ms |
| ------ | ----------------------------- | ----------------------------- |
| 7.4.1  | 0.21391                       | 0.41932                       |
| latest | 0.24663                       | 0.49308                       |
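For reference, a minimal sketch of how these averages can be computed, assuming the output.csv layout produced by the aggregation script above (rows are "test@metric" keys, one column per round):

import pandas as pd

# Load the aggregated table: rows are "<test>@<metric>" keys,
# columns are the round numbers 1..100.
df = pd.read_csv('output.csv', index_col=0)

for key in ('MSET (10 keys)@avg_latency_ms', 'MSET (10 keys)@p99_latency_ms'):
    # Average the metric across all rounds.
    print(key, df.loc[key].astype(float).mean())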

I drew line charts to show the differences more clearly: https://github.com/user-attachments/assets/bef1be1c-d3da-4d7e-b78d-8908fa83c649 https://github.com/user-attachments/assets/193c68e1-581d-49aa-9ae2-e4fba6e1e739
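The charts were drawn offline; a minimal matplotlib sketch along these lines could reproduce them (the file names output_741.csv and output_unstable.csv are assumed placeholders for the two aggregated tables):

import pandas as pd
import matplotlib.pyplot as plt

# Aggregated tables produced by the script above, one per Redis version
# (file names are placeholders).
old = pd.read_csv('output_741.csv', index_col=0)
new = pd.read_csv('output_unstable.csv', index_col=0)

key = 'MSET (10 keys)@avg_latency_ms'
plt.plot(old.loc[key].astype(float).values, label='7.4.1')
plt.plot(new.loc[key].astype(float).values, label='unstable')
plt.xlabel('round')
plt.ylabel('avg latency (ms)')
plt.legend()
plt.savefig('mset_avg_latency.png')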

I also modified processCommandAndResetClient to measure and print the latency of processCommand when an MSET command is processed:

    struct timespec start, stop;

    /* Time the core command execution. */
    clock_gettime(CLOCK_MONOTONIC, &start);
    int res = processCommand(c);
    clock_gettime(CLOCK_MONOTONIC, &stop);

    long nanos = (stop.tv_sec - start.tv_sec) * 1000000000L + (stop.tv_nsec - start.tv_nsec);
    if (!strcmp((char*)c->argv[0]->ptr, "MSET")) {
        printf("%f us\n", nanos / 1000.0);
    }

Using redis-benchmark -t mset to trigger the instrumentation, processCommand costs 1.89 us on average in 7.4.1 versus 2.17 us in the latest unstable version.
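A small helper to average the printed values, assuming each measurement above is printed on its own line and the server's stdout was redirected into a log file (the name mset_latency.log is a placeholder):

# Average the "<value> us" lines printed by the instrumentation above.
total = 0.0
count = 0
with open('mset_latency.log') as f:
    for line in f:
        line = line.strip()
        if line.endswith('us'):
            total += float(line[:-2].strip())
            count += 1
print(f'avg processCommand latency: {total / count:.2f} us')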

Expected behavior

The mset benchmark on the latest unstable version has the same performance as v7.4.1.

Additional information

After some debugging, I found that #2ec78d2 is likely the main culprit of the performance degradation. The commit added two calls to updateKeysizesList in the dbSetValue function, and it seems this change introduced noticeable overhead for the mset benchmark. I ran the above scripts and drew the charts for the version exactly before this commit (old) and after this commit (new), and the charts show the differences: https://github.com/user-attachments/assets/89bce2a0-88b2-4ce8-a61a-4ff0c0cc21da https://github.com/user-attachments/assets/a672aac0-0ba5-461d-b23a-b2cc95bf9e0d

I also measured the latency of processCommand around this commit: before the commit it costs 1.97 us, after the commit it costs 2.27 us. If I delete the two calls to updateKeysizesList in the dbSetValue function, the performance regression disappears.

Comment From: ShooterIT

Thanks for your report @Gallopm. @moticless could you have a look? Maybe the mset command makes it obvious. It seems our Performance Automation doesn't cover mset, only hmset: https://github.com/redis/redis/pull/13592#issuecomment-2413111949 @fcostaoliveira

Comment From: moticless

Hi @Gallopm, thanks for the investigation. The function updateKeysizesHist is on its own written rather optimally; I guess most of the penalty comes from cache misses.

We can reduce the two calls to a single call by extending the function declaration to become:

void updateKeysizesHist(redisDb *db, int didx, uint32_t type, uint64_t oldLen, uint64_t newLen);
--->
void updateKeysizesHist(redisDb *db, int didx, uint32_t oldType, uint64_t oldLen, uint32_t newType, uint64_t newLen);

Such that in dbSetValue() it will become:

updateKeysizesHist(db, slot, old->type, getObjectLength(old), 0);
updateKeysizesHist(db, slot, val->type, 0, getObjectLength(val));
--->
updateKeysizesHist(db, slot, old->type, getObjectLength(old), val->type, getObjectLength(val));

Before exploring this issue further, I think we should first have benchmark coverage that will back up our effort.

@fcostaoliveira WDYT?

Comment From: fcostaoliveira

@moticless concerning:

Before exploring this issue further, I think we should first have benchmark coverage that will back up our effort. @fcostaoliveira WDYT?

Totally agree. I will add it today EOD / tomorrow and reply back here.