Redis currently builds at the -O3 optimization level, which omits the frame pointer in the resulting binary.
For a long time now, gcc has omitted the frame pointer by default at -O1 and above to improve performance: RBP is freed for use as a general-purpose register, and a push/pop pair disappears from every function's prologue and epilogue. But this makes it difficult to observe a running program. For example, the perf tool can no longer walk stacks reliably, and modern eBPF tools are affected even more.
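To make the trade-off concrete, here is a minimal sketch (mine, not from the original report) of the prologue/epilogue difference on x86-64; the exact instructions vary with the gcc version and target:

```sh
cat > demo.c <<'EOF'
int add(int a, int b) { return a + b; }
EOF

# Frame pointer kept: every call pushes RBP and anchors the frame to it,
# which is exactly the chain that frame-pointer-based stack walkers follow.
gcc -O2 -fno-omit-frame-pointer -c demo.c && objdump -d demo.o
#   push   %rbp
#   mov    %rsp,%rbp
#   lea    (%rdi,%rsi,1),%eax
#   pop    %rbp
#   ret

# Frame pointer omitted (the default at -O1 and above): RBP stays free,
# but the chain of saved frame pointers no longer exists.
gcc -O2 -c demo.c && objdump -d demo.o
#   lea    (%rdi,%rsi,1),%eax
#   ret
```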
Here is a test of mine, using a bcc tool to check Redis for memory leaks. When the frame pointer is kept at compile time, we can easily observe the call stack behind the problem; without the frame pointer, the tool is almost useless.
# test1: add frame pointer when compiling
$ sudo ./redis_memleak -p $(pidof allocs) -O '/proc/$(pid)/exe' --symbols-prefix='je_'
Attaching to pid 2623, Ctrl+C to quit.
[13:30:16] Top 10 stacks with outstanding allocations:
45 bytes in 45 allocations from stack
0x0000559b4789639f zmalloc+0x2f [redis-server]
0x0000559b478876ee serverCron+0x2e [redis-server]
0x0000559b47875e1b processTimeEvents+0x5b [redis-server]
0x0000559b47876e60 aeMain+0x1d0 [redis-server]
0x0000559b478700b7 main+0x4a7 [redis-server]
0x00007fdf47029d90 __libc_start_call_main+0x80 [libc.so.6]
# test2: omit frame pointer when compiling
$ sudo ./redis_memleak -p $(pidof allocs) -O '/proc/$(pid)/exe' --symbols-prefix='je_'
Attaching to pid 3504232, Ctrl+C to quit.
[17:28:30] Top 10 stacks with outstanding allocations:
116 bytes in 28 allocations from stack
zmalloc+0xe [redis-server]
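For reference, the two binaries above can be built roughly like this (a sketch; on the Redis versions I have checked, REDIS_CFLAGS is appended to the final compiler flags by src/Makefile):

```sh
# test1: keep the frame pointer despite -O3
make distclean
make REDIS_CFLAGS="-fno-omit-frame-pointer"

# test2: default build, frame pointer omitted by the optimizer
make distclean
make
```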
Concerns about performance degradation: I tested on an Intel x64 CPU and observed no degradation in get/set performance. Modern CPUs, whether x64 or Arm, already have enough registers, so the gain from omitting the frame pointer is essentially negligible.
Other software: Fedora 38 has also started adding -fno-omit-frame-pointer; see this blog article, which mentions: "Redis benchmarks do not seem to be significantly impacted when built with frame pointers." Meta and Google are also said to compile their internal software with frame pointers.
Comment From: hpatro
@judeng Do you need it during production runs of the engine? I generally add it during profiling/debugging.
https://redis.io/docs/management/optimization/cpu-profiling/
Comment From: judeng
@hpatro Thanks. Yes, I want to profile/debug Redis with eBPF tools in production environments.
When we have many Redis instances online, we constantly run into problems such as memory leaks and latency spikes, which consume a lot of our energy. The frame pointer could help us debug faster, provided we can show that it does not hurt performance.
I wonder whether ElastiCache and similar services keep the frame pointer? Can you share your thoughts?
Comment From: judeng
@yossigo @filipecosta90 Hi, I hope you can take a look when you are free. My benchmarks (set/get commands only) show that the frame pointer does not affect Redis performance.
Comment From: yossigo
@judeng I don't see a problem with that, but we should probably make sure there is no degradation on Arm as well.
Comment From: judeng
@yossigo Thank you, I will test on an Arm CPU. In theory, Arm has more general-purpose registers than x86, so performance is even less likely to regress.
Comment From: zuiderkwast
If there is no performance degradation, this is great. It makes debugging and profiling easier and everyone will get better stack traces in crash reports.
Comment From: judeng
Last week, my colleague ran the benchmark on his M1 MacBook; the results show that the frame pointer causes no performance regression. The average of five tests (requests per second):

| version | set | get |
|:---|:---|:---|
| 7.2.3 with frame pointer | 90338 | 94521 |
| 7.2.3 | 91490 | 92210 |
I'm wondering whether anyone in the community could help with performance testing on an enterprise-grade Arm server, which would be more convincing.
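For anyone reproducing this, a loop along these lines (my sketch, not the exact script we used) collects the per-run throughput for averaging:

```sh
# Run the set/get benchmark five times against a local instance and keep
# only the throughput lines; average them per command afterwards.
for i in 1 2 3 4 5; do
  redis-benchmark -p 6379 -t set,get -r 100000 | grep 'requests per second'
done
```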
Detailed test logs. redis 7.2.3 benchmark with frame pointer:
# redis-benchmark -p 6379 -t set,get -r 100000
cmd GET
throughput summary: 97847.36 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.346 0.128 0.343 0.471 0.607 1.495
throughput summary: 95785.44 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.361 0.112 0.343 0.535 0.983 3.735
throughput summary: 98619.32 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.348 0.136 0.335 0.471 0.727 1.879
throughput summary: 94073.38 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.374 0.136 0.351 0.583 1.167 3.063
throughput summary: 86281.27 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.406 0.144 0.367 0.695 1.383 5.111
cmd SET
throughput summary: 86805.56 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.411 0.136 0.367 0.727 1.479 4.263
throughput summary: 94428.70 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.377 0.104 0.351 0.527 1.207 8.767
throughput summary: 90991.81 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.397 0.144 0.359 0.583 1.431 9.279
throughput summary: 78125.00 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.461 0.112 0.383 1.031 1.775 5.607
throughput summary: 93720.71 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.380 0.168 0.351 0.527 1.503 4.255
redis 7.2.3 benchmark without frame pointer (default):
cmd GET
throughput summary: 73583.52 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.482 0.144 0.399 1.007 1.863 7.967
throughput summary: 100603.62 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.339 0.144 0.335 0.447 0.583 3.167
throughput summary: 96061.48 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.356 0.136 0.343 0.511 0.799 3.687
throughput summary: 93808.63 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.378 0.112 0.351 0.559 1.231 6.015
throughput summary: 96993.21 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.351 0.128 0.335 0.479 0.767 2.807
cmd SET
throughput summary: 83612.04 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.429 0.128 0.367 0.759 1.599 10.231
throughput summary: 90252.70 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.396 0.136 0.367 0.607 1.239 15.711
throughput summary: 96246.39 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.370 0.120 0.351 0.503 1.223 4.823
throughput summary: 96432.02 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.354 0.144 0.343 0.503 0.695 1.959
throughput summary: 90909.09 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.394 0.152 0.359 0.631 1.503 5.247