As discussed in https://github.com/redis/redis/pull/10502#issuecomment-1097937901, it seems that some Linux images on AWS wrongly use xen as the clocksource, which results in sub-optimal performance (a true system call whenever clock_gettime is called).
We would like to add a runtime check to warn users about it (much like what we do with THP). Additionally, we would like to add some INFO fields reporting the clock source (probably both the one we detect the kernel is using, and the one used by our monotonic.c).
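On Linux the clock source currently in use is exposed through sysfs, so both the runtime check and the INFO field could be fed by something like the following (a minimal sketch; `getKernelClocksource` is a hypothetical helper, not existing Redis code):

```c
#include <stdio.h>
#include <string.h>

/* Read the clock source the kernel is currently using.
 * Returns 0 on success, -1 if the sysfs file can't be read. */
static int getKernelClocksource(char *buf, size_t buflen) {
    FILE *fp = fopen("/sys/devices/system/clocksource/clocksource0/"
                     "current_clocksource", "r");
    if (!fp) return -1;
    if (!fgets(buf, (int)buflen, fp)) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    buf[strcspn(buf, "\n")] = '\0'; /* strip the trailing newline */
    return 0;
}
```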
Comment From: yoav-steinberg
For reference, here are some timing measurements based on the custom monotonic.c mentioned in #10502:
| instance | clock source | ns per clock_gettime |
|---|---|---|
| c5.large (Nitro Hypervisor) | tsc | 23 |
| c5.large (Nitro Hypervisor) | kvm-clock* | 27 |
| m4.large (Xen Hypervisor) | tsc* | 27 |
| m4.large (Xen Hypervisor) | xen | 596 |
| c6g.large (Graviton2) | arch_sys_counter* | 35 |

\* AWS recommended setting.
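Numbers like these can be reproduced with a trivial micro-benchmark along the following lines (a rough sketch, not the exact harness used for the table):

```c
#include <stdio.h>
#include <time.h>

/* Average the cost of clock_gettime(CLOCK_MONOTONIC) over many calls. */
int main(void) {
    const long iterations = 1000000;
    struct timespec start, end, ts;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        clock_gettime(CLOCK_MONOTONIC, &ts);
    clock_gettime(CLOCK_MONOTONIC, &end);

    long long elapsed_ns = (end.tv_sec - start.tv_sec) * 1000000000LL +
                           (end.tv_nsec - start.tv_nsec);
    printf("ns per clock_gettime: %lld\n", elapsed_ns / iterations);
    return 0;
}
```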
Comment From: yoav-steinberg
Given the above table I'm not totally sure how to move forward. Options:
1. If we're on Linux on x86, warn if the system isn't configured to use tsc. This will guide the user toward what is probably the best possible performance, since tsc seems to be fastest when available. The downside is that tsc might not be the recommended (safest) option, as in the case of the Nitro Hypervisor.
2. Same as above, but instead of doing the check only on x86, do the check only if tsc is available. On non-x86 systems it isn't available, so there's no problem there. But if tsc isn't available on an x86 system, that might indicate a system issue that needs to be taken care of, and the user won't get a warning.
3. Warn the user whenever we suspect a badly configured clock source, based on profiling alone (not on system configuration). For example, if clock_gettime(CLOCK_MONOTONIC) takes more than 100ns (measured as in the micro-benchmark above), always print a warning.
@oranagra @madolson Waiting for your comments.
Comment From: madolson
I would vote to merge 2 and 3: "If we are running on x86, we are not using TSC, TSC is available, and our current clock is slow (>100ns)." The warning should be generic though, e.g. "A faster clock may be available, investigate your options". It should also be added to the list of warnings that are suppressible.
Comment From: yossigo
I think 2 makes sense. Runtime benchmarking of libc functions to produce a warning might be a step too far (and is probably too prone to false positives as well).
Comment From: oranagra
I don't like the idea of measuring the performance; it could produce false reports on systems under stress, under valgrind, or who knows what. We also can't know what's recommended or safe to use, so I think the tools in our toolkit are:
1. Check what the kernel is configured to use.
2. Check what's available (do we have a way to tell that?)
3. Keep a list of platforms that come with bad defaults, and specifically target our code at these.
I.e. I don't think we need a generic mechanism for detecting clock configuration problems; just warn about the common ones (in this case Linux / EC2).
Comment From: yoav-steinberg
> check what's available (do we have a way to tell that?)
Yes we do: on Linux, /sys/devices/system/clocksource/clocksource0/available_clocksource lists the clock sources the kernel can use (and current_clocksource shows the one in use).
Comment From: oranagra
So again, considering we don't know what's recommended or wrong to use, I think we should just try to detect the common problem we see with bad defaults on EC2.
So how about: if we see it's using xen, but tsc is also supported, then we warn?
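Such a check could look roughly like this (a sketch building on the hypothetical `getKernelClocksource` helper above; the substring match on the available list is simplistic but good enough for illustration):

```c
#include <stdio.h>
#include <string.h>

int getKernelClocksource(char *buf, size_t buflen); /* sketched earlier */

/* Check whether tsc appears in the kernel's list of usable clock sources. */
static int tscIsAvailable(void) {
    char buf[256];
    FILE *fp = fopen("/sys/devices/system/clocksource/clocksource0/"
                     "available_clocksource", "r");
    if (!fp) return 0;
    int found = fgets(buf, sizeof(buf), fp) && strstr(buf, "tsc") != NULL;
    fclose(fp);
    return found;
}

/* Warn only when the slow xen clocksource is in use although tsc exists. */
void checkClocksource(void) {
    char cur[64];
    if (getKernelClocksource(cur, sizeof(cur)) == 0 &&
        strcmp(cur, "xen") == 0 && tscIsAvailable())
    {
        /* In Redis this would go through the normal logging facility. */
        printf("WARNING: slow 'xen' clocksource detected; "
               "'tsc' is available and likely much faster.\n");
    }
}
```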
Comment From: yoav-steinberg
There's only one case where we saw really bad results: xen. But I'm a bit worried about warning exclusively on xen, because according to the kernel source the xen and kvm-clock sources have a similar implementation. I found a note here hinting that the problem with xen is some regression (not specifically related to EC2) which leads to a real syscall instead of a vDSO call, making it very slow. In theory this regression may be fixed at any moment.
We can, however, start with an exclusive warning on xen when tsc is available, and continue from there.
Comment From: oranagra
Maybe instead of testing the time it takes to call clock_gettime, we can somehow get system call counters and tell whether it results in a real system call or a vDSO call?
Comment From: yoav-steinberg
> Maybe instead of testing the time it takes to call clock_gettime, we can somehow get system call counters and tell whether it results in a real system call or a vDSO call?

I think this would be ideal. I'll check if I can find a good way to do this (@yossigo if you have any idea, let me know).
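Counting syscalls directly would need something like ptrace or perf tracepoints. A lighter-weight heuristic (sketched below purely as an illustration, not necessarily what the draft PR ends up doing) is to compare the libc clock_gettime() path against a forced real syscall via syscall(SYS_clock_gettime, ...): because the comparison is relative, load or valgrind slows both paths together, which sidesteps the false-positive concern above.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Average per-call cost, either via libc (vDSO when supported) or by
 * forcing a real syscall that always enters the kernel. */
static long long timeCalls(int force_syscall) {
    const long iterations = 100000;
    struct timespec start, end, ts;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++) {
        if (force_syscall)
            syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts);
        else
            clock_gettime(CLOCK_MONOTONIC, &ts);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    return ((end.tv_sec - start.tv_sec) * 1000000000LL +
            (end.tv_nsec - start.tv_nsec)) / iterations;
}

int main(void) {
    long long vdso_ns = timeCalls(0), syscall_ns = timeCalls(1);
    printf("libc: %lldns/call, forced syscall: %lldns/call\n",
           vdso_ns, syscall_ns);
    /* Heuristic: the vDSO path should be several times cheaper than a
     * real syscall. If it isn't, libc is almost certainly falling back
     * to the syscall itself (as with the xen clocksource). */
    if (vdso_ns * 2 > syscall_ns)
        printf("clock_gettime() does not appear to use the vDSO fast path\n");
    return 0;
}
```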
Comment From: yoav-steinberg
Draft PR: #10636