Redis Monotonic Clock Going Backwards

https://github.com/redis/redis/pull/7644 introduced Redis monotonic clock, a cross-platform unified clock source for Redis. The clock is low overhead (measured <10 ns) as it reads timestamp counters directly through rdtsc (x86) and cntvcnt_0 (ARM). When reading timestamp counters from CPUs, one assumption is that all cores within a CPU fetch the same value. In modern CPUs timestamp counters originate from a quartz crystal oscillator ticking at a constant speed.

However, the assumption that all cores read the same value is fragile. CPUs allow instruction re-ordering through speculative execution. This means that timestamp counter values can be read out of order as opposed to the executable sequence the compiler produces. This can be problematic if different threads running on different cores are trying to read and compare the timestamp counter value. The AArch64 programmer manual states [1]:

Reads of CNTPCT_EL0 can be made speculatively. This means that they can be read out 
of order regarding the program flow. This could be important in some cases, for 
example comparing timestamps. When the ordering of the counter read is important,
an ISB can be used, as the following code shows

Similarly the Intel developer manual states [2]:

The RDTSC instruction is not a serializing instruction. It does not necessarily wait until all previous instructions
have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the
read operation is performed. The following items may guide software seeking to order executions of RDTSC

Using the monotonic clock code with a simple multi-thread program such as

void *monotonicReaderThread(void *vargp) {
  monotime prev = 0;
  for (int i = 0; i < ITERATIONS; i++) {
    pthread_mutex_lock(&mu);;
    monotime curr = getMonotonicUs();
    if (curr < g_prev) {
      printf("Monotonic clock value went backwards, curr: %llu, prev: %llu \n", curr, g_prev);
      exit(1);
    }
    g_prev = curr;
    pthread_mutex_unlock(&mu);;
  }
}

We can experimentally see the monotonic clock goes backwards by up to 1 microsecond (on ARM nodes), when comparing values read from different threads.

[1] https://tc.gts3.org/cs3210/2020/spring/r/aarch64-generic-timer.pdf section 3.1 [2] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html - 4-556 Vol. 2B.

Comment From: uvletter

Seems https://github.com/redis/redis/pull/10502#issuecomment-1102882065 discussed how to choose between RDTSC and stronger RDTSCP.

PS: since Redis is mainly a single thread model, can you observe the backward monotonic clock in a single thread on ARM nodes

Comment From: YacineTALEB

PS: since Redis is mainly a single thread model, can you observe the backward monotonic clock in a single thread on ARM nodes

No, we didn't see single threaded program observing older values when fetching the monotonic clock.

I note that there's a uint64_t RM_MonotonicMicroseconds(void) method exposed to modules. Threaded modules could be exposed to this issue. If Redis becomes multi-threaded at some point, this needs to be fixed as well.

Comment From: uvletter

I'm not familiar with module, I think maybe we could doc and comment the potential risk for RM_MonotonicMicroseconds, not sure. As for Redis itself, the only parallel part is the IO processing but it's limited in a few isolated functions, to make Redis more parallel is quite hard since almost everything is not thread-safe. I guess a warning about the backward behave across threads is enough.

Comment From: madolson

Are we proposing doing anything in this issue? I know Yacine and I talked internally about maybe clarifying some documentation. I agree this is an interesting case.

If Redis becomes multi-threaded at some point, this needs to be fixed as well.

Or we just continue to only use monotonic values with the context of a single thread.

Comment From: yossigo

@madolson Do you propose a thread-safe monotonic clock variant? For now, I think documenting it is a good idea.

Comment From: madolson

@yossigo I'm not proposing one for now, since I don't see a need for it. I guess this might be a question for your module folks if they are relying on it for any reason. @tezc I suppose I'll ping you since you added the module API originally, can you validate that this doesn't impact your use case?

Comment From: tezc

@YacineTALEB @madolson

Did you see this issue on ARM?

If I'm not mistaken, rdtsc() is syncronized among all cores & cpus on modern systems (it should always produce monotonic values). At least, this was true for recent x86 systems. I thought it was also the case for ARM machines. So, I guess this is not true if you saw the issue.

Btw if rdtsc() is not syncronized, don't we have a bigger problem:

1- int start = rdtsc();
2- load x, 100;
3- load y, 200;
4- load z, 300;
5- int end = rdtsc();
6- print("took :%d", end - start);

If rdtsc() is not syncronized and OS scheduler moves Redis thread to another core after step-1, end can get a smaller value than start.

The RDTSC instruction is not a serializing instruction. It does not necessarily wait until all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the read operation is performed. The following items may guide software seeking to order executions of RDTSC

I think this warning is about single thread execution. For the above example, it says step-5 can be computed before 2,3 and 4. So, your measurement can be wrong. (but still, it shouldn't produce non-monotonic values).

Comment From: YacineTALEB

@tezc I haven't found any evidence that rdtsc() is self serializing, what we know for sure is that rdtscp() is. Theoretically your example could yield smaller values for end vs start but I guess the stars would have to align for that to happen. So far I wasn't able to reproduce the issue that we see with cntvcnt_0/arm with rdtsc/x86. It might be that the amount of out-of-ordeness on x86 is bounded (or not). Unless someone finds evidence that rdtsc is actually self-serializing, I personally consider the issue to be present on x86 as well.

Comment From: tezc

Just to be clear, I think there are two different things:

1- load cntvcnt_0 / rdtsc() is not serializing. So, it means these instructions can be executed before the previous instructions or the subsequent instructions can be executed before these. I think the one implication is the above example, it might make your measurement wrong. Maybe, you can come up with some other problematic scenario as well. This is not about multi threading and this behavior will not cause non-monotonic values (It doesn't say anything about counter going backwards). This is what I understand from the above manual snippets that you shared above.

2- The other issue is whether load cntvcnt_0 / rdtsc() can return non-monotonic values (With a single thread or multiple threads, can clock go backwards?) I thought this was not possible on recent x86/arm systems. As far as I know, these counters are "syncronized" among all CPUs in the system and they are incremented at a constant rate. If you really see clock going backwards, we probably need a fix for the single thread case as well because I feel like it's a system without a syncronized clock source. So, are you sure load cntvcnt_0 / rdtsc() returned a smaller value in your case?

Personal opinion: I like using clock_gettime() normally as it handles these details according to the platform/system but probably these instructions were faster in benchmarks.

Comment From: YacineTALEB

1) Your analysis seems correct to me.

2) The instructions read timestamp counters. In recent x86/arm platforms, the timestamp counters come from a separate quartz crystal oscillator ticking at a constant speed. All cores will read the same value, and that value should never go backwards [1].

Personal opinion: I like using clock_gettime() normally as it handles these details according to the platform/system but probably these instructions were faster in benchmarks.

Yes, the kernel implementation is defensive and puts barriers around these instructions [2, 3].

Edit: That being said, there's a cost to using clock_gettime(CLOCK_MONOTONIC_RAW). I measured around 30ns call time in both arm/x86 vs <10ns for rdtsc/cntvcnt_0.

[1] https://www.usenix.org/system/files/nsdi22-paper-najafi_1.pdf [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/include/asm/arch_timer.h?h=v5.5#n220 [3] https://github.com/torvalds/linux/blob/d6ecaa0024485effd065124fe774de2e22095f2d/arch/x86/include/asm/msr.h#L197-L223

Comment From: tezc

Let's summarize our discussion for everybody,

1- Currently, we assume counters in x86/arm systems will be syncronized among all cores & cpus. Counters won't go backwards.

2- In our x86/arm implementations, we are missing a memory barrier. Implication is our measurement might be slightly off:

1- int start = getMonotonicUs();
2- load x, 100;
3- load y, 200;
4- load z, 300;
5- int end = getMonotonicUs();
6- print("took :%d", end - start);

Without a memory barrier, step 5 can be computed before 2,3 and 4.

Comment From: OfirMos

@YacineTALEB From the ARM documentation I understand that what you've suggested is impossible: `One physical system count value broadcasts to all cores. This means that all cores share the same view of the passing of time. Consider the following example:

Device A reads the current system count and adds it to a message as a timestamp, then sends the message to Device B. When Device B receives the message, it compares the timestamp to the current system count. In this example, the system count value that is seen by Device B can never be earlier than the timestamp in the message.`

Taken from here: https://developer.arm.com/documentation/102379/0101/System-Counter?lang=en

Also the locking ensure instruction synchronization even on execution time, assuming the mutex is not relaxed.

Comment From: YacineTALEB

Device A reads the current system count and adds it to a message as a timestamp, then sends the message to Device B.
When Device B receives the message, it compares the timestamp to the current system count.
In this example, the system count value that is seen by Device B can never be earlier than the timestamp in the message.`

You are referring to the actual execution schedule, but that schedule isn't guaranteed to follow the complied code flow. CPUs are allowed to speculatively execute instructions, whether you put a mutex or not will not prevent it. However, they do track memory dependency accesses and the CPU will make sure that you observe "sequential" execution (well based on the memory access semantics of your program).

The problem with reading timestamp counters is that they're not read from memory but a register, and as the program I shared shows, we might very well get a program that is protected through mutexes re-ordered due to speculative execution.

Comment From: OfirMos

@YacineTALEB I think that this might be causing the time to go backward: static void monotonicInit_aarch64() { mono_ticksPerMicrosecond = (long)cntfrq_hz() / 1000L / 1000L; It should be: mono_ticksPerMicrosecond = (long)(cntfrq_hz() / 1000UL / 1000UL); Since in the first case when the freq is >= to 2.1GHz mono_ticksPerMicrosecond will become negative.

Comment From: YacineTALEB

@OfirMos thanks for the pointing that out. Unfortunately, what you proposed doesn't fix the issue (anyways I used a slightly modified init method in my test).

@tezc do you see the need for a self-serializing monotonic time API? I think threaded modules is the most exposed use-case. That being said, I've not seen > 1us difference when comparing timestamps in measurements, but you never know what surprises you could get.

Other than that, what do people think about the name of this API, does it still make sense to keep this name given it can be non-monotonic?

Comment From: tezc

As I said above, I don't think this API is non-monotonic. Currently, we rely on "synchronized clock" in the system and it gives us monotonic timestamps.

Regarding instructions being non-serializing, probably its impact is slightly less accurate timing (in practice, probably it does not have any impact. In the worse case, I don't expect it to be more than a few nanoseconds.)

In summary, 1- I think, as long as we have synchronized clock in the system, we are fine. (Any counter-example is a big problem though). 2- Although I don't see any problem with the current implementation in practice, my personal preference would be using clock_gettime() > RDTSCP / ISB (memory barriers before timestamp reads) > currently what we have.