The loops counter (replication_cron_loops) is not reset/reinitialized after a ping, so even after increasing the ping period, the next ping can happen at any time the condition is satisfied. For example, assume the ping period is changed from 10 to 100 while the loops counter (replication_cron_loops) value is 90: the next ping will occur after only 10 seconds instead of 100.

https://github.com/antirez/redis/blob/c1295bb9f1234f2a3dd3c1bf76eb5afcf8cf711a/src/replication.c#L3106-L3125
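
For illustration, here is a toy standalone model of that scheduling condition (a hypothetical program, simplified from the linked replicationCron() code, which runs roughly once per second):

```c
#include <stdio.h>

/* Toy model of PING scheduling in replicationCron(): a PING is sent
 * whenever loops % period == 0, and the counter is never reset. So
 * raising the period from 10 to 100 at t=90s still fires the next
 * PING at t=100s (only 10s later), not 100s after the change. */
int main(void) {
    int period = 10;
    for (int loops = 80; loops <= 200; loops++) {
        if (loops == 90) period = 100; /* CONFIG SET at t=90s */
        if (loops % period == 0)
            printf("PING at t=%ds (period=%d)\n", loops, period);
    }
    return 0;
}
```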

Comment From: antirez

But the next pings will be at the right interval, right? So this is not a problem at all :-)

Comment From: ganeshkumarganesan

Hello @antirez

We have two different deployments (A and B); each has a master and a slave with replication configured. Replication is also enabled between the two deployments (like Slave(A) -> Master(B)).

We had a case that required reversing the Redis replication between these deployments, i.e. promoting Master(B) to be the main one and reversing the replication, like Slave(B) -> Master(A).

We are trying to achieve this reverse replication using psync instead of a full sync (which is costly).

Steps:

  1. Break the replication between the two deployments (A, B) by setting "slaveof no one" on Master(B).
  2. Then reverse the replication to Slave(B) -> Master(A).

If any ping occurs on Master(A) before the reverse replication (step 2), it increases the offset and the psync fails. So we thought we could raise the repl ping period on Master(A) and finish the reverse replication before a ping can occur, which would help.

Due to the behavior described above, we can't be sure of that: there is no way to know the next ping time from the outside, so it's hard to add a pause on our end. I made a post about this on the Redis group.

Hopefully resetting the loops counter (replication_cron_loops) after each ping will make this consistent; it would make the next ping time predictable while toggling the period.
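
A minimal sketch of that suggestion (a hypothetical model, not the actual replicationCron() code):

```c
#include <stdio.h>

/* Hypothetical model of the suggested fix: count the seconds since
 * the last PING and reset the counter after sending one, so the next
 * PING always comes exactly 'period' seconds after the previous one,
 * even right after the period is reconfigured. */
int main(void) {
    int period = 10, since_last_ping = 0;
    for (int t = 80; t <= 200; t++) {
        if (t == 90) period = 100;   /* CONFIG SET at t=90s */
        if (++since_last_ping >= period) {
            printf("PING at t=%ds (period=%d)\n", t, period);
            since_last_ping = 0;     /* restart the countdown */
        }
    }
    return 0;
}
```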

Comment From: antirez

Oh, I see now. Yep, this was discussed before; I can no longer find where, so I'm pinging @soloestoy and @oranagra for help in case they can find the issue/PR. The idea is that maybe the good fix could be to remember the last offset which is not a PING, so that PINGs will not disturb the ability of the master to PSYNC with the new replica. I would love to join this issue with the other one if somebody can find it. Thanks.

Comment From: oranagra

@antirez I don't remember any such discussion (of remembering the replication offset before a PING). One solution, in my eyes, is to expose the avoid-replica-traffic config which was present in the initial version of https://github.com/antirez/redis/pull/6497 (see: https://github.com/antirez/redis/compare/7f6c263b604cd7c20a01388aa6b074c50dfdcda3..6a8dd4d1d8676d942eae5bd5319370f41584bf83). This way, the orchestrator can suspend PINGs (and active expiry, etc.), then wait for the offsets to match before breaking the replication and reversing it.

Comment From: antirez

@oranagra I'm sure we talked about it somewhere, but this should not happen in general, not just when the failover is orchestrated: it also happens when the master goes offline (but does not immediately detect that the replicas are no longer connected) and sends PINGs, while on the other side a replica is promoted by Sentinel or alike. The solution I wanted to put in place was the ability to remember two different offsets: the usual offset, and another one that tracks the latest "meaningful" offset, that is, the offset excluding PINGs. When the master turns into a replica, it uses that offset instead of the current true offset.
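
In code, a hedged sketch of the two-offsets idea could look like this (a standalone model with hypothetical helper names; the master_repl_meaningful_offset field name follows the later discussion in this thread):

```c
#include <stdio.h>

/* Model: real data advances both counters, PINGs advance only the
 * true offset, and a demoted master PSYNCs from the last meaningful
 * offset so that trailing PINGs never force a full resync. */
struct repl_state {
    long long master_repl_offset;            /* true offset */
    long long master_repl_meaningful_offset; /* last non-PING offset */
};

static void feed_data(struct repl_state *r, long long len) {
    r->master_repl_offset += len;
    r->master_repl_meaningful_offset = r->master_repl_offset;
}

static void feed_ping(struct repl_state *r, long long len) {
    r->master_repl_offset += len; /* meaningful offset untouched */
}

int main(void) {
    struct repl_state r = {1000, 1000};
    feed_data(&r, 20);  /* e.g. an INCR reaching the stream */
    feed_ping(&r, 14);  /* PINGs sent while the link was severed */
    feed_ping(&r, 14);
    printf("true=%lld, PSYNC from=%lld\n",
           r.master_repl_offset, r.master_repl_meaningful_offset);
    /* -> true=1048, PSYNC from=1020 */
    return 0;
}
```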

Comment From: antirez

What about 36099730? Ping @oranagra @soloestoy @madolson @yossigo and so forth.

Comment From: oranagra

@antirez I must say that I don't particularly like this extra complication.

Considering that this fix only matters in this case:

  - unexpected (non-orchestrated) network disconnection,
  - during a period where there's no actual traffic!! (not common IMHO for non-orchestrated failover),
  - and the old code would not have produced any inconsistency, but rather just an excessive full sync.

The thing is that the replication mechanism is already quite complicated; making it even more complicated may mean that future changes are harder to review, and it increases the chances of adding bugs.

Other than that, the code looks good.

Comment From: antirez

Hi @oranagra, thanks for your feedback. I think what this code is trying to solve is a lot more severe than you think from the POV of the conditions needed. A few considerations:

  1. Every time the master instance is severed, even momentarily and under high traffic, this will happen. For instance, in a setup that is working with traffic, if you disconnect the master's ethernet cable, wait for a failover, and connect the cable again, this happens.
  2. This also happens when you switch master/replica in an orchestrated way. Thanks to this patch, most deployments may avoid seeing problems caused by PINGs without using the rather obscure "prevent replica traffic" feature, even if that is not the main goal of this code, which is point 1.
  3. This bug was reported several times; it is very common in the user base.
  4. Avoiding full resyncs is a very big priority of the replication code.

IMHO you may also see this in a biased way, since you are inside an organization that does not see what people who run their own instances normally see: non-orchestrated failovers via Sentinel/Cluster. The "street reality" of Redis is quite different from what you can see inside a place that is completely focused on Redis operations for many customers. The better Redis Labs is at operations, the less you'll see this kind of stuff, but it is normal users' reality.

I'm totally against complexity, but considering that this is a pain from the POV of operations, because people observe unexpected full resynchronizations, and that the code is small, easy to follow, and hopefully well commented, I believe that all in all this is a positive change. But I'm waiting for other feedback as well, in order to have a clearer picture of the sentiment around this. Thanks.

Comment From: yossigo

Hi @antirez and @oranagra, this fix looks good to me and makes sense. I agree it adds more complexity to already complex code, but I think it's justifiable.

I'd consider adding a test for this case to catch future regressions (if and when the replication code gets refactored and cleaned up).

Comment From: antirez

@yossigo thanks for the feedback. I agree about adding a test; I think I should be able to easily translate this:

    #!/bin/bash
    #
    # Related to issue #7002

    # Make 6380 a replica of 6379 and have the master PING every second.
    redis-cli -p 6380 slaveof 127.0.0.1 6379
    redis-cli -p 6379 config set repl-ping-replica-period 1
    sleep 2

    # Generate some real replication traffic and verify it propagated.
    redis-cli -p 6379 incr counter
    redis-cli -p 6379 incr counter
    redis-cli -p 6379 incr counter
    sleep 1
    redis-cli -p 6379 get counter
    redis-cli -p 6380 get counter
    sleep 1

    # Promote the replica while it is blocked in DEBUG SLEEP, so the
    # old master keeps sending PINGs that the new master never sees.
    TRAN="MULTI\nDEBUG SLEEP 5\nSLAVEOF NO ONE\nEXEC\n"
    (echo -e "$TRAN" | redis-cli -p 6380) &
    clear; redis-cli -p 6379 info replication; sleep 1
    clear; redis-cli -p 6379 info replication; sleep 1
    clear; redis-cli -p 6379 info replication; sleep 1

    # Wait until the promoted instance is available again, then reverse
    # the replication: the old master should PSYNC, not full resync.
    redis-cli -p 6380 ping
    redis-cli -p 6379 slaveof 127.0.0.1 6380

This is what I used in order to reproduce the issue and validate the patch.

Comment From: antirez

I wrote a test and merged it. I guess we will be able to revert it very easily if future experience tells us to. Thanks for the feedback.

Comment From: antirez

(Leaving this open for now to get more feedback, if there is any.)

Comment From: soloestoy

It's very interesting and a little complicated...

And I have to say it introduced a new problem: if we roll back server.master_repl_offset, we should roll back the replication backlog too, or that may lead to inconsistency; see PR #7031.

BTW, there is still a scenario which may trigger a full sync:

  1. replica B -> master A <- replica C

  2. promote B to be a master, while A pings C:

    master B with offset 10; master A with offset 20 and meaningful offset 10 <- replica C with offset 20

  3. change A to be B's replica:

    master B <- replica A <- replica C

    Now replica A resets its offset to the meaningful offset 10, can continue from master B, and disconnects C; but C doesn't have the right meaningful offset and needs a full resync.

Maybe we could consider PING as out-of-band data outside the replication stream, but that would break PSYNC2 compatibility with old-version Redis; just for your information.

Comment From: antirez

@soloestoy thanks for your feedback! I'm not sure, however, why the pull request is needed. Re-reading my code, I understand that I made the error of not updating the backlog_idx variable as well, but why do we need to memmove memory around in order to trim the final N bytes of a circular buffer? By updating the offsets we should always be able to remove the latest bytes from the buffer without issues. Or am I missing something?

Comment From: antirez

To be more clear, I have the following variables:

    size = 20
    histlen = 20
    idx = 7
    offset = 2500 (replication offset of the first byte in our backlog)
    master_repl_offset = 2519

This is our backlog:
    +--------------------+
    |CR,PING,INCR,INCR,IN|
    +--------------------+

So here we just wrote "PING"; the next command will be written at index 7, just after the ping.
The first byte we have in our backlog, that is, the "," after the PING, represents replication offset 2500.

Now, if we trim by just decreasing the history length and adjusting the idx, with a delta of 5 bytes compared to the effective offset, we have:

    histlen = histlen - 5 = 15
    idx = idx - 5 = 2
    offset = not changed, our first byte is the same
    master_repl_offset = 2514

Our backlog, after we just adjust the numbers, is:

    +--------------------+
    |CR*****,INCR,INCR,IN|
    +--------------------+

Where with "*****" I'm indicating the empty space that will be used by the next writes.

I can't see where the problem is here.
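
In code, the counter-only trim amounts to something like this (a hedged standalone sketch with a hypothetical struct, not the actual patch; it ignores the delta-larger-than-backlog edge case discussed later):

```c
#include <stdio.h>

/* Model of trimming the final 'delta' bytes of the circular backlog
 * by adjusting the counters only: no memmove, the freed region is
 * simply reused by subsequent writes. */
struct backlog {
    long long size, histlen, idx; /* buffer geometry */
    long long off;                /* offset of the oldest byte */
    long long master_repl_offset; /* offset of the newest byte */
};

static void trim_tail(struct backlog *b, long long delta) {
    b->histlen -= delta;
    b->idx = (b->idx + b->size - delta) % b->size; /* stays positive */
    b->master_repl_offset -= delta;
    /* b->off is unchanged: the oldest byte is still the same. */
}

int main(void) {
    struct backlog b = {20, 20, 7, 2500, 2519}; /* example above */
    trim_tail(&b, 5);
    printf("histlen=%lld idx=%lld off=%lld master_repl_offset=%lld\n",
           b.histlen, b.idx, b.off, b.master_repl_offset);
    /* -> histlen=15 idx=2 off=2500 master_repl_offset=2514 */
    return 0;
}
```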

Comment From: soloestoy

Hi @antirez, if I understand backlog_idx right, it's not only the head but also the tail of the backlog when the backlog is full, and feedReplicationBacklog will append data after backlog_idx, so I think we need to trim the buffer.

Comment From: antirez

@soloestoy the tail of the backlog is only calculated using the history length and the idx, AFAIK, unless we assume in the code, for simplicity, that if the history length is less than the backlog length we can cut corners and simplify the code by assuming all the backlog is used. I don't think I did this, but I can verify. Did you check my example above, btw?

Comment From: antirez

For instance, the offset calculation does not take into account the total length of the backlog. The code that feeds the backlog is:

    /* Set the offset of the first byte we have in the backlog. */
    server.repl_backlog_off = server.master_repl_offset -
                              server.repl_backlog_histlen + 1;

In the above example, if we write 10 bytes (after we adjusted for the effective index), we'll have:

Before writing:

    size = 20
    histlen = 15
    idx = 2
    offset = 2500
    master_repl_offset = 2514

    +--------------------+
    |CR*****,INCR,INCR,IN|
    +--------------------+

Then we write 10 bytes: "MYNEWDATA!"

    +--------------------+
    |CRMYNEWDATA!,INCR,IN|
    +--------------------+

    master_repl_offset is now 2524
    new histlen is now 20 (it is capped at 20, the size of the buffer)
    new idx is now 12
    offset = master_repl_offset - histlen + 1, so 2524 - 20 + 1 = 2505
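
As a sanity check, this toy program (hypothetical, using the same numbers) replays the trim and the 10-byte write and recomputes the offsets:

```c
#include <stdio.h>

/* Replays the worked example: trim 5 bytes by adjusting counters,
 * then write 10 bytes and recompute the offset of the first byte as
 * repl_backlog_off = master_repl_offset - histlen + 1. */
int main(void) {
    long long size = 20, histlen = 20, idx = 7;
    long long master_repl_offset = 2519;

    /* Trim the final 5 bytes (the ",PING"). */
    histlen -= 5;
    idx = (idx + size - 5) % size;
    master_repl_offset -= 5;

    /* Write 10 bytes ("MYNEWDATA!"). */
    long long buflen = 10;
    master_repl_offset += buflen;
    idx = (idx + buflen) % size;
    histlen += buflen;
    if (histlen > size) histlen = size; /* capped at the size */
    long long off = master_repl_offset - histlen + 1;

    printf("master_repl_offset=%lld histlen=%lld idx=%lld off=%lld\n",
           master_repl_offset, histlen, idx, off);
    /* -> master_repl_offset=2524 histlen=20 idx=12 off=2505 */
    return 0;
}
```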

Comment From: soloestoy

```c
/* Feed the slave 'c' with the replication backlog starting from the
 * specified 'offset' up to the end of the backlog. */
long long addReplyReplicationBacklog(client *c, long long offset) {
    long long j, skip, len;
    /* Point j to the oldest byte, that is actually our
     * server.repl_backlog_off byte. */
    j = (server.repl_backlog_idx +
        (server.repl_backlog_size-server.repl_backlog_histlen)) %
        server.repl_backlog_size;
    serverLog(LL_DEBUG, "[PSYNC] Index of first byte: %lld", j);
```

I overlooked the code above; server.repl_backlog_idx is not the beginning of the data in the replication backlog. You are right, just resetting server.repl_backlog_idx and server.repl_backlog_histlen is OK :)

Comment From: antirez

Oh, I imagined, @soloestoy, that this code could confuse the reader because it adds to the index, but this is a trick in order to do only additions: since the numbers are unsigned quantities, we can just use additions and modulo arithmetic to avoid problems with overflows and incorrect results. Thanks for having the care to check this. And thanks to your kind review, I saw that I indeed forgot to update the index in the original code!
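
For the record, a tiny standalone demonstration of the addition-only form (a hypothetical toy, using the numbers from the example above):

```c
#include <stdio.h>

/* With unsigned types, (idx - histlen) could underflow whenever
 * histlen > idx, while (idx + (size - histlen)) % size only ever
 * adds quantities that are <= size, so the result is always the
 * correct index of the oldest byte. */
int main(void) {
    unsigned long long size = 20, histlen = 15, idx = 2;
    unsigned long long j = (idx + (size - histlen)) % size;
    printf("first byte of the backlog is at index %llu\n", j);
    /* -> index 7, the "," holding offset 2500 in the example */
    return 0;
}
```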

Comment From: antirez

@soloestoy fixed in 97f1c808

Comment From: soloestoy

Hi @antirez, I think there are still two problems:

  1. if the delta is larger than server.repl_backlog_size we need to reset it to 0, or server.repl_backlog_idx may become negative.

  2. server.master_repl_offset needs to be reset to server.master_repl_meaningful_offset, or it will lead to data inconsistency: if we reboot the new replica after the switchover, the replica will use the old server.master_repl_offset to reconnect to the master, but that's the wrong offset.

See #7032
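
A hedged standalone sketch of the two fixes combined (hypothetical struct and helper, not the literal patch in #7032):

```c
#include <stdio.h>

/* Model: on demotion, rewind both the backlog counters and
 * master_repl_offset itself to the meaningful offset, handling the
 * case where there is more to trim than the backlog still holds. */
struct state {
    long long master_repl_offset, master_repl_meaningful_offset;
    long long backlog_size, backlog_histlen, backlog_idx;
};

static void rewind_to_meaningful(struct state *s) {
    long long delta = s->master_repl_offset -
                      s->master_repl_meaningful_offset;
    if (delta <= 0) return;
    if (delta >= s->backlog_histlen) {
        /* Everything left is PINGs: empty the backlog instead of
         * letting backlog_idx go negative. */
        s->backlog_histlen = 0;
        s->backlog_idx = 0;
    } else {
        s->backlog_histlen -= delta;
        s->backlog_idx = (s->backlog_idx + s->backlog_size - delta) %
                         s->backlog_size;
    }
    /* Rewind the true offset too, so a rebooted replica reconnects
     * to the master with the right offset. */
    s->master_repl_offset = s->master_repl_meaningful_offset;
}

int main(void) {
    struct state s = {2519, 2514, 20, 20, 7}; /* numbers from above */
    rewind_to_meaningful(&s);
    printf("offset=%lld histlen=%lld idx=%lld\n",
           s.master_repl_offset, s.backlog_histlen, s.backlog_idx);
    /* -> offset=2514 histlen=15 idx=2 */
    return 0;
}
```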

Comment From: antirez

@soloestoy absolutely correct! Thanks, I merged your PR.