This issue serves the purpose of tracking sporadic PSYNC2 test errors and investigating them, either to fix test timing issues that yield false positives, or to look closer and search for actual bugs. I'll update the top message with the currently active issues. Things are found by grepping the log files of my CI setup, so I have a very large history of test runs and I can check whether a given failure has been happening since the start or only recently.

Problem 1: Instance #3 x variable is inconsistent

This looks like it is happening only recently:

$ grep 'variable is inconsistent' -R .
./run_174881.html:Instance #3 x variable is inconsistent
./run_175366.html:Instance #1 x variable is inconsistent
./run_174283.html:Instance #0 x variable is inconsistent
./run_174579.html:Instance #0 x variable is inconsistent
./run_171368.html:Instance #1 x variable is inconsistent
./run_171175.html:Instance #2 x variable is inconsistent
./run_171709.html:Instance #1 x variable is inconsistent

It's very rare; checking the first occurrence of it, in run_171175, shows that the first commit to exhibit this issue is the following: 365316aa

The recent commits older than that one are the following:

4447ddc8b Keep track of meaningful replication offset in replicas too
36ee294e8 PSYNC2: reset backlog_idx and master_repl_offset correctly
97f1c808c PSYNC2: fix backlog_idx when adjusting for meaningful offset
57fa355e5 PSYNC2: meaningful offset implemented.

So this may be a result of the implementation of the meaningful offset feature leading to some problem we do not yet understand.

Note that the following commit was added after the first instance of this failure:

4447ddc8b Keep track of meaningful replication offset in replicas too

So if this failure is caused by the meaningful offset at all, it is due to the part of the implementation where the master switches to replica and trims its offset.
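
To make the suspicion above more concrete, here is a minimal, hypothetical sketch of what trimming the offset on demotion could look like. This is only an illustration of the idea, not the actual implementation: the field names come from this thread and the commit messages above, while the struct and the function are made up.

/* Hypothetical sketch only: when a master is turned into a replica, roll the
 * replication offset (and the backlog bookkeeping) back from
 * master_repl_offset to master_repl_meaningful_offset, i.e. drop the trailing
 * PING bytes so a later PSYNC does not claim bytes the new master never saw. */
struct repl_state {
    long long master_repl_offset;            /* total bytes fed to replicas     */
    long long master_repl_meaningful_offset; /* same, but excluding final PINGs */
    long long backlog_histlen;               /* bytes currently in the backlog  */
    long long backlog_idx;                   /* write position in the backlog   */
    long long backlog_size;                  /* circular buffer size            */
};

void trim_meaningless_tail(struct repl_state *r) {
    long long trim = r->master_repl_offset - r->master_repl_meaningful_offset;
    if (trim <= 0) return;                   /* nothing to drop */
    r->master_repl_offset = r->master_repl_meaningful_offset;
    if (r->backlog_histlen >= trim) r->backlog_histlen -= trim;
    r->backlog_idx = (r->backlog_idx - (trim % r->backlog_size) +
                      r->backlog_size) % r->backlog_size;
}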

Comment From: oranagra

this is a can of worms: it takes a long time to reproduce something, and each time i get a different type of error, so it's hard to concentrate on one. because it involves running many tests in parallel, i had to make the tests less chatty and only print lots of context on failure. anyway... these are all reproduced on the latest unstable b89e6d74f

Problem 1: x variable is inconsistent

Instance #2 x variable is inconsistent

out of the two places that print this, i see (at least once) that it's the second one. and i see this in the log file of the one with the bad counter:

1504:S 24 May 2020 18:06:25.098 * Caching the disconnected master state.
** it didn't trim anything from its backlog, and i see:
** server.master_repl_offset, master_repl_meaningful_offset, master->reploff,
** and read_reploff are all set to 779454
** (779454 already excludes the final PINGs: 0 bytes difference)
1504:S 24 May 2020 18:06:25.436 * Connecting to MASTER 127.0.0.1:21458
1504:S 24 May 2020 18:06:25.436 * MASTER <-> REPLICA sync started
1504:S 24 May 2020 18:06:25.436 * Non blocking connect for SYNC fired the event.
1504:S 24 May 2020 18:06:25.436 * Master replied to PING, replication can continue...
1504:S 24 May 2020 18:06:25.437 * Trying a partial resynchronization (request 5d8b801d1651ce4e0efcc470f7673269fd47e844:779455).
1504:S 24 May 2020 18:06:25.437 * Successful partial resynchronization with master.
1504:S 24 May 2020 18:06:25.437 * MASTER <-> REPLICA sync: Master accepted a Partial Resynchronization.
1504:S 24 May 2020 18:06:25.437 # == CRITICAL == This replica is sending an error to its master: 'unknown command `$1`, with args beginning with: ' after processing the command '<unknown>'
1504:S 24 May 2020 18:06:25.437 # Latest backlog is: '"\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n"'
1504:S 24 May 2020 18:06:25.437 # == CRITICAL == This replica is sending an error to its master: 'unknown command `x`, with args beginning with: ' after processing the command '<unknown>'
1504:S 24 May 2020 18:06:25.437 # Latest backlog is: '"*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n$1\r\n"'
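
as a side note on those two errors, here is a toy demonstration (not Redis's actual protocol reader) of why consuming that INCR stream from a position shifted by a few bytes yields exactly the bogus command names `$1` and `x` reported above:

#include <stdio.h>
#include <string.h>

/* Print the first CRLF-terminated line a reader would see when it starts
 * consuming the stream `skip` bytes away from a command boundary. */
static void first_token_at(const char *stream, size_t skip) {
    const char *p = stream + skip;
    char tok[16] = {0};
    size_t n = strcspn(p, "\r\n");
    if (n >= sizeof(tok)) n = sizeof(tok) - 1;
    memcpy(tok, p, n);
    printf("skip=%2zu -> first line seen: \"%s\"%s\n", skip, tok,
           tok[0] == '*' ? "  (valid multibulk header)"
                         : "  (looks like an unknown command)");
}

int main(void) {
    const char *stream = "*2\r\n$4\r\nincr\r\n$1\r\nx\r\n*2\r\n$4\r\nincr\r\n$1\r\nx\r\n";
    first_token_at(stream, 0);   /* "*2" -> fine                                 */
    first_token_at(stream, 14);  /* "$1" -> the command name in the first error  */
    first_token_at(stream, 18);  /* "x"  -> the command name in the second error */
    return 0;
}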

Problem 2: Replicas and master offsets were unable to match exactly

some replicas are 14 bytes (one ping) more advanced, and remain this way for a very long time (despite the ping interval being reduced)

Master ID is 2
Log file: ./tests/tmp/server.32249.9/stdout
0: sync_full: 0
0: id1      : 59cdc2180dc8881ddca5461009d981ffc7ba030c:466511
0: id2      : 70501e4903225a6e8f25d184eac394415f87dc05:294212
0: backlog  : firstbyte=1 len=466511
---
Log file: ./tests/tmp/server.32249.7/stdout
1: sync_full: 0
1: id1      : 59cdc2180dc8881ddca5461009d981ffc7ba030c:466525
1: id2      : 70501e4903225a6e8f25d184eac394415f87dc05:294212
1: backlog  : firstbyte=15 len=466511
---
Log file: ./tests/tmp/server.32249.5/stdout
2: sync_full: 2
2: id1      : 59cdc2180dc8881ddca5461009d981ffc7ba030c:466511
2: id2      : 70501e4903225a6e8f25d184eac394415f87dc05:294212
2: backlog  : firstbyte=1 len=466511
---
Log file: ./tests/tmp/server.32249.3/stdout
3: sync_full: 0
3: id1      : 59cdc2180dc8881ddca5461009d981ffc7ba030c:466511
3: id2      : 70501e4903225a6e8f25d184eac394415f87dc05:294212
3: backlog  : firstbyte=15 len=466497
---
Log file: ./tests/tmp/server.32249.1/stdout
4: sync_full: 2
4: id1      : 59cdc2180dc8881ddca5461009d981ffc7ba030c:466525
4: id2      : 70501e4903225a6e8f25d184eac394415f87dc05:294212
4: backlog  : firstbyte=1 len=466525
---
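
for reference, the 14-byte gap between the two offsets above (466525 - 466511) matches exactly one PING as it appears on the replication stream; a quick check:

#include <stdio.h>
#include <string.h>

int main(void) {
    const char ping[] = "*1\r\n$4\r\nPING\r\n";    /* PING as a RESP multibulk */
    printf("one replicated PING = %zu bytes\n", strlen(ping));   /* prints 14 */
    printf("offset gap          = %d bytes\n", 466525 - 466511); /* also 14   */
    return 0;
}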

Problem 3: too many full syncs

note that these actually come from a different (less chaotic) test

[err]: pings at the end of replication stream are ignored for psync in tests/integration/psync2.tcl
Expected '5' to be equal to '4' (context: type eval line 64 cmd {assert_equal [status $master sync_full] 4} proc ::test)

or

[err]: pings at the end of replication stream are ignored for psync in tests/integration/psync2.tcl
Expected '1' to be equal to '0' (context: type eval line 74 cmd {assert_equal [status $replica3 sync_full] 0} proc ::test)

i still don't have any actual contribution to this effort other than reporting what i reproduced. i will keep looking at the logs and try to figure it out tomorrow, or add more logging and try to reproduce again. hopefully there's just one bug that causes all these issues.

Comment From: oranagra

@antirez i've started investigating Problem 3 first, since it happens in a test that doesn't have any random replication topology in it, so it is easy to compare the logs of a successful run to those of a failing one. and then hope that maybe the problem it exposes will explain the other failures too.

i ran that test alone (commented out all the other tests and server spin-ups of psync2.tcl) and ran it with --loop --clients 64. what i see is that we get a replication request from a replica that doesn't even belong to the set that the test uses. i.e. the 5 servers of that test are using ports 21902, 21898, 21877, 21870, 21872, and yet one of them gets: Replica 127.0.0.1:21866 asks for synchronization

so it seems that the solution of using find_available_port is not enough, and we need to allocate a dedicated port range for each client. while doing so it would be nice to also expose the range as a command line argument, so that we can safely run two parallel test runs in different consoles (see the sketch below).
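
a hypothetical sketch of the per-client range arithmetic (C is used only for illustration, independently of the test framework's language; DEFAULT_BASE and PORTS_PER_CLIENT are made-up values):

#include <stdio.h>

#define DEFAULT_BASE     21000  /* assumed start of the port space          */
#define PORTS_PER_CLIENT 100    /* assumed width of each client's own range */

static int range_start(int base, int client_id) {
    return base + client_id * PORTS_PER_CLIENT;
}

int main(void) {
    int base = DEFAULT_BASE;    /* would come from a command line argument,
                                   so parallel runs can use disjoint spaces */
    for (int client = 0; client < 4; client++)
        printf("client %d probes ports %d..%d only\n", client,
               range_start(base, client),
               range_start(base, client) + PORTS_PER_CLIENT - 1);
    return 0;
}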

i'll get to work on that and then continue to investigate other issues. sadly, this doesn't seem to explain problems 1 and 2.

Comment From: antirez

@oranagra thanks, this explains many things about certain random failures. However, unfortunately, I think this is not related to problem 1, which is the most serious one and magically emerged the moment I added the meaningful offset. Moreover this problem has a history of happening in my CI setup, where each run is always executed without any parallelism, and it didn't happen before that feature anyway.

Comment From: oranagra

@antirez do you run your CI with --clients 1? i don't think so... that port collision can happen even without --loop, as long as you don't use --clients 1.

anyway, i agree it's not the source of problem 1, which is the scariest one. but we do know that it started with the meaningful offset, and we do know it still exists in the latest commit in unstable.

Comment From: antirez

@oranagra You are right. What about, instead of using ranges for each client, using the "test server" to track every port that was ever tested, so that the same port is never allocated twice? I could do this change easily, I think.

Comment From: oranagra

@antirez don't the test clients allocate ports too (server.py calls find_available_port)?

Comment From: antirez

@oranagra yep, but I mean: the test client asks the test server for a port to use, tests it, and if it works, uses it, and so forth. The test server makes sure to always reply with a port that was never used before.

Comment From: oranagra

@antirez so what if we run the test in a loop for days? eventually we'll run out of ports, even if they are no longer in use?

Comment From: oranagra

@antirez so i understand you're digging into 1, and we can maybe hope that it'll also explain 2. what shall i do meanwhile? shall i proceed with 3? do you have anything against the ranges suggestion?

considering that 3 is not a real issue, we can treat it with lower priority and i can concentrate on 2, but 3 can also add noise and make it difficult for us to focus.

Comment From: antirez

@oranagra apparently everybody's idea at this point is to revert this change, but I'm still digging into 1 this morning. Let's see if I can track it down. I've now re-run the stress test with the meaningful offset commented out, just to be sure, and I can no longer trigger the problem. I'm now examining the logs to see if I can figure it out, even if I'm not sure the logs contain enough information to understand what's going on.

Comment From: oranagra

@antirez so in parallel you're doing both: making sure the revert solves all the problems, and also trying to understand the problem in the unreverted code?

i expect that even after the revert you'll still encounter Problem 3, which you can ignore, but please keep an eye on the other problems, especially problem 2: we wanna know if the revert fixes it or not.

Comment From: antirez

@oranagra yes, that's what I'm doing, and sure, I'll also continue with the other issues. The problem is that in order to trigger problem 1 in a timely fashion, I had to comment out the other tests, otherwise it takes a long time to happen. But later I'll proceed in the other direction, commenting out everything but what causes problems 2 and 3.

Comment From: oranagra

Problem 2 is in the same test, so you'll hit it (unless you also disabled the check of matching offsets).

Comment From: antirez

@oranagra yes, I've disabled it. I commented out almost everything except the things needed for the test to run and the x variable test. Btw, I think I understood what the problem is, jumping on the chat to tell you.

Comment From: oranagra

for the record, by joining forces we eventually found the cause of Problems 1 and 2: apparently it was b407590cee4988a1cde3afe7ec9b8ff0243462ef, which actually re-broke #7306, since there was still a chance that we don't disconnect a replica soon enough and still feed it with data that will corrupt its offset. apparently there are cases where both serverCron and replicationUnsetMaster are called from the same event loop, in which case replicationCron will reconnect to the master before beforeSleep has cleaned the async free list.
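
to restate that ordering as a timeline, here is an illustrative sketch with stub functions standing in for the real Redis code paths (this shows only the problematic ordering described above, not the actual implementation):

#include <stdio.h>

static void replicationUnsetMaster(void) { puts("replicationUnsetMaster"); }
static void replicationCron(void)        { puts("replicationCron: reconnect to master"); }
static void serverCron(void)             { replicationCron(); }
static void beforeSleep(void)            { puts("beforeSleep: async free list cleaned"); }

int main(void) {
    /* Both of these can end up in the same event loop iteration...           */
    replicationUnsetMaster();
    serverCron();       /* ...so replicationCron() reconnects to the master... */
    beforeSleep();      /* ...before beforeSleep() has cleaned the async free
                           list, and a replica can still be fed data that
                           corrupts its offset.                                */
    return 0;
}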

Comment From: antirez

We fixed these problems. Closing.