I believe there are some problems that can cause data corruption with PSYNC2 in the latest 4.0 branch. Given that slaves blindly copy their master's input into their own replication backlog over successive read syscalls, it's possible that with large commands or small TCP buffers a partial command ends up in this buffer. If the master fails before propagating the entire command to a slave, the slave will never execute the partial command (since the client is invalidated) but will still copy it into its replication backlog, and on PSYNC2 it may relay those invalid bytes to its own slaves, corrupting their backlogs and possibly other valid commands that follow the failover. Simple command boundaries aren't sufficient to capture this either, because in the case of a MULTI/EXEC block, if the master successfully propagates a subset of the commands but not the EXEC, then the transaction in the backlog becomes corrupt and could corrupt other slaves that consume this data.
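For illustration only (hypothetical wire bytes, not from a real capture): the RESP encoding of `SET key somevalue` and a truncated prefix of it look like this; the truncated prefix is what would end up copied into the slave's backlog without ever being executed:

```
*3\r\n$3\r\nSET\r\n$3\r\nkey\r\n$9\r\nsomevalue\r\n   <- complete command
*3\r\n$3\r\nSET\r\n$3\r\nkey\r\n$9\r\nsom             <- partial write: never executed,
                                                         but copied to the backlog
```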
Comment From: antirez
Hello Kevin, did you check this experimentally, or is your supposition based on reading the code/design? I'm asking since I believe at some point in the design we checked that this was not a problem for certain reasons. But maybe I don't remember correctly, and this was considered under different circumstances compared to what you are referring to. Thanks.
Comment From: kevinmcgehee
I was able to trigger it experimentally in a Redis Cluster by artificially limiting the master to sending one byte at a time (in writeToClient) and then terminating the process midway through the transfer of a SET. The first slave was failed over to and the second slave performed a PSYNC2. The second slave's value of the key was corrupt (i.e. the remaining value bytes were incorrect since the original master died midway through transferring the value, "1\r\n$4\r\nPING\r\n1\r\n$4\r\nPING\r\n1\r\n$4\r\nPING\r\n1\r\n$4\r\nPING\r\n1\r\n$4\r\nPING\r\n1\r\n$").
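For reference, the kind of artificial throttling described above can be obtained with a tiny debugging hack in writeToClient() (a sketch against the 4.0 code layout, only useful for reproducing the issue):

```c
/* networking.c, writeToClient(): debugging hack to force 1-byte writes, so
 * that killing the master mid-command reliably leaves a partial command
 * in the slave's input (and therefore in its replication backlog). */
if (c->bufpos > 0) {
    /* Original line: nwritten = write(fd, c->buf+c->sentlen, c->bufpos-c->sentlen); */
    nwritten = write(fd, c->buf+c->sentlen, 1);
    /* ... rest of the function unchanged ... */
}
```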
Comment From: antirez
Looking at the code, the problem seems to be there. I'm not sure why I recall that we checked this issue explicitly; I'll investigate with more care... However, if there is such an issue in practice, the fix apparently could be to always remember the execution offset of the slave, that is, the offset at which the last command was applied (so the offset is not updated while the master client is in MULTI state), and after a failover, to trim the backlog to that offset.
Comment From: antirez
A much cruder fix could be, given that in most cases commands are fully transferred, to just check that the non-processed backlog is zero and that the MULTI flag is not set, and otherwise flush the replication backlog and require a full sync.
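A rough sketch of that check (hypothetical placement; the idea is just the condition itself):

```c
/* Crude fix: only keep a cached master (and thus allow a later PSYNC) if the
 * master link left behind no partially received command and no open MULTI
 * block; otherwise give up partial resync and force a full sync. */
if (sdslen(c->querybuf) == 0 && !(c->flags & CLIENT_MULTI)) {
    replicationCacheMaster(c);      /* PSYNC can be attempted later */
} else {
    freeClient(c);                  /* next synchronization will be a full one */
}
```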
Comment From: antirez
Note: without artificially reducing the write to 1 byte as you did, normally even if the master process crashes, the kernel socket output buffers still hold the write, so the cases where in practice the non-processed replication buffer will be non-zero are the ones where there was a big sudden event on the master... like a power outage, a network partition, or similar. The MULTI case is different: during big transactions this could be triggered easily in practice. So the crude fix could kinda work in most cases, but will force a resync in certain cases. What do you think about the proper fix vs the quick and dirty fix? On the other hand, the proper fix is not very complex.
Comment From: kevinmcgehee
Sudden events are relatively common in situations that require unplanned failover. For the proper fix suggested above, if only the new master trimmed its backlog and all other slaves received the partial command, then none would be able to PSYNC, correct? Allowing this would require all other slaves to back out the partial command (the cached master would be invalidated) and try to PSYNC with the lower offset, but that would hurt the normal connection-drop use case where the partial command is actually valid and will be completed after the PSYNC. To optimize it 100%, a slave would have to try to PSYNC with both offsets (completed and incomplete) and have the master choose which one to use. I don't think it's worth the effort to do this.
The main difference, however, between your two suggestions is whether the trimming of the backlog is implemented (instead of just dropping it entirely when this case is detected). Trimming can be done just by modifying the histlen and idx, so it seems worth the effort in the case that there are slaves that can still do a PSYNC after failover. Keeping track of the processed command offset should just require looking at the querybuf size and the MULTI state in each iteration of processInputBuffer. Therefore I like the proper fix suggested above.
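A minimal sketch of such a trim, assuming the usual circular-backlog fields and a hypothetical function name (wrap-around edge cases are glossed over):

```c
/* Drop the last 'unapplied' bytes from the replication backlog after a
 * failover, so that the backlog ends exactly at the last applied command. */
void replBacklogTrimTail(long long unapplied) {
    if (unapplied <= 0 || unapplied > server.repl_backlog_histlen) return;
    server.repl_backlog_histlen -= unapplied;
    server.repl_backlog_idx =
        (server.repl_backlog_idx + server.repl_backlog_size - unapplied) %
        server.repl_backlog_size;
    server.master_repl_offset -= unapplied;
}
```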
Comment From: kevinmcgehee
I believe a related error is in the propagation of BRPOPLPUSH commands, which are propagated separately as an RPOP and LPUSH. It's possible that only one of these two is propagated, violating the atomicity of the command. Perhaps it should be propagated as RPOPLPUSH?
Comment From: antirez
@kevinmcgehee yep, or... to wrap it into a MULTI/EXEC block. Btw, this problem with RPOP/LPUSH you discovered is apparently a more general problem AFAIK, unrelated to the PSYNC2 bug in question here, so things are starting to get interesting... Thanks.
Comment From: kevinmcgehee
Agree it's not strictly related since it can happen on normal failover/AOF writes. I can file another bug if you want.
Comment From: antirez
@kevinmcgehee sorry for the delay here, I was working on a radix tree implementation to fix a well known Redis Cluster performance problem, but this was a top priority in my mind and I will fix it in the next days. Given that you have such a deep understanding of Redis, I wonder if you are near the bay area, since on the 2nd of June there will be (immediately after Redis Conf) a Redis Developers Day where we meet in order to understand how to improve Redis internals. Would you like to join? The event is in the center of San Francisco. Thanks.
Comment From: yangsiran
Hi guys, a related bug in PSYNC2 may happen when a slave requests a full resync from its master which is itself a slave of another node. The middle node may start dumping the RDB and appending commands to the reply buffer in the middle of receiving a full command, just as described above. This may cause the first command propagated to the end node to become invalid, or even worse, cause a protocol error so that the end node frees the link and requests another full resync, until the resync happens to start at a proper offset.
Comment From: antirez
Thanks @yangsiran, I'll make sure to test for the scenario you described as well.
Comment From: antirez
Dear @kevinmcgehee and @yangsiran, and also @oranagra who participated in this discussion via private email exchange: I was thinking that perhaps the problem should be fixed (at least conceptually) where it originated, that is, what about appending to the backlog only after the relevant part of the command was processed?
Basically the code we have now in readQueryFromClient():
```c
    if (c->flags & CLIENT_MASTER) {
        c->reploff += nread;
        replicationFeedSlavesFromMasterStream(server.slaves,
            c->querybuf+qblen,nread);
    }
```
Would go into processInputBuffer(), conceptually, and be called only after the command was processed, but before the buffer of the processed command gets freed. However, I do not think this can be applied exactly as I'm describing it, because processInputBuffer() consumes the protocol as it arrives, so the command stored as protocol is no longer available when a command gets executed. This was just the concept; to create a real implementation we could:
- Consider the buffer as only partially applied to the replication backlog (by keeping two offsets, as we said) until the command is really executed.
- Increment our replication offset only when the buffer is applied, and not as data is read from the master like it happens now.
- When creating the cached master remove the query buffer from the master client.
This way all the requests, and the slave-to-master offset reporting, always talk about offsets of fully applied commands, and this should simplify both the mental model and the implementation, making it simpler to reason about. The part that could get a bit more complex is the backlog, which needs to have this concept of applied and not-yet-applied parts of the log.
Comment From: yangsiran
I am thinking maybe a separate buffer can be kept for the unexecuted part of the command (in the master client of the slave), and we only increase the replication offset and propagate it after the command is processed. That should leave the backlog logic untouched. The only possible problem is that it needs to keep every command of a multi/exec block in the buffer; after all, they can't really be appended to the backlog until the whole transaction was executed. That may lead the buffer to become large in some cases...?
Comment From: oranagra
@antirez one reason to consider fixing it "where it originated" and overcoming the current complications in processInputBuffer is that those complications also currently cause some performance loss (or at the very least an inefficiency), so fixing it there will solve both problems. Currently, processing a command or a list of commands results in many memmoves, each time trimming the beginning of the query buffer. Preferably, all the pending commands in the input would be processed without a single memmove, and the memmove would only move the leftover at the end which doesn't form a complete command. Then we'll need to keep an offset of the positions of the input buffer that were already executed, and this solves the requirement for the problem discussed in this thread.
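Schematically, the inefficiency and the suggested direction look like this (not the actual Redis code; `c->qb_pos` is a hypothetical field used only for illustration):

```c
/* Today (schematic): after each parsed command the consumed prefix is cut
 * from the query buffer, which memmoves everything that follows it. */
sdsrange(c->querybuf, consumed, -1);        /* O(buffer length) per command */

/* Suggested: while looping over pipelined commands just advance an offset,
 * and trim the fully processed prefix once when the loop exits. */
c->qb_pos += consumed;                      /* O(1) per command */
/* ...after the processing loop: */
sdsrange(c->querybuf, c->qb_pos, -1);       /* single memmove of the leftover */
c->qb_pos = 0;
```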
Comment From: yangsiran
@oranagra On the slave side this is a good solution. For the top-level master, it seems to be a trade-off between memory spent on the input buffer and CPU cycles spent on memmove.
Comment From: oranagra
IIRC, unless a blocking command is executed, processInputBuffer (for a single client) won't exit until it has finished processing all the commands in the input buffer. In that respect the memory reclamation by sdsrange during that loop isn't really relevant; eventually, before exiting the loop, all the memory will be reclaimed. Furthermore, sdsrange doesn't actually free any memory, it only memmoves and updates the string length. This will be severe only when we have a client with very long pipelines (which is mainly true for slaves), but I want to stress that we are talking about many, many memmoves, each of which trims one command; to trim it, it memmoves the entire query buffer (megabytes), and then moves these megabytes again when trimming the next command...
Comment From: yangsiran
Makes perfect sense for the long pipeline case. Memory reclamation seems to only be useful for a single long command with many small arguments coming one at a time... Anyway, choosing the right time to trim the input buffer and keeping the corresponding offset is a reasonable solution for both the efficiency and the problems in this thread. In case of a multi/exec block on a slave, we can keep the unfinished transaction in the input buffer until it is executed, then move it into the backlog.
Comment From: antirez
While I agree with @oranagra that the current command processing is suboptimal for pipelines, I'm not sure that the two problems are related from the POV of the fix: sure, it's the same portion of the code, but I have the feeling that we should start by fixing the issue here in a clean way from the POV of the semantics, and later check the resulting code to see how we can avoid the extra processing.
The reason why I'm not sure the two fixes are related is that processInputBuffer() will consume input regardless of whether it is executed or not, for example in the case of MULTI, so it is not a condition where we can say "if the function consumed the buffer we can consider it applied to the state of the database". For this exact reason, the otherwise very good idea of using two buffers proposed by @yangsiran (acceptable even if there is some overhead because, anyway, it's going to happen only for the master client and only in chained replication) cannot be easily implemented: we could check the query buffer length before calling processInputBuffer() and after, in order to understand how many bytes to retain in our other buffer, but that does not work because of the same MULTI problem...
Comment From: antirez
@kevinmcgehee in your last message you wrote:
For the proper fix suggested above, if only the new master trimmed its backlog and all other slaves received the partial command, then none would be able to PSYNC, correct?
Yes, this is correct, but I was thinking about the slaves always using, for PSYNC, the offset of the last applied command. Most of the time the difference would be a couple of bytes, so I'm not sure why this should impact negatively the normal case of losing the connection: the master will be cached discarding the query buffer and its MULTI state, so that at this point the cached master should also match the state of the last-applied-command offset. At this point the PSYNC is performed and we'll get all the relevant bytes. It is unlikely that if we ask for offset - something we'll miss the ability to PSYNC while offset would succeed, with the exception of clients wrapping long pipelines into MULTI/EXEC blocks... (in that case the difference in bytes could be non-trivial and the backlog on the master might not be enough).
Or am I missing something?
Comment From: antirez
On a side note, because this is more complex to fix than expected, I considered starting with the easy fix for 4.0 (I would like to move from RC to GA, in theory...): just don't cache the master if the MULTI flag is set or if there is a pending query buffer, and later find a proper fix for the performance and this problem as well. However, note that this is not possible because of the problem with RDB generation on full resync as reported by @yangsiran: in that case, we must know the portion of the data that is yet to be processed, because we need to append it to the slave output buffer immediately after the RDB generation. This is just to say that, because of this other problem, I know I need to find a proper solution right now... and since this is a blocking feature for 4.0, this is at this moment the first priority I have.
Comment From: yangsiran
For the two-buffer idea: checking the query buffer length before calling processInputBuffer() and after may not be enough for deciding which bytes to retain. After all, consuming the buffer doesn't equal executing commands. The correct way I can think of is to feed the other buffer in processInlineBuffer()/processMultibulkBuffer() while we are trimming it, and decide to propagate it, if not in MULTI state, at the end of the loop in processInputBuffer() where we know a full command has been executed.
I wrote up a patch hastily just for the reference of this solution and things that may need to be considered in the final solution. It should pass the existing tests, but not sure it is 100% right for now. https://github.com/antirez/redis/pull/3931
Comment From: antirez
Hello @yangsiran, thank you for the help and the PR. Yep, your reasoning looks correct to me. Another thing we could do is the following, which is a potentially simpler implementation of the same idea:
- We add two items in the client structure, only used in this context:
  - `pending_querybuf`, which is a string representing the buffer currently not applied.
  - `read_reploff`, which is the offset that just represents all the bytes read, like `reploff` is doing currently in the implementation we have.
- The old `reploff`, in the new implementation, represents the applied offset.
- When we receive data from the socket, in `readQueryFromClient()`, we add data to both `querybuf` and `pending_querybuf`. We also increment `read_reploff` as we do currently.
- Inside `processInputBuffer()`, when a command is called, we update the `reploff` field if, after the command gets dispatched (by the `call()` function), the MULTI flag is not set. Basically updating SHOULD be just doing: `c->reploff = c->read_reploff - sdslen(c->querybuf)`.
- In `readQueryFromClient` we no longer call `replicationFeedSlavesFromMasterStream` where we are calling it now. Instead we move the call after the call to `processInputBuffer()` in the following way: before calling `processInputBuffer()` we store the current applied offset `c->reploff`. Then after calling `processInputBuffer()` we take the difference with the new offset `c->reploff`. If the quantity is non-zero, we call `replicationFeedSlavesFromMasterStream()` using `c->pending_querybuf` as argument with the difference as length, and we also trim our `c->pending_querybuf` of the same length (a compressed sketch of these steps follows this list).
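A compressed sketch of those steps (using the field names proposed above; surrounding code and error handling are omitted):

```c
/* readQueryFromClient(): read from the socket, but do not feed sub-slaves yet. */
if (c->flags & CLIENT_MASTER) {
    c->read_reploff += nread;                              /* raw bytes read */
    c->pending_querybuf = sdscatlen(c->pending_querybuf,
                                    c->querybuf+qblen, nread);
}

/* processInputBuffer(): after each command is dispatched, outside MULTI. */
if ((c->flags & CLIENT_MASTER) && !(c->flags & CLIENT_MULTI))
    c->reploff = c->read_reploff - sdslen(c->querybuf);    /* applied offset */

/* Back in readQueryFromClient(), after processInputBuffer() returns
 * ('prev_reploff' was saved just before the call). */
long long applied = c->reploff - prev_reploff;
if (applied > 0) {
    replicationFeedSlavesFromMasterStream(server.slaves,
                                          c->pending_querybuf, applied);
    sdsrange(c->pending_querybuf, applied, -1);            /* drop the fed prefix */
}
```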
AFAIK this should give us the benefit of fixing the problem, let's say, at the right level, instead of updating our offsets in the wrong way and then remembering that we need to trim stuff.
When caching the master, at this point, we just need to discard the unapplied buffer, and set the read_reploff to the same value as reploff, and our master should be consistent.
Please could you help me verify that this makes sense as an equivalent solution? It's similar to your proposal, but I find it simpler to implement if we decouple the fix from the internals of the buffer processing. Thanks!
Comment From: antirez
P.S. I'm asking @yangsiran but of course other reviews of my proposal will be highly appreciated :-) Before I start implementing this.
Comment From: antirez
@yangsiran additional hints about your PR:
- You are basically accumulating the processed buffer, and finally sending it to the backlog as commands are executed. That looks good, should be equivalent to the proposal I wrote above, since it is just a different implementation of your double-buffer idea.
- However `c->reploff` in the PR goes forward as data is read. AFAIK this should require some fix when the master is cached, and in general it is used in different places.
So what I'm trying to do with the additional steps and fields in my proposal is to totally isolate the reading part from the processing part, so that basically both the backlog and c->reploff in the master client will always represent what we applied. However, we still need to track the read offset in order to do the offset calculations about what to send to the backlog.
Comment From: yangsiran
Your proposal looks correct to me. Scattering code around processInlineBuffer()/processMultibulkBuffer() is not a very good idea... Using c->read_reploff - sdslen(c->querybuf) should be able to give us the correct applied offset.
c->reploff in my PR is updated when commands get propagated. It should be able to reflect the applied offset.
Comment From: antirez
Thanks @yangsiran. Well, your implementation has the merit of looking, at least in the diff, simpler than what I described above, honestly, even if it's less decoupled from the internals, so there are merits in your PR as well AFAIK, especially now that you say that c->reploff is correctly updated to reflect the applied offset. I'll re-check again.
Anyway, I'll write in the commit message of my implementation that we co-authored it if I pick the one that I proposed, since your PR and description of the dual buffer were a very interesting path to reach a solution. I had thought about it but did not realize that with minimal changes we could fix the MULTI issue. Thanks, I'll keep you posted.
Comment From: antirez
Hello, I just pushed a fix. A review would be very appreciated; the code looks pretty small after all, considering how nasty the issue was. Thank you for any further help, since you already gave me a lot.
Comment From: yangsiran
Hi @antirez, thank you for addressing the problem so quickly. The code seems consistent with the description above, with just one point that confused me. A slave may already have set the CLIENT_MULTI flag and queued commands when it has received the first few commands of a transaction. In the current solution, a partial resync at this point will start from the first command of the transaction (MULTI). Without calling discardTransaction(), might this make the MULTI fail and commands get queued twice?
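A sketch of the kind of cleanup being suggested here (the exact placement inside the master-caching function is an assumption):

```c
/* When caching the master for a later PSYNC, any half-received transaction
 * must be discarded: the partial resync will replay it starting from the
 * MULTI, so keeping the queued commands would queue them twice. */
if (c->flags & CLIENT_MULTI) discardTransaction(c);
```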
Comment From: antirez
Hello @yangsiran, you are perfectly right, thanks for reviewing! Based on what you said, I was trying to look at other conditions that the client may have that we should clean. Apparently there are no other similar issues. I don't like that we do not have tests for this in the current replication units...
Btw, this also does not fix the problem with BRPOPLPUSH identified by @kevinmcgehee, but I'm taking this as the last item since it's conceptually a separate problem that was there before PSYNC2 as well (however it will be fixed as part of this issue).
I'm pushing a fix for the problem you identified. Thanks.
Comment From: yangsiran
Agree! Test cases are needed for such subtle issues as the ones in this thread.
Comment From: kevinmcgehee
I reviewed the committed changes and they look to solve the issues to me. I think there are a few side effects here:
1) All slaves will keep a duplicate buffer of not-yet-applied commands that has no bound. For large commands we expect the querybuf to be bounded and disconnect the client, but for long transaction sequences we will build up a large buffer until it is executed. The only bound here is that the entire transaction must have fit in the master's client output buffer.
2) The entire duplicate buffer is only really useful when there are subslaves with a larger COB or the PSYNC backlog is larger. Otherwise it's not very memory-efficient because some bytes will eventually go nowhere on application.
I think it's fine for the slaves to clear their querybuf and try to PSYNC from the last applied boundary. For long transactions, it may break some cases where PSYNC would succeed today (i.e. the connection drops and the entire transaction is not in the PSYNC backlog, but the remaining bytes not consumed are) but this will likely not happen often.
Comment From: yangsiran
I think the bottom line here is that there is still not a very good solution for the sub-slaves problem... So extra memory has to be spent to keep the unprocessed part of a large transaction. We could consider delaying the full resync until a safe point, but in that case the time of a full resync is unbounded.
Comment From: antirez
Hello, I agree with everything @kevinmcgehee said, technically speaking (and thank you for the review), but I'm less concerned from the POV of memory utilization because of the following: even if we don't account for the additional buffer introduced by the patch, there is the MULTI state in client->mstate. For each command the multi state is very wasteful:
```c
typedef struct multiCmd {
    robj **argv;
    int argc;
    struct redisCommand *cmd;
} multiCmd;
```
Every argument is split into objects and there is the pointer to the command, so basically the additional query buffer is just a constant factor added to what we already had: O(N) memory utilization while accumulating the transaction. However, the protocol itself is a lot cheaper to store compared to splitting the arguments into objects, so the total memory utilization here is going to be, if we consider the cost of keeping the MULTI state, something like 30% more than it used to be (for the common case of small arguments).
Because we had to pay a lot of overhead anyway, this addition is not too bad AFAIK.
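As a rough, purely illustrative estimate (all sizes here are assumptions for a 64-bit build, not measurements): a queued `SET k1 v1` is 29 bytes of RESP protocol (`*3\r\n$3\r\nSET\r\n$2\r\nk1\r\n$2\r\nv1\r\n`), while its multiCmd entry needs a three-pointer argv array (24 bytes) plus three robj headers (about 16 bytes each) plus three separately allocated sds strings, easily exceeding 100 bytes in total, so retaining an extra copy of the 29 protocol bytes adds roughly the ~30% mentioned above.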
What Kevin says about certain edge cases where PSYNC could help before and now it cannot is also a valid point in my opinion, and I agree that it's rare. However, an interesting counter-point is that it is conceptually much cleaner to request offsets at command boundaries; it reduces the complexity of the system when reasoning about it. Random things that happen now:
- When the master is cached we can flush the multi state and buffer! Otherwise when a slave disconnects almost at the end of a large transaction, it is forced to retain a lot of memory used for no reason. So we also have some good effect on the memory side, not just bad.
- In theory we could easily implement automatic bug detection in PSYNC by checking that the backlog requested offset is always preceded by `\n` (a minimal sketch follows this list).
- The slave selection is more correct in Sentinel / Redis Cluster. Picking a slave with a higher offset which is not applied is meaningless. Now we compare the amount of data received by slaves in a more sane way.
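A minimal sketch of that sanity check (pretending, for clarity, that the backlog can be indexed by absolute replication offset via a hypothetical replBacklogByteAt() helper; the real circular indexing is omitted):

```c
/* Once slaves only ever acknowledge applied offsets, any offset requested by
 * PSYNC should sit on a command boundary, i.e. the byte immediately before it
 * must be the '\n' that terminated the previous command. */
int psyncOffsetLooksSane(long long psync_offset) {
    if (psync_offset == server.repl_backlog_off) return 1; /* backlog start */
    return replBacklogByteAt(psync_offset-1) == '\n';
}
```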
Ok... I'm selling you the story from the best POV; I just mean it is a reasonable price to pay for conceptual simplicity and the resolution of a number of bugs.
Btw, probably we fixed yet another bug here, which is a plus: the PSYNC after the RDB save/restart/load had a similar bug AFAIK.
Comment From: antirez
P.S. the collaboration we had in this issue is also something very worthwhile for me that I hope to repeat soon, and maybe not for bugs but for improvements, hopefully :-)
Comment From: antirez
About the BRPOPLPUSH propagation issue not happening in a transactional way from the POV of AOF/replication: @kevinmcgehee asked why it was not replicated in a vanilla way. Well, since the fix is going to be, instead, wrapping the generated commands into a MULTI/EXEC block, I want to explain why there is this odd design inside Redis. The problem is things like this:
Client 1: BRPOPLPUSH list_a list_b 0
Client 2: BRPOPLPUSH list_b list_c 0
Client 3: RPUSH list_a foobar
The result of the above commands is all computed at the moment client 3 executes its command; however, in order to propagate the final result, since there could be chains like the above, it is much simpler to compute each part as it happens, splitting the pop and push into separate steps. So, most of the time, what happens could actually be captured by a single command invocation, but in practical terms it was very hard, so a different implementation was used.
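To make the chain concrete (an illustration assembled from the split propagation described in this thread, not a captured replication stream): when client 3's RPUSH lands, the blocked clients are served in cascade, and the replication link would see each hop as its own pop/push pair, roughly:

```
RPUSH list_a foobar        (the original push)
RPOP  list_a               (client 1 served: element leaves list_a...)
LPUSH list_b foobar        (...and lands on list_b)
RPOP  list_b               (client 2 served: element leaves list_b...)
LPUSH list_c foobar        (...and lands on list_c)
```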
Comment From: antirez
After the recent changes, for the first time I'm seeing this failing in the Redis CI:
*** [err]: Test replication partial resync: ok psync (diskless: yes, reconnect: 1) in tests/integration/replication-psync.tcl
Expected '0e806f6f9e348c0b11238fd3d16bdc752e841b4b' to be equal to '1cde6e4aa82cb08cf10d45dd0669f86dc93617a7'
Cleanup: may take some time... OK
child process exited abnormally
Maybe something is not correct about the patch...
Comment From: yangsiran
This test case passes on my machines. I think the behavior of the replication-psync test is machine-specific... as before.
Comment From: antirez
Yes, it usually passes on my machine as well, but from time to time the CI reports an error. This error was not reported in the past. This is usually a bad sign... that is, probably there is a bug in my patch that only happens in specific circumstances.
Comment From: antirez
p.s. I restarted the CI environment and made a proper cleanup, to see if I catch more of those failures to analyze them here: http://ci.redis.io/
Comment From: antirez
An update on this issue: I'm able to experimentally verify that there is a bug. I create three instances replicating in a chain: 9000 --> 9001 --> 9002. Redis benchmark increments counters with a big pipeline; at the same time the connection with the master is broken (using client kill type master) on the 9001 and 9002 instances. At the end of the test, the benchmark process is killed, and DEBUG DIGEST is used in order to detect inconsistencies. After trying a few times I just got:
9000: 0da65783e718b379bde301956df6e02350cb5097
9001: 2d0cc96e59086645c0ff7d250061b56381ce972f
9002: 0da65783e718b379bde301956df6e02350cb5097
Note the irony of the chained slave receiving the correct data... while the one in the middle did not. Now the problem is that the way I can reproduce it does not make for a trivial analysis, but I'll try to narrow it down in the next days.
This is the Tcl script to reproduce:
```tcl
#!/usr/bin/env tclsh

proc cycle {} {
    exec redis-cli -p 9000 config set repl-backlog-size 5gb
    exec redis-cli -p 9001 config set repl-backlog-size 5gb
    exec redis-cli -p 9002 config set repl-backlog-size 5gb
    set bench_pid [exec redis-benchmark -P 32 -p 9000 -n 10000000 -r 1000 incr __rand_int__ > /dev/null &]
    puts "Benchmark pid is $bench_pid"
    set start [clock milliseconds]
    while 1 {
        set elapsed [expr {[clock milliseconds]-$start}]
        if {$elapsed > 10000} break
        if {rand() < .05} {
            puts "Disconnecting 9001"
            exec redis-cli -p 9001 client kill type master
        }
        if {rand() < .05} {
            puts "Disconnecting 9002"
            exec redis-cli -p 9002 client kill type master
        }
        after 100
    }
    exec kill -9 $bench_pid
    for {set j 0} {$j < 5} {incr j} {
        set id1 [exec redis-cli -p 9000 debug digest]
        set id2 [exec redis-cli -p 9001 debug digest]
        set id3 [exec redis-cli -p 9002 debug digest]
        puts "$id1\n$id2\n$id3"
        puts "---"
        if {$id1 eq $id2 && $id2 eq $id3} break
        after 1000
    }
}
cycle
```
Comment From: yangsiran
I reproduced the test in my environment, and the result looks similar. The difference in my case is an additional `*2\r\n$4\r\nincr` key in some slave nodes. I think the reason is that some arguments of the command had already been parsed into c->argv when disconnecting from the master. So when receiving the same command again from the correct offset, the c->argv array may have redundant arguments at the head.
Simply resetting the client makes Redis behave correctly under this test, in my environment.
Comment From: antirez
Yes @yangsiran, that is exactly what was happening, at least this is the first bug I found, but I'm not sure it is the only one: basically the function that caches the master had a missing resetClient() call, so the client was not in the correct state in order to accept a new clean command. However, I'm not sure if this also explains another random failure I saw in the PSYNC2 test, where sometimes the PSYNC failed completely (so the assertion in the test checking that there were exactly 4 full resyncs fails). I was trying to trigger a protocol error that disconnects the master, but so far no luck; however it may just depend on the exact stream of commands passed, and the one created by redis-benchmark in the script above is always the same length, so perhaps it's just that the normal workload creates more complex problems. At least after applying the fix for the argv problem I'm not seeing other errors, but at this point I wonder if there is more odd state in the client structure that I'm not initializing correctly. Checking field-by-field... if you have any idea I would love to get feedback. Thanks!
Comment From: antirez
The only additional thing that I found so far, and that predates this bug, is that the output list in the master client (client->reply + client->bufpos) is not freed, so the slave may send invalid data and half-commands to the master after a reconnection. Not sure how bad this could get, but it surely does not look great.
Comment From: antirez
I can trigger the bug after just a few cycles even when using a very small backlog of 10MB for each instance, by dropping the -P 32 flag from the benchmark (so a lot less traffic gets produced). At this point I'm rewriting the script to be an integration test for Redis. On the other hand, after deploying the fix for the recent bug, the CI system looks happy, so probably apart from the mistake I made with the missing resetClient() call, the patch is otherwise good. 🤞
Comment From: yangsiran
Some notes about cases where PSYNC fails completely: when I turned on VERBOSE logging in Redis, the test above actually shows two kinds of oddness. The first is the key `*2\r\n$4\r\nincr` (notice that its length is the same as every other key, 12). The other is a Protocol error (expected $ but got something else) from client, which caused PSYNC to fail completely.
The first case corresponds to moments where c->bulklen has already been parsed. The second case corresponds to moments where c->bulklen has not been parsed yet and $ is expected.
This may explain the occasional failures of the PSYNC test...
Comment From: antirez
@yangsiran great! Exactly, I wanted a proof of that... and now I have it. I failed to realize that the protocol errors are only logged when the logging level is VERBOSE. Especially if it is the master creating the protocol error, perhaps Redis should log it more clearly, including some excerpt of the protocol?
Comment From: antirez
p.s. now we have the test above as part of the integration tests, so hopefully this will protect against the same kind of errors.
Comment From: yangsiran
Protocol errors from a non-user client are subtle to identify, and often related to a program error. With the current default settings you have to suspect that there is a protocol error (and locate some part of the code) before you can actually observe one. 😂
Comment From: carlwgeorge
The regression test for this seems to have an issue, please take a look at #4360.
Comment From: madolson
This seems fixed for all intents and purposes; there is another issue with more info.