Bug Description

Missing CR-LF at the end of bulk string send from master to slave during FULLSYNC.

How to reproduce this bug (in a much easier way)

  1. Start redis instances in sentinel mode(1 MASTER + 1 SLAVE + 3 SENTINELs).
  2. Write some(around 10) data to Redis A using SET command.
  3. Use any method you like to listen to what message the MASTER receives.
  4. Send command SENTINEL FAILOVER to one of the SENTINELs.
  5. Capture the message the MASTER(now as a slave server after the failover) after +FULLRESYNC, which is a bulk string with content starts with REDIS0009
  6. You will find that this message does not follow RESP.
  7. Servers will perform as normal though.

Expected behavior

A normal bulk string ends with CR-LF.

As we can see in the RESP protocol documentation (https://redis.io/topics/protocol), a Bulk String is encoded in this way:

  • Starts with A "$" byte followed by the number of bytes composing the string (a prefixed length), terminated by CRLF.
  • The actual string data.
  • A final CRLF.

And here's the problem. In the message, which is a Bulk String, the final CR-LF is missing!

How did I find this bug

  1. Start 2 redis instances in sentinel mode(1 MASTER + 1 SLAVE + 3 SENTINELs) as Redis A and Redis B, on the same machine(Machine A).
  2. Write some(around 10) data to Redis A using SET command.
  3. Start redis-shake in sync mode to sync data from redis A to redis B, on Machine B. (redis-shake is a third-party tool that can sync data from one redis server to another. To the source server, it shows itself as a redis SLAVE server. Link: https://hub.fastgit.org/alibaba/RedisShake)
  4. Use Wireshark to capture packages send from Redis A to redis-shake(default port:9320).
  5. After redis-shake enters INCRSYNC mode, send command SENTINEL FAILOVER to one of the SENTINELs in Redis A.
  6. Wait until redis-shake crushes, then analyze packages sent to redis-shake.
  7. We shall find a package contaning message that does NOT follow RESP protocol, which caused the crash of redis-shake, as it could not parse something that is not RESP. The message can be found at the end of this issue.

Possible Cause

As I found in the latest codes, the only code that procudes string "REDIS0009" is at rdb.c:1234.

snprintf(magic,sizeof(magic),"REDIS%04d",RDB_VERSION);

So I guess, when redis starts full-syncing with a slave server, it sends its RDB copy, wrapped as a Bulk String, but unfortunately forgot to add CR-LF to its end.

The servers are going well with it though, otherwise this bug should have been found early enough, but its possibley because the slave does not check for the CR-LF ending at the same time. When the server provided directly by redis can normally peroform, a third-party server, such as redis-shake, will suffer from this problem, as they try to decode all Bulk Strings strictly by the RESP protocol.

Message Captured

redis-issue-captured.txt

Comment From: oranagra

@Crystala the data between the master and the replica is not strictly following the RESP protocol. The resp protocol if for client applications. It may look like RESP, but that's just an internal design choice. Note that there are other parts in the replication traffic that don't follow RESP, like empty newlines that are used as a form of a keep-alive. Even if we wanted to "fix" this and add a spare newline to make it look more like RESP, we could be facing some backwards compatibility issues. If you experienced some crash with RedisShake, i suggest you report it there. @soloestoy FYI.

Comment From: soloestoy

@Crystala thanks for your report, and as @oranagra said RESP is for client apps, about replication redis has its internal protocol especially in FULLRESYNC stage, so it's not a bug.

In another words, a third-party server (tool) like RedisShake should be compatible with the internal protocol, the third-party server developers should also know about the special protocol, and we (RedisShake) do handle it.

  1. After redis-shake enters INCRSYNC mode, send command SENTINEL FAILOVER to one of the SENTINELs in Redis A.

Back to the issue, after SENTINEL FAILOVER sentinel will send replicaof ip port to master, and then master kill all replicas connection including RedisShake, RedisShake would crash when disconnect with master. You can check logs, if it's not the scenario I said above or you wanna discuss more, you can report your problem to RedisShake project.

Comment From: Crystala

@soloestoy Thanks for your reply. Well, during my usage of RedisShake, this tool did disconnected with MASTER, but according to its log it did successfully managed to get reconnected, and the reason for its crushing was trying to parse the internal protocal you mentioned with RESP, which unfortunately is actually not RESP. Now I think I would try to fix it first, as I may not have the time to wait for RedisShake's update, while I am in need of a redis syncing tool that can handle a FAILOVER event. Is there anywhere I can get to know about redis' internal protocal (other than reading the code)? Couldn't find it here or at redis.io :(