When extracting an rdb from Redis using diskless replication, I get an unexpected \n after the EOF marker. To reproduce the issue, start a Redis server and run the following:

    telnet localhost 6379
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    REPLCONF capa eof
    +OK
    REPLCONF rdb-only 1
    +OK
    SYNC


    $EOF:4e0367cbc2c3a44ed03a8bbb470d46c381a8236b
    <rdb snippet>
    4e0367cbc2c3a44ed03a8bbb470d46c381a8236b


The issue started after https://github.com/redis/redis/pull/10092. After investigating, I found a race condition between the server receiving the REPLCONF ACK from the application that extracts the rdb and the replication cron.

Before that change, once Redis finished transferring the rdb, it updated the replica's replstate to SLAVE_STATE_ONLINE. Now it skips this step if the client is marked with CLIENT_REPL_RDBONLY. As a result, the replication cron keeps sending \n, which prevents the application from detecting the EOF marker correctly.
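
For context, this is approximately the keepalive loop in replicationCron() (paraphrased from the 7.0 sources, so treat it as a sketch rather than an exact quote). Once the diskless transfer ends, server.rdb_child_type is reset, so an rdb-only client left in SLAVE_STATE_WAIT_BGSAVE_END keeps matching the pre-sync test and keeps receiving \n:

    listIter li;
    listNode *ln;
    listRewind(server.slaves, &li);
    while ((ln = listNext(&li))) {
        client *slave = ln->value;
        /* A replica is "pre-sync" while it waits for the RDB: either the
         * BGSAVE has not started yet, or it is saving to disk rather than
         * streaming to the socket. */
        int is_presync =
            (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START ||
            (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_END &&
             server.rdb_child_type != RDB_CHILD_TYPE_SOCKET));
        /* After the diskless transfer, rdb_child_type is no longer
         * RDB_CHILD_TYPE_SOCKET, so a client stuck in WAIT_BGSAVE_END
         * matches this forever and keeps getting newlines. */
        if (is_presync)
            connWrite(slave->conn, "\n", 1);
    }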

My suggestions:

1. Set slave->replstate to SLAVE_STATE_ONLINE regardless of CLIENT_REPL_RDBONLY.
2. In the replication cron, send \n only if:
   * slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START; or
   * slave->replstate == SLAVE_STATE_WAIT_BGSAVE_END and server.rdb_child_type != RDB_CHILD_TYPE_SOCKET and !(slave->flags & CLIENT_REPL_RDBONLY) (see the sketch below).
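
A minimal sketch of option 2, assuming the change lands in the is_presync computation shown above:

    /* Option 2 (sketch): never send keepalive newlines to rdb-only
     * clients that are already past the WAIT_BGSAVE_START stage. */
    int is_presync =
        (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START ||
        (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_END &&
         server.rdb_child_type != RDB_CHILD_TYPE_SOCKET &&
         !(slave->flags & CLIENT_REPL_RDBONLY)));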

IMHO, we should go with the second option. I can submit a PR with the fix.

Comment From: moticless

Looks like the stuffing of newlines is not the real issue here. It isn't supposed to be: it is just a keep-alive that the client should ignore. Indeed, the timing changed following #10092, and the extra newlines are a side effect.

The problem is that the server never closes the connection once it is done. I compared the latest version against an earlier one and captured tcpdump output. In the earlier version, the connection is closed with a FIN right after the transmission completes, without any need for a REPLCONF ACK. I also patched the latest version to remove the newline transmission entirely, and still the connection was left open.

Comment From: moticless

I reviewed #10092. The following section of code that handles rdb-only replicas moved from replicaPutOnline() to replicaStartCommandStream():

    if (slave->flags & CLIENT_REPL_RDBONLY) {
        serverLog(LL_NOTICE,
            "Close the connection with replica %s as RDB transfer is complete",
            replicationGetSlaveName(slave));
        freeClientAsync(slave);
        return;
    }

Until now, the server closed the connection right after the transmission completed. Now it waits for replconf ack <offset> and only then closes the connection. I guess that your client doesn't send replconf ack, which leaves the connection open. Since there is no real logic that handles replconf ack <whatever-offset> in the rdb-only case, I think we should revert the code mentioned above.
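
For concreteness, the revert would put the quoted block back at the top of replicaPutOnline(), so rdb-only clients are freed as soon as the transfer completes. The surrounding function shape here is an assumption based on the 7.0 sources, not a definitive patch:

    /* Sketch of the proposed revert: close rdb-only connections inside
     * replicaPutOnline() again, before the replica is marked online.
     * The remainder of the function is elided. */
    void replicaPutOnline(client *slave) {
        if (slave->flags & CLIENT_REPL_RDBONLY) {
            serverLog(LL_NOTICE,
                "Close the connection with replica %s as RDB transfer is complete",
                replicationGetSlaveName(slave));
            freeClientAsync(slave);
            return;
        }
        slave->replstate = SLAVE_STATE_ONLINE;
        slave->repl_ack_time = server.unixtime; /* Prevent false timeout. */
        refreshGoodSlavesCount();
        /* ... remainder unchanged ... */
    }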

Comment From: valentinogeron

@moticless - the race condition still exists even if the client sends replconf ack once it gets the full rdb. IMHO there is a benefit in waiting for an ack from the client that confirms it got the full rdb.

Comment From: moticless

Unlike the PSYNC command, in the case of the SYNC command we don't expect to receive replconf ack. I think it would be better to revert the code section above (if you can verify that reverting it resolves your issue, that would be great).

Comment From: oranagra

Trying to recap:

1. The problem isn't just that newlines are still being sent; it's the fact that the receiver might not find the EOF marker, since it searches for it only at the end of each payload it reads from the socket (see the sketch below).
2. The problem only affects CLIENT_REPL_RDBONLY, since only in this case replicaPutOnline doesn't change replstate to SLAVE_STATE_ONLINE (while server.rdb_child_type does change).
3. This is a bug in Redis 7.0.
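
To illustrate point 1, here is a hypothetical receiver loop of the kind that breaks: it checks for the 40-byte EOF marker only at the tail of each read, so a keepalive \n arriving after the marker (or a marker split across two reads) defeats the check. This is an illustrative sketch, not code from any particular client:

    #include <string.h>
    #include <unistd.h>

    /* Hypothetical receiver: returns once a payload read from `fd` ends
     * exactly with the 40-char EOF marker. A trailing '\n' pushed by the
     * replication cron lands after the marker, so this loop never exits. */
    static void read_rdb_until_eof(int fd, const char marker[40]) {
        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0) {
            /* Fragile: only matches if the marker is the very tail of
             * this particular read(). */
            if (n >= 40 && memcmp(buf + n - 40, marker, 40) == 0)
                return; /* EOF marker found; transfer complete. */
            /* ... otherwise append buf[0..n) to the rdb file ... */
        }
    }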

Regarding the 3 solutions that were proposed, I think I'd rather close the connection ASAP. If that would for some reason imply the data can't be fully received on the other end, then it would mean that calling refreshGoodSlavesCount is dangerous.

I would also say that re-ordering the lines in replicaPutOnline so that we update slave->replstate before we bail out on CLIENT_REPL_RDBONLY is correct (the replica is technically online, even if it isn't a real replica), and we just need to avoid reaching refreshGoodSlavesCount. But since we're now going to close the connection, there's no point in updating slave->replstate (a sketch of that re-ordering follows for reference).
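
A rough sketch of that re-ordering, assuming replicaPutOnline() is shaped like the 7.0 sources; it is moot given the plan to close the connection instead, but it shows what the alternative would look like:

    /* Sketch of the alternative (not the chosen fix): mark the client
     * online first, then bail out for rdb-only clients before they can
     * be counted by refreshGoodSlavesCount(). */
    void replicaPutOnline(client *slave) {
        slave->replstate = SLAVE_STATE_ONLINE;
        slave->repl_ack_time = server.unixtime; /* Prevent false timeout. */
        if (slave->flags & CLIENT_REPL_RDBONLY)
            return; /* Not a real replica; skip the good-slaves count. */
        refreshGoodSlavesCount();
        /* ... remainder unchanged ... */
    }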

@valentinogeron can you make that change and run all the tests to validate we didn't miss anything?

Comment From: valentinogeron

@oranagra - yes, I'm on it