Describe the bug
A master Redis instance replicating to a replica will not successfully perform a graceful shutdown (such as one triggered by SIGTERM) in the presence of TCP congestion or CPU overload, losing a significant portion of the data between the master's and the replica's replication offsets. This appears to be due to incomplete socket handling.
To reproduce
To reproduce, I ran experiments with a low-powered computer as the replica, a larger computer running the master, and a hacked-up script. All computers were running Linux. The replica computer is slower than the master, ensuring it can't keep up under heavy load. I tested with Redis 6.2.6 and b7afac6bc, which include the recent replication improvements (#9166), with the same results in this area.
```python
#!/usr/bin/env python3
import time
import redis

r = redis.Redis(host='master', port=6379)
r2 = redis.Redis(host='slave', port=6379)

pipe = r.pipeline(transaction=False)
last_i = 0  # last value known to have been written successfully
for i in range(1, 5000000):
    pipe.set('mykey', i)
    if i % 100 == 0:
        try:
            pipe.execute()
            last_i = i
        except redis.exceptions.ConnectionError:
            # The master went away; give the replica a moment, then compare.
            print('i:', i, 'last i:', last_i)
            time.sleep(1)
            slave_i = int(r2.get('mykey'))
            print('s:', slave_i)
            print('nice!' if slave_i in (i, last_i) else 'ate my data :(')
            exit()
    if i % 50000 == 0:
        print('m: {} s: {}'.format(int(r.get('mykey')), int(r2.get('mykey'))))
pipe.execute()
```
Example output where 76607 writes were lost after several seconds of high load followed by a graceful master shutdown:
m: 50000 s: 38899
m: 100000 s: 76823
m: 150000 s: 115703
m: 200000 s: 155760
m: 250000 s: 198549
m: 300000 s: 242704
m: 350000 s: 285037
m: 400000 s: 325550
m: 450000 s: 362876
m: 500000 s: 406120
(User manually issued a SIGINT to the server here.)
i: 523000 last i: 522900
s: 446293
ate my data :(
Expected behavior
As when writing to disk (if enabled) at shutdown, make a thorough effort to complete writing to replicas before shutting down. I did not find a detailed description in the documentation of what should be expected, but I interpret this section of the replication docs to imply that writing to a replica should be equivalent to writing to disk, which I assume is merely slow under high load rather than losing data:
It is possible to use replication to avoid the cost of having the master writing the full dataset to disk: a typical technique involves configuring your master redis.conf to avoid persisting to disk at all, then connect a replica configured to save from time to time, or with AOF enabled. However this setup must be handled with care, since a restarting master will start with an empty dataset: if the replica tries to synchronize with it, the replica will be emptied as well.
Additional information
I've found two root causes:
- writeToClient has logic to "send as much as possible if the client is a slave", but if it gets an error it aborts sending. Errors include EAGAIN and EWOULDBLOCK (TCP congestion). Graceful shutdown therefore relies on the entire remaining replication data fitting in the socket buffer; otherwise it gives up, causing data loss.
- After sending, the sockets are promptly closed and the process exits. However, the replica still sends periodic ACK messages, and the socket has not been shut down or otherwise prepared for this case; if an ACK is received before all data has been sent, it aborts draining the socket buffer (the part that did fit), and the connection is closed with an RST.
I find that at low load (no congestion, and the socket buffer can be drained in under one second) and with high luck (the last ACK was sent just before the shutdown began, giving the master a full second to finish draining the socket buffer), the graceful disconnection of replicas works and the master closes the TCP connection with a FIN.
At high load, presumably high enough to consistently fill the socket buffer, you get hit by both of these issues.
I can confirm the first issue by setting the slave socket to blocking mode in flushSlavesOutputBuffers before calling writeToClient. This completes writing the application buffer to the socket buffer, since the write blocks on congestion instead of returning EAGAIN and aborting.
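For illustration only, here is a small standalone Python sketch (not Redis code) of the behavior this hack relies on: a non-blocking socket whose send buffer is full fails immediately with EAGAIN/EWOULDBLOCK, while a blocking socket simply waits for the peer to drain the buffer.

```python
# Standalone illustration (not Redis code) of non-blocking vs blocking writes.
import errno
import socket

a, b = socket.socketpair()        # 'b' never reads, so 'a's send buffer fills up
a.setblocking(False)
try:
    while True:
        a.send(b'x' * 65536)      # keep writing until the kernel buffer is full
except BlockingIOError as e:
    # The EAGAIN/EWOULDBLOCK case where writeToClient currently gives up.
    assert e.errno in (errno.EAGAIN, errno.EWOULDBLOCK)
    print('non-blocking send would block; the remaining data is never sent')

a.settimeout(2.0)                 # a blocking send would wait here instead; a
try:                              # timeout is used only so the demo terminates
    a.send(b'x' * 65536)
except socket.timeout:
    print('blocking send waits until the peer drains the buffer')
```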
Blocking works and is roughly equivalent to writing to a (slow) disk. However, unlike a single file, a replicating master can be configured with multiple replicas, and they would be handled strictly one at a time. Depending on where the bottleneck is, this could increase the time to shut down far more than if the replicas drained concurrently, so blocking is not ideal here.
This improves the result considerably, but there is still a window at the end where the second issue may abort draining what's left in the socket buffer. To confirm this I did a quick and dirty removal of the 1-second replication cron task which sends the ACK message. This kept the replica quiet on its end of the socket while the master shuts down, which allowed the socket buffer to drain fully and the conversation to end with a FIN.
With both hacks in place the example script works "all the time", meaning the several times I've tried it, on Linux only, and with no other edge cases taken into account.
m: 50000 s: 39825
m: 100000 s: 79622
m: 150000 s: 120246
m: 200000 s: 158027
m: 250000 s: 198540
m: 300000 s: 236777
m: 350000 s: 276379
m: 400000 s: 315071
m: 450000 s: 354218
m: 500000 s: 392000
i: 530100 last i: 530000
s: 530000
nice!
Comment From: abhaagar
Some replication systems work using a coordinate system; for example, in MySQL the slave server uses a binlog position to get updates from the master server. From this perspective, I am not sure the issue is actually data loss. Also, should we really compare writing to disk with replication? Writing to disk completely may be necessary to guarantee ACID properties, but that does not hold for replication. EAGAIN or EWOULDBLOCK is totally acceptable with non-blocking sockets and is a non-fatal error; the read/write should be retried when the socket is ready again, as signalled by the event recognized by poll().
Comment From: bjosv
As a note: it's possible to reproduce the outcome by running the replica within a container on the dev machine, i.e. the same machine as the master but using different ports. The CPU resources can be limited using --cpus, like:
docker run --cpus=".09" --network host --rm -v $CONFIGPATH:/usr/local/etc/redis redis:6.2.6 redis-server /usr/local/etc/redis/replica.conf
Comment From: oranagra
There's already a call to flushSlavesOutputBuffers() in prepareForShutdown(), and we can also maybe properly close these sockets, but I don't think we should keep the master alive (unresponsive to clients) and try to write additional data to sockets that are already full (EWOULDBLOCK).
We can also maybe change some logic in the replica to keep reading data from the master even if it failed writing, although that part sounds risky since it'll also delay re-connection.
However, despite both of these "best effort" fixes, the end result is the same: the only proper way to safely restart the master is to drain the traffic before doing so.
i.e. either promote the replica with a FAILOVER, or just CLIENT PAUSE WRITE, and wait for the master_repl_offset to match.
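For reference, a rough sketch of that manual drain using redis-py (the host, timeout, and polling interval are my assumptions, not a documented procedure):

```python
#!/usr/bin/env python3
# Sketch: pause writes, wait until every connected replica has acknowledged
# the master's replication offset, then shut down the master.
import time
import redis

r = redis.Redis(host='master', port=6379)

# Stop accepting writes so master_repl_offset can no longer grow (Redis >= 6.2).
r.execute_command('CLIENT', 'PAUSE', '60000', 'WRITE')

deadline = time.time() + 60
while time.time() < deadline:
    info = r.info('replication')
    master_offset = info['master_repl_offset']
    # Connected replicas show up as slave0, slave1, ... with the last offset
    # each one has acknowledged.
    replicas = [v for k, v in info.items() if k.startswith('slave') and isinstance(v, dict)]
    if replicas and all(rep['offset'] >= master_offset for rep in replicas):
        r.shutdown()              # replicas have caught up; safe to stop the master
        break
    time.sleep(0.1)
else:
    print('timed out waiting for replicas to catch up')
```

A FAILOVER, as also suggested, would additionally keep the data writable by promoting a replica before the old master goes away.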
Comment From: yossigo
@yesbox Do you have a specific use case where this is a problem that cannot be solved in the other ways @oranagra described?
It can be tempting to address this issue with something like SHUTDOWN DRAIN-REPLICATION which will block and wait for all replication buffers to flush - but will this be a complete solution? If our goal is to make sure replicas are fully synced when shutting down, we would still not achieve that:
- We could hang forever, or alternatively time out and still lose data
- How do we handle connection drops? If we allow reconnects we're practically still "up" and accepting clients as well.
- Even if we've flushed to socket successfully, we have no guarantee the replica was able to read and load the data.
Comment From: zuiderkwast
The Kubernetes way of doing graceful shutdown of any service is SIGTERM. I suppose that's why you need this trigger, is that right @yesbox?
IMHO it would be good if graceful shutdown could be a little more graceful. Instead of shutdown_asap = 1, can we do a "shutdown_soon": stop accepting writes, keep the event loop running, and wait for replication to catch up? Then we can force shutdown after a timeout (10 seconds, or configurable?) if replication still hasn't caught up. It's no guarantee, but perhaps a better best effort.
Comment From: yesbox
Hi,
Thank you very much for the responses.
Addressing the feedback
My expectation is a more thorough effort to complete writing to replicas before the process exits or the slave sockets are closed during a graceful shutdown. I also expect that when this effort fails, which we can allow it to, an error is logged per replica we failed to complete writing to before the process exits, so that any data loss is not silent.
Kubernetes is a good example and I'm sure that's a problem I will have soon enough, but it's also about being a well-behaved daemon in general. When shutting down or rebooting any computer I use, the OS signals all processes to shut down and allows them some time to do so. The same goes for stopping or restarting daemons via your init system or what have you. We should use this time to avoid losing data, as I believe is the intention. That is the kind of system I'm working with today, where every process gets a SIGTERM from the init system and there is no early-warning system.
A database that highly values racing to exit over not losing recently written data goes, to me, against the principle of least astonishment. Until investigating and finding the cause of the intermittent data loss, everyone in contact with this system assumed shutting down was safe, so this is far from obvious and not what users expect. Therefore I don't think this was ever the intention, but rather an issue.
Referring to the Redis Signals Handling documentation, which I found just now, I see we value not losing data more highly than I first thought when replicating to disk. Not only do we synchronously save and fsync (this much I expected), which already takes an unbounded amount of time by default, as Redis replicates to disk by default, but we also:
In case the RDB file can't be saved, the shutdown fails, and the server continues to run in order to ensure no data loss. Since Redis 2.6.11 no further attempt to shut down will be made unless a new SIGTERM will be received or the SHUTDOWN command issued.
I think this makes a good case for extending these values to slave replication, bringing it closer to the legacy expectations already in place for disk replication.
Proposal
Now, combining the feedback from @oranagra, @yossigo and @zuiderkwast, I see an opening. It's essentially what @zuiderkwast already said, but with more words; here it goes. A proposal:
On SIGTERM/SIGINT, if the instance has slaves, shutdown_soon is changed from 0 to a timestamp. While it's non-zero, serverCron checks whether the current time is higher than the shutdown_soon timestamp plus a configurable shutdown_timeout. That makes it still best effort, more so than disk replication, but we can now make a more thorough effort.
Now that we don't immediately jump out of the event loop, we can CLIENT PAUSE WRITE (conceptually or exactly) and wait for all currently connected replicas to report an offset equal to ours, which we'll now check since shutdown_soon is non-zero.
Bonus points for querying the replicas for their offsets at a higher rate than the 1-second replica report interval, so as not to add up to 1 second of extra latency to the detection and exit, and not to widen the window during which we could falsely report data loss if the shutdown timeout races with the first report that all replicas have caught up.
When the master has drained to all connected replicas, then "shutdown_asap". Maybe log a line saying there were replicas connected and they are all good now, if that's what we do for disk replication already. Otherwise, not losing data should be the default expected case, and we don't speak up when everything is working as expected.
If the timeout is reached before the master has drained to all connected replicas, then too bad - log an error and "shutdown_asap".
If any replica connection is lost during the shutdown before reporting the final offset, log an error. If all replica connections are lost, log an error and "shutdown_asap".
That makes three ways to complete shutdown_soon, excluding the case of no replicas, in which case it need not enter this state in the first place:
- All connected replicas reported the master's offset while the master no longer accepts writes. Excellent!
- All connected replicas disconnected. Bad, but now we know!
- Timed out. Bad, but now we know!
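To make the flow concrete, here is a toy, self-contained Python model of the proposed state machine; all names (shutdown_soon, SHUTDOWN_TIMEOUT, Replica, and so on) are illustrative, not actual Redis internals.

```python
import time
from dataclasses import dataclass

SHUTDOWN_TIMEOUT = 10.0            # the proposed configurable shutdown_timeout, in seconds

@dataclass
class Replica:
    ack_offset: int                # last replication offset this replica acknowledged

class Master:
    def __init__(self, replicas):
        self.replicas = replicas
        self.repl_offset = 0       # our master_repl_offset
        self.shutdown_soon = 0.0   # 0 = normal operation, otherwise signal timestamp

    def on_sigterm(self):
        if self.replicas:
            self.shutdown_soon = time.time()   # enter the draining state, pause writes
        else:
            self.shutdown_asap('no replicas connected')

    def cron(self):
        # Periodic check, analogous to serverCron, while shutdown_soon is non-zero.
        if not self.shutdown_soon:
            return
        if not self.replicas:
            self.shutdown_asap('error: all replicas disconnected')
        elif all(r.ack_offset >= self.repl_offset for r in self.replicas):
            self.shutdown_asap('all replicas caught up')
        elif time.time() > self.shutdown_soon + SHUTDOWN_TIMEOUT:
            self.shutdown_asap('error: timed out, possible data loss')

    def shutdown_asap(self, reason):
        print('shutting down:', reason)
```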
I'm leaning towards not letting any replicas or other clients connect during shutdown. We're going down soon, so rejecting is likely the best outcome for clients, letting them connect instead to a master that can accept writes. As for (re-)connecting replicas during shutdown: if they were not already connected, or were so unstable that they could not keep the connection open, attempting to keep them up to date at this time goes beyond best effort. Given that the time to shut down is short, their connection attempt could just as well have come too late anyway, as this is a race against the shutdown.
This way we do not complicate the synchronous prepareForShutdown procedure. We do get confirmation from the replicas by means of offset reporting and comparison before exiting, which I think is somewhat equivalent to fsync returning. We keep the responsibility of being connected and receiving updates with the replicas, as this jibes well with how replicas are set up with replicaof, connecting to their master and not the other way around. We cannot hang forever, and if we fail despite our good efforts, it was not in vain, because at least we didn't go quietly. 🙂
Comment From: yossigo
@yesbox I think we'll need to consider what additional implications and side effects this may have, but I mostly agree with this proposal.
Comment From: zuiderkwast
I'm willing to work on this once we sort out some details.
FAILOVER or just CLIENT PAUSE WRITE?
CLIENT PAUSE WRITE ensures that there is no data loss, but to also keep the cluster as available as possible, we could promote a replica as soon as it has caught up on replication. That makes the cluster accept writes again as fast as possible. Any downsides to this idea?
I suppose it would have to be slightly different logic for cluster and standalone mode (FAILOVER vs CLUSTER FAILOVER), right?
After a successful failover, we could send CLUSTER FORGET before we exit. How about that?
FYI: I've been informed about a project where they use a private fork of https://github.com/AmadeusITGroup/Redis-Operator to run Redis Cluster on K8s. It contains a Go program which starts the Redis process. It catches the SIGTERM, sends a FAILOVER command to Redis, and then a FORGET after a successful failover. Apparently, this works well for "pod delete", but it doesn't work for "worker shutdown". In the "worker shutdown" case, both the Go program and the Redis process receive SIGTERM simultaneously, and Redis shuts down fast enough to cause data loss.
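For illustration only, here is a rough Python equivalent (not the operator's actual Go code) of the supervisor pattern described above; the hosts, ports, exact commands and polling details are my assumptions and may differ from what the operator really does.

```python
#!/usr/bin/env python3
# Trap SIGTERM, promote the replica with CLUSTER FAILOVER, then FORGET the old
# master node once the takeover is complete.
import signal
import time
import redis

master = redis.Redis(host='localhost', port=6379)
replica = redis.Redis(host='localhost', port=6380)

def handle_sigterm(signum, frame):
    old_master_id = master.execute_command('CLUSTER', 'MYID')
    replica.execute_command('CLUSTER', 'FAILOVER')
    # Wait for the replica to report itself as a master before cleaning up.
    while replica.info('replication')['role'] != 'master':
        time.sleep(0.1)
    # In a real cluster, CLUSTER FORGET would have to be sent to every remaining node.
    replica.execute_command('CLUSTER', 'FORGET', old_master_id)

signal.signal(signal.SIGTERM, handle_sigterm)
signal.pause()                     # supervise until the signal arrives
```

As noted above, this only helps while the Redis process itself stays alive long enough for the failover to finish, which is exactly the gap in the "worker shutdown" case.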
Comment From: yesbox
@zuiderkwast, I think just using CLIENT PAUSE WRITE and ensuring no data loss is appropriate as a first conservative default or option, and additional optimization could be optional and discussed as a separate issue.