Master reboot is detected by Sentinel, but it doesn't trigger a failover.
[7678] 24 Sep 10:53:16.164 * +reboot master redis1 192.168.250.128 2148
Keeping the old master creates the possibility of data loss, as slave sync is usually faster than persistence. Using AOF and syncing to disk on every write is the only case where the master should have newer data.
Comment From: jasongoodwin
See also: https://github.com/antirez/redis/issues/1281
I agree - I don't like this behaviour; it is fairly high risk in production. A slave restarting in production could take down an app if someone doesn't understand this behaviour.
The current solution I'm moving toward is to use a client (or extend an existing redis client in my case as there aren't many redis clients with sentinel support out there) to subscribe to the -slave-restart-as-master pubsub channel. That message should be treated the same as a -failover-end message triggering the reconfiguration of the client's state.
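Not a full solution, but a minimal redis-py sketch of that kind of subscriber, assuming a Sentinel on the usual port 26379 (the address and the channel names handled in the loop are illustrative, not a complete list of Sentinel events):

```python
import redis

# Hypothetical address of one of the Sentinel instances.
SENTINEL_HOST, SENTINEL_PORT = "192.168.250.128", 26379

sentinel = redis.Redis(host=SENTINEL_HOST, port=SENTINEL_PORT)
pubsub = sentinel.pubsub()
pubsub.psubscribe("*")  # Sentinel publishes its events (+reboot, +switch-master, ...) over Pub/Sub

for message in pubsub.listen():
    if message["type"] != "pmessage":
        continue
    channel = message["channel"].decode()
    data = message["data"].decode()
    if channel in ("+switch-master", "+reboot"):
        # Re-resolve the master address and rebuild the client's connection state here.
        print(f"sentinel event {channel}: {data}")
```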
The problem is that if you're using the reconfigure script hooks and someone triggers a slave-restart-as-master, your production system could get into a bad state, as that hook isn't called on -slave-restart-as-master.
This behaviour really needs to be documented in the redis-sentinel documentation.
Comment From: nikolayk812
@antirez this is still happening, even in version 4.0.1.
1:X 03 Oct 17:43:05.094 * +reboot master master-name 10.10.10.243 6379
and data got lost.
Comment From: antirez
I have the feeling that failing over on reboot by default is a pretty bold move, however I can definitely see that there is a case for having an option for this... However please note that there was also a bug leading to AOF persistence problems on reboots: usually the master will have all the data if it's not a hard reboot like a power outage, but because of that bug even gentle shutdowns created an issue with the AOF data written in the last second. Btw I'm planning to add such a feature regardless of the fact that the AOF problem should be gone now in the latest 4.0.x and 3.2.x.
Comment From: antirez
Btw @nikolayspb 4.0.1 still had that bug. 4.0.2 is the first one with the fix...
Comment From: nikolayk812
Ok, I will use 4.0.2. I don't use AOF/RDB because Redis would use an unexpected amount of memory due to the spawned child process, so it either hangs the VM if swap is enabled or gets killed by the OOM killer.
Comment From: kaarelk
I would assume that the slave has more recent data in memory than the master has on disk, as disk flushes are done at an interval (usually 1 sec), while data is pushed to the slave immediately.
Comment From: antirez
@kaarelk we had this exact conversation a few minutes ago in an internal chat about this issue. Things are a bit more complex, copying and pasting what I wrote:
Only in the very odd case of an immediate power outage, basically, because while fsync() happens every second, the write that flushes data to disk should happen continuously. [This is the moment when, after writing a sentence, you go check the code... wait a minute please :-)]
Ok, I said the truth :-) Basically the function flushAppendOnlyFile() is called in beforeSleep(), so every time the event loop returns, after serving the clients that were already ready to be served, we write the data to disk. However that is just a write(2) call; the actual flushing (fsync) only happens every second. So if Redis is rebooted in any way other than a sudden kernel crash or power outage, the master should have all the data. Examples: Redis crashes, Redis is manually SHUTDOWN, the kernel kills Redis for OOM, the virtual machine or host computer is cleanly shut down and restarted by the provider.

For this reason, failing over on reboot by default looks a bit too harsh: we detect the reboot when the master is already available again, so in the first instance the configured non-availability time was not exceeded, and now the master is working again, so why trigger a failover at this point? The only reason is that it may have less data, but it may very well have more, now that AOF is fixed :-D
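A toy sketch of the two syscalls involved (this is not the Redis source; the filename and AOF entry are made up), just to show why data survives anything short of a kernel crash or power loss:

```python
import os

# Data handed to write(2) lands in the kernel page cache, so it survives a
# process crash, OOM kill, or clean shutdown; only fsync(2) guarantees it
# survives a kernel panic or power outage.
fd = os.open("appendonly.aof", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

os.write(fd, b"*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n")  # done on every event-loop iteration
os.fsync(fd)  # with appendfsync everysec, done roughly once per second
os.close(fd)
```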
Comment From: nikolayk812
So if I don't use AOF/RDB I would still lose all the data? The master would force the slaves to erase all entries.
Comment From: antirez
@nikolayspb yes, a master configured to come back without data is dangerous. What we advise in the documentation for persistence-less setups is to configure the machine so that it does not automatically restart the Redis instance, so that instances cannot come back after a reboot. Note that the feature proposed here for Sentinel would not avoid this problem with persistence-less masters: the slaves would reconnect and be emptied. There is a proposed feature called "protected mode" to address this problem. It consists of an explicit Redis feature that prevents masters from being available after a restart, unblocking only once they are reconfigured as slaves of some other master; in this way a restart always triggers a failover. However, this feature was conceived to work in certain ways, and now there are probably better implementations of the same idea using PSYNC.

About the RDB persistence you mentioned: in that case, yes, it could be much better to promote a slave when the master reboots, but right now the problem is the same as with persistence-less setups, just less severe: slaves will reconnect once the master is up again. This can be avoided with a delay in the startup script of the Redis instances, so that the Sentinel failover timeout is shorter and triggers first, or it could be another good use case for "protected mode". In practical terms, in order to provide much more reliability to no-persistence or RDB-persistence HA setups, we could just:
- Add a PSYNC2 feature that does not allow replicating from masters whose replication ID belongs to an unknown history or is behind the current history. This is a simple addition.
- Also implement the feature proposed here of treating a detected restart as a permanent failure, in the case of masters.
With such additions, slaves would fail to connect to restarted masters unless the masters have all the data needed (but in the specific case of a master just turned into a slave, so without prior knowledge of the master's history), and masters would be failed over on restart even if they restarted fast enough not to trigger Sentinel's down-after-milliseconds limit.
Comment From: laszlovl
@antirez Wouldn't the warning about automatic restarts (on https://redis.io/topics/replication, under "Safety of replication when master has persistence turned off") also apply even with persistence enabled?
If the Redis master is shut down gracefully and comes back immediately (before Sentinel elects a new master), there's no problem, since it updated its dump.rdb right before shutting down.
But if the master were to crash (or be OOM-killed) and come back immediately, it would return with stale data from a dump.rdb that might be minutes or hours old, depending on your save settings. Forcing the cluster to elect a new master would be much safer, since the other nodes will most likely already have the most recent data synced to them?
Comment From: ben-xo
We're hitting this issue constantly in Kubernetes at the moment.
- An OOM kill on a redis pod (due to caching a high write load whilst doing a large BGSAVE, so several GB of copy-on-write) takes the master down, and it then instantly respawns.
- Sentinel detects this as +reboot.
- The master is not writable while it loads 30 GB of data from disk. This takes 12 minutes.
- The two good replicas sit there doing nothing.
Obviously in this case 12 minutes constitutes a large outage and I'd much prefer it to fail over to a replica within a few seconds.
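For reference, this is roughly what clients see during that window, sketched with redis-py (the address is a placeholder):

```python
import redis

# Hypothetical master address; the restarted pod is reachable again, so
# Sentinel only logs +reboot and never fails over.
r = redis.Redis(host="10.0.0.1", port=6379, socket_timeout=2)

try:
    r.set("key", "value")
except redis.exceptions.BusyLoadingError:
    # Until the RDB/AOF is fully replayed, the master answers every command
    # with "-LOADING Redis is loading the dataset in memory", so the service
    # is effectively down even though Sentinel considers the master healthy.
    print("master is up but unusable while loading its dataset")
```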
Comment From: yossigo
@ben-xo Do you have some logs? Do you use resolve-hostnames? If so, it might be related to #8540.
Comment From: ben-xo
No, not using hostnames (v6.0.6 I think). The behaviour is exactly as described above: the master comes back so quickly that it doesn't trigger a failover. However, it then takes such a long time to load its data, and is in read-only mode for that period, that we effectively have an outage.
Comment From: yossigo
@ben-xo Thanks for clarifying this, so it's definitely a different issue.
Comment From: ben-xo
We are considering altering our redis pod startup script to introduce an artificial startup delay in order to work around the issue, as suggested above by @antirez. It doesn't seem as if we'll get the data to load any faster, as it's backed by GKE PVs, which are virtually attached and therefore bandwidth-limited.
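A minimal sketch of such a delayed entrypoint (the delay value and config path are placeholders; the delay has to exceed Sentinel's down-after-milliseconds for the failover to trigger before the master comes back):

```python
import subprocess
import time

# Hypothetical entrypoint wrapper: hold the restarted master back long enough
# for Sentinel's down-after-milliseconds to expire and a failover to kick in.
STARTUP_DELAY_SECONDS = 60  # must exceed the configured down-after-milliseconds

time.sleep(STARTUP_DELAY_SECONDS)
subprocess.run(["redis-server", "/etc/redis/redis.conf"], check=True)
```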
Comment From: synclpz
Weird that this has had no built-in solution (the protected mode @antirez mentioned) since 2017. A persistence-less container running the master restarts in milliseconds, becomes master again, and effectively wipes the data on all replicas before it is even detected as down/rebooted by Sentinel.
Comment From: daniel-house
This is still very much a problem today, and is directly related to this line in the code: https://github.com/redis/redis/blob/492d8d09613cff88f15dcef98732392b8d509eb1/src/sentinel.c#L2789
An easy fix is to add a configuration option that causes Sentinel to interpret LOADING as unavailable at this line. Another easy fix would be to add an option that causes the master to respond with a different error when it is loading immediately after starting. Either of these simple changes would cause a failover to a replica.
At the root, this issue is the same as https://github.com/redis/redis/issues/4561
Comment From: yossigo
Fixed by #9438