Hi there!
I've been testing our application against failure scenarios and found a behavior that I can't place: either I've misconfigured the setup or this is Redis' expected behavior.
To illustrate it, I wrote a script that reads a key, increments it, and writes it back once a second. While it runs, I trigger a failover with SENTINEL failover <name>. Here are the logs: https://gist.github.com/Draiken/3d9b40c8d92793448064f077c3193d3b
Redis server config: https://gist.github.com/Draiken/cec04c08323aa46404a099ada165d774
Redis sentinel config: https://gist.github.com/Draiken/05d317c4da5ee861940865cdc4e4830f
Here's the output of my script:
Sending SET 89
Sending SET 90
Sending SET 91
Sending SET 81
Got READONLY error. Switching masters
Sending SET 81
Sending SET 82
The key rolled back to a value from 10 seconds earlier, and the client received no error. The old master keeps accepting writes even after the sentinels know it has been demoted to a replica, and when it finally converts itself to a replica, it loses all writes performed since the failover started.
If the master node is unresponsive, the failover works fine as the master can't accept the writes. So this only happens with a manually triggered failover.
I could not find any documentation around this. My question is: is this expected behavior or is this a bug?
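For readers who want to see the sequence without setting up Sentinel, here is a minimal sketch of it as a toy in-memory model. It does not talk to Redis at all: `Node`, `ReadOnlyError`, `run_demo`, and the `writes_before` / `stale_window` parameters are hypothetical names chosen for illustration, and the instant replication and resync are simplifying assumptions. It only mimics the order of events described above: the replica is promoted, the old master keeps accepting writes for a while, and then it is demoted and resyncs from the new master.

```python
class ReadOnlyError(Exception):
    """Stands in for the READONLY error a replica returns on writes."""


class Node:
    """Toy stand-in for a Redis instance (assumption: not real Redis behavior)."""

    def __init__(self, name, role):
        self.name = name
        self.role = role      # "master" or "replica"
        self.data = {}
        self.replicas = []    # nodes that receive this node's writes

    def get(self, key):
        return self.data.get(key, 0)

    def set(self, key, value):
        if self.role != "master":
            raise ReadOnlyError(f"READONLY: {self.name} is a replica")
        self.data[key] = value
        for replica in self.replicas:
            replica.data[key] = value  # replication modeled as instantaneous


def run_demo(writes_before=80, stale_window=11):
    old = Node("old-master", "master")
    new = Node("new-master", "replica")
    old.replicas.append(new)
    target = old
    log = []

    def tick():
        # One iteration of the client loop: read, increment, write back.
        nonlocal target
        value = target.get("counter") + 1
        log.append(f"Sending SET {value}")
        try:
            target.set("counter", value)
        except ReadOnlyError:
            log.append("Got READONLY error. Switching masters")
            target = new
            value = target.get("counter") + 1
            log.append(f"Sending SET {value}")
            target.set("counter", value)

    for _ in range(writes_before):
        tick()

    # Failover starts: the replica is promoted and stops replicating,
    # but the old master has not been reconfigured yet.
    old.replicas.clear()
    new.role = "master"
    for _ in range(stale_window):
        tick()  # accepted by the old master, never seen by the new one

    # The old master is finally demoted and resyncs from the new master,
    # discarding everything it accepted during the stale window.
    old.role = "replica"
    old.data = dict(new.data)
    new.replicas.append(old)
    tick()
    tick()
    return log


if __name__ == "__main__":
    print("\n".join(run_demo()[-7:]))
```

With the default parameters, the last seven log lines match the script output above: the client writes 89, 90, 91 to the old master, then reads 80 back and hits READONLY on its next write.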
Comment From: antirez
Hello @Draiken, this is a known issue. Redis Cluster handles this well, while a Redis Sentinel manual failover just pretends the master is failing. They are very different features; however, Redis Sentinel should behave like Redis Cluster in this regard. There is another issue documenting all this better (I don't remember the issue number). Basically, Sentinel should use CLIENT PAUSE and check the replication offset, as Redis Cluster does, to guarantee no data loss during failovers.
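The "pause, then check the offset" idea can be sketched roughly as follows. This is a toy model under stated assumptions, not Sentinel's or Cluster's actual implementation: `Master`, `Replica`, `safe_failover`, and `max_rounds` are hypothetical names, the length of the master's write log stands in for the replication offset, and a paused write simply returns `False` here, whereas real CLIENT PAUSE makes clients wait instead.

```python
class Master:
    """Toy master: the length of `log` plays the role of the replication offset."""

    def __init__(self):
        self.log = []        # ordered list of (key, value) writes
        self.paused = False

    def write(self, key, value):
        if self.paused:
            return False     # assumption: real CLIENT PAUSE blocks, it does not refuse
        self.log.append((key, value))
        return True


class Replica:
    """Toy replica that applies the master's log a few entries at a time."""

    def __init__(self):
        self.data = {}
        self.offset = 0      # number of log entries applied so far

    def pull(self, master, max_entries=1):
        for key, value in master.log[self.offset:self.offset + max_entries]:
            self.data[key] = value
            self.offset += 1


def safe_failover(master, replica, max_rounds=100):
    """Pause writes, wait for the replica to catch up, then report whether
    promotion is safe. Returns False (and resumes writes) on timeout."""
    master.paused = True
    for _ in range(max_rounds):
        if replica.offset == len(master.log):
            return True      # offsets match: nothing would be lost on promotion
        replica.pull(master)
    master.paused = False
    return False
```

In this model, a replica that lags behind by a few writes catches up while the master is paused, so promoting it afterwards drops nothing; the rollback described in this issue corresponds to promoting without the pause and the offset check.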
Comment From: antirez
P.S. This is a duplicate. We should find the original issue and reference it here instead of continuing the discussion in this one.
Comment From: itamarhaber
Off the top of my head: #4819
Comment From: Draiken
May I suggest we document this more prominently in the Sentinel docs while a solution is being worked on? It's very likely that people are affected by this without knowing it, since it doesn't raise any errors on the client side.