Redis issue with the Partial Resynchronization (PSYNC) feature supported by Redis v4

We found an issue with the Partial Resynchronization (PSYNC) feature supported by Redis v4.

We investigated the scenarios in which Redis is able to perform PSYNC. While updating our 3-node setup, we came across a frequently occurring scenario in which PSYNC should be possible, but this scenario does not seem to be handled in the Redis v4.0.2 code base.

Consider a 3-node Redis setup with master node A and slave nodes B and C. Each node runs a redis process and a sentinel process. When node A goes down for an update, failover makes node B the new master. On promotion, node B generates a new replication ID and caches the ID of the previous master (node A) as its secondary replication ID. Node C is able to PSYNC with the new master, because

Secondary replication ID on new master (Node B) = Primary replication Id of Node C
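
The condition above can be written as a small check. What follows is a simplified Python model of the decision a master makes for a PSYNC request, not the Redis source: `MasterState` and `can_partial_resync` are names made up for illustration, the backlog is reduced to a start offset and a length, and the offsets in the example are hypothetical. Besides the ID match, the requested offset has to still be inside the backlog, and a match on the secondary ID only counts up to the offset at which the master switched IDs.

```python
from dataclasses import dataclass


@dataclass
class MasterState:
    replid: str                # current (primary) replication ID
    replid2: str               # previous master's ID (secondary replication ID)
    second_replid_offset: int  # replid2 is only valid up to this offset
    backlog_start: int         # first offset still held in the replication backlog
    backlog_len: int           # number of bytes held in the backlog


def can_partial_resync(m: MasterState, req_id: str, req_offset: int) -> bool:
    """Return True if a PSYNC <req_id> <req_offset> could be served partially."""
    id_ok = req_id == m.replid or (
        req_id == m.replid2 and req_offset <= m.second_replid_offset
    )
    in_backlog = m.backlog_start <= req_offset <= m.backlog_start + m.backlog_len
    return id_ok and in_backlog


OLD_ID = "85e4a324c39c7efef4659e5f82b22149fa432c92"  # node A's ID before failover
NEW_ID = "d519b09ea2a091a31bfee21ae2047c25af7e7426"  # node B's ID after promotion

# Hypothetical offsets chosen for illustration only.
node_b = MasterState(NEW_ID, OLD_ID, second_replid_offset=45000,
                     backlog_start=1, backlog_len=45726)

# Node C asks with the old master's ID: accepted via the secondary ID.
node_c_ok = can_partial_resync(node_b, OLD_ID, 44990)
# Restarted node A asks with a fresh ID and offset 1: rejected.
node_a_ok = can_partial_resync(node_b, "c387ccf64af7583739b7489a2ca68f1641be45b2", 1)
print(node_c_ok, node_a_ok)
```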

Now, when node A is started after the update, it comes up as a slave. Because of the role change, the node gets a new replication ID and an offset of 1. It does not cache its previous replication ID, so it requests a partial sync with a completely new replication ID and offset 1, and the new master rejects the PSYNC.

Possible solution to enable slave (previously master) to PSYNC with new master:

When a node that was previously a master comes up as a slave after a process restart, it should retain its previous replication ID as its secondary ID. It should then send the PSYNC request to the new master using that secondary ID. This way, it will be able to perform PSYNC with the new master.
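
A minimal sketch of this proposal, assuming the node can persist its last replication ID and offset across a restart. `PersistedReplState` and `psync_request` are hypothetical names, not Redis APIs; the `offset + 1` mirrors the `:1` seen in the request when the node starts from offset 0. The only thing that changes is which ID and offset go into the PSYNC request.

```python
import secrets
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PersistedReplState:
    replid: str  # replication ID the node had before it was stopped
    offset: int  # last replication offset the node had reached


def psync_request(saved: Optional[PersistedReplState],
                  reuse_saved: bool) -> Tuple[str, int]:
    """Return the (replid, offset) a restarted node would send in PSYNC."""
    if saved is not None and reuse_saved:
        # Proposed behaviour: ask to continue the history the node already holds.
        return saved.replid, saved.offset + 1
    # Reported behaviour: a fresh random 40-character ID and offset 1,
    # which cannot match either ID on the new master.
    return secrets.token_hex(20), 1


# Hypothetical saved state for node A (the offset is illustrative).
saved_a = PersistedReplState("85e4a324c39c7efef4659e5f82b22149fa432c92", 45000)
print(psync_request(saved_a, reuse_saved=True))
print(psync_request(saved_a, reuse_saved=False))
```

Even with this change a full resync would still be needed if the old master had accepted writes that never reached the promoted node, since its offset would then lie beyond the point at which the new master switched IDs.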

Eg:

3-node setup with nodes A, B, and C:

Node A is the master node, and all three nodes have the same replication ID (that of the master), 85e4a324c39c7efef4659e5f82b22149fa432c92, and the secondary replication ID 0000000000000000000000000000000000000000.

When node A goes down as part of the update process, node B becomes master. It sets its secondary replication ID to 85e4a324c39c7efef4659e5f82b22149fa432c92 (the replication ID of the previous master, node A) and generates a new primary ID: d519b09ea2a091a31bfee21ae2047c25af7e7426.
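
These IDs can be inspected on each node in the replication section of `INFO` (fields `master_replid`, `master_replid2`, `master_repl_offset`, `second_repl_offset`). Below is a small Python sketch that parses that text. The sample reuses the IDs from this report, but the offsets and the `connected_slaves` value are made-up for illustration, and `parse_info_replication` is just a helper name chosen here.

```python
def parse_info_replication(text: str) -> dict:
    """Parse `INFO replication` output (key:value lines) into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# Replication" section header.
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


# Illustrative output for node B after promotion; offsets are hypothetical.
SAMPLE = """\
# Replication
role:master
connected_slaves:1
master_replid:d519b09ea2a091a31bfee21ae2047c25af7e7426
master_replid2:85e4a324c39c7efef4659e5f82b22149fa432c92
master_repl_offset:45726
second_repl_offset:45001
"""

info = parse_info_replication(SAMPLE)
print(info["master_replid"])
print(info["master_replid2"])
print(info["second_repl_offset"])
```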

Logs of node A when it is restarted, comes up as a slave, and sends a partial sync request to the new master:

10279:C 01 Aug 10:07:43.606 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
10279:C 01 Aug 10:07:43.606 # Redis version=4.0.2, bits=64, commit=00000000, modified=0, pid=10279, just started
10280:M 01 Aug 10:07:43.610 * DB loaded from disk: 0.000 seconds
10280:M 01 Aug 10:07:43.610 * Ready to accept connections
10280:S 01 Aug 10:07:54.343 * Before turning into a slave, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
10280:S 01 Aug 10:07:54.344 * SLAVE OF 10.11.23.98:6379 enabled (user request from 'id=2 addr=10.11.23.97:46430 fd=8 name=sentinel-50f63423-cmd age=10 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=r cmd=exec')
10280:S 01 Aug 10:07:54.344 # CONFIG REWRITE executed with success.
10280:S 01 Aug 10:07:54.639 * Connecting to MASTER 10.11.23.98:6379
10280:S 01 Aug 10:07:54.639 * MASTER <-> SLAVE sync started
10280:S 01 Aug 10:07:54.640 * Non blocking connect for SYNC fired the event.
10280:S 01 Aug 10:07:54.640 * Master replied to PING, replication can continue...
10280:S 01 Aug 10:07:54.641 * Trying a partial resynchronization (request c387ccf64af7583739b7489a2ca68f1641be45b2:1).
10280:S 01 Aug 10:07:54.642 * Full resync from master: d519b09ea2a091a31bfee21ae2047c25af7e7426:45727
10280:S 01 Aug 10:07:54.642 * Discarding previously cached master state.
10280:S 01 Aug 10:07:54.670 * MASTER <-> SLAVE sync: receiving 198 bytes from master
10280:S 01 Aug 10:07:54.670 * MASTER <-> SLAVE sync: Flushing old data
10280:S 01 Aug 10:07:54.670 * MASTER <-> SLAVE sync: Loading DB in memory
10280:S 01 Aug 10:07:54.670 * MASTER <-> SLAVE sync: Finished with success

We understand that when a master node is demoted to slave it begins a new data history, so it has to generate a new replication ID. However, if it kept its previous replication ID as its secondary ID and sent the PSYNC request with that secondary ID, PSYNC would be possible.

When a similar restart happens on a slave node and there is no role change, the node retains its previous replication ID and performs PSYNC using that ID.

Logs of a slave process that is able to perform PSYNC even after a restart:

7181:S 01 Aug 10:47:56.664 * DB saved on disk .
7181:S 01 Aug 10:47:57.825 # User requested shutdown...
7181:S 01 Aug 10:47:57.825 * Removing the pid file.
7181:S 01 Aug 10:47:57.825 # Redis is now ready to exit, bye bye...
10695:C 01 Aug 10:49:21.233 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
10695:C 01 Aug 10:49:21.233 # Redis version=4.0.2, bits=64, commit=00000000, modified=0, pid=10695, just started
10696:S 01 Aug 10:49:21.238 * DB loaded from disk: 0.000 seconds
10696:S 01 Aug 10:49:21.238 * Before turning into a slave, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
10696:S 01 Aug 10:49:21.238 * Ready to accept connections
10696:S 01 Aug 10:49:21.238 * Connecting to MASTER 10.11.23.98:6379
10696:S 01 Aug 10:49:21.238 * MASTER <-> SLAVE sync started
10696:S 01 Aug 10:49:21.238 * Non blocking connect for SYNC fired the event.
10696:S 01 Aug 10:49:21.238 * Master replied to PING, replication can continue...
10696:S 01 Aug 10:49:21.239 * Trying a partial resynchronization (request d519b09ea2a091a31bfee21ae2047c25af7e7426:527926).
10696:S 01 Aug 10:49:21.239 * Successful partial resynchronization with master.
10696:S 01 Aug 10:49:21.240 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization
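
To compare the two cases quickly, the request and its outcome can be pulled out of a slave's log. This is a rough Python sketch that relies only on the message texts visible in the logs above; `summarize_replica_log` is a helper name chosen here, and the sample string is two lines copied from node A's log.

```python
import re
from typing import Optional

# Matches: Trying a partial resynchronization (request <40-hex-id>:<offset>).
REQUEST_RE = re.compile(
    r"Trying a partial resynchronization \(request ([0-9a-f]{40}):(\d+)\)"
)


def summarize_replica_log(log: str) -> Optional[dict]:
    """Extract the PSYNC request (ID, offset) and whether it ended partial or full."""
    m = REQUEST_RE.search(log)
    if m is None:
        return None
    if "Successful partial resynchronization" in log:
        outcome = "partial"
    elif "Full resync from master" in log:
        outcome = "full"
    else:
        outcome = "unknown"
    return {"replid": m.group(1), "offset": int(m.group(2)), "outcome": outcome}


restarted_master_log = (
    "10280:S 01 Aug 10:07:54.641 * Trying a partial resynchronization "
    "(request c387ccf64af7583739b7489a2ca68f1641be45b2:1).\n"
    "10280:S 01 Aug 10:07:54.642 * Full resync from master: "
    "d519b09ea2a091a31bfee21ae2047c25af7e7426:45727\n"
)
summary = summarize_replica_log(restarted_master_log)
print(summary)
```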

Can someone please check whether this scenario, in which a master node comes up as a slave after a restart and is still able to partially sync with the new master, is a valid use case for the partial sync feature? It would significantly reduce our downtime when updating a setup that holds a large amount of data.

Comment From: antirez

Sorry for the brief reply, I'm on vacation. I just wanted to tell you that over the course of Redis 4 we fixed tens of bugs related to your issue, so please test with the latest 4.0.x and check whether the issue still applies. There is a very good chance that everything now works as expected.

Comment From: ankita0811

@antirez We evaluated the latest Redis release, 4.0.11, and found that the same issue persists in this version too.

Summarising the scenario: a master node comes up as a slave after a node update and restart. It should be able to perform a partial sync with the new master. Currently, it has to perform a full sync with the new master.

Comment From: ankita0811

Hi @antirez, can you please look into this use case for partial synchronization? It would reduce our downtime during periodic updates.

Comment From: zouzc

The same issue also persists in 4.0.14.

Comment From: purpletsy

@antirez The issue also occurs on 5.0.7. Our cluster is composed of 6 nodes and uses RDB, not AOF, as its backup. When I shut down one master node manually and started it again, I saw a full resynchronization, not a partial resynchronization, from the slave (old master) to the master.

Logs are below:

master node:

14057:M 27 Feb 2020 14:36:51.095 * Clear FAIL state for node f01e6ad58e43a36652ae73aa1c6474e6add4efce: master without slots is reachable again.
14057:M 27 Feb 2020 14:36:52.093 * Replica 172.31.1.204:6379 asks for synchronization
14057:M 27 Feb 2020 14:36:52.093 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for '9cc68990cc49b683592e4855cc17b28c90c4e745', my replication IDs are '3ccfadfd0b68aee269d5a25fc83325c66b469bb0' and '4c568d8cf2f6952d472fe7aa8cbc3dcaa589d135')
14057:M 27 Feb 2020 14:36:52.093 * Starting BGSAVE for SYNC with target: disk
14057:M 27 Feb 2020 14:36:52.094 * Background saving started by pid 14759
14759:C 27 Feb 2020 14:36:52.099 * DB saved on disk
14759:C 27 Feb 2020 14:36:52.100 * RDB: 0 MB of memory used by copy-on-write
14057:M 27 Feb 2020 14:36:52.196 * Background saving terminated with success
14057:M 27 Feb 2020 14:36:52.196 * Synchronization with replica 172.31.1.204:6379 succeeded

slave node (old master node):

22001:M 27 Feb 2020 14:36:51.088 # Server initialized
22001:M 27 Feb 2020 14:36:51.088 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
22001:M 27 Feb 2020 14:36:51.089 * DB loaded from disk: 0.000 seconds
22001:M 27 Feb 2020 14:36:51.089 * Ready to accept connections
22001:M 27 Feb 2020 14:36:51.094 # Configuration change detected. Reconfiguring myself as a replica of 84337cb6eee10e19d466280d6c60e49974857ee3
22001:S 27 Feb 2020 14:36:51.094 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
22001:S 27 Feb 2020 14:36:51.095 # Cluster state changed: ok
22001:S 27 Feb 2020 14:36:52.094 * Connecting to MASTER 172.31.1.201:6379
22001:S 27 Feb 2020 14:36:52.095 * MASTER <-> REPLICA sync started
22001:S 27 Feb 2020 14:36:52.095 * Non blocking connect for SYNC fired the event.
22001:S 27 Feb 2020 14:36:52.097 * Master replied to PING, replication can continue...
22001:S 27 Feb 2020 14:36:52.099 * Trying a partial resynchronization (request 9cc68990cc49b683592e4855cc17b28c90c4e745:1).
22001:S 27 Feb 2020 14:36:52.101 * Full resync from master: 3ccfadfd0b68aee269d5a25fc83325c66b469bb0:18172
22001:S 27 Feb 2020 14:36:52.102 * Discarding previously cached master state.
22001:S 27 Feb 2020 14:36:52.203 * MASTER <-> REPLICA sync: receiving 194 bytes from master
22001:S 27 Feb 2020 14:36:52.204 * MASTER <-> REPLICA sync: Flushing old data
22001:S 27 Feb 2020 14:36:52.204 * MASTER <-> REPLICA sync: Loading DB in memory
22001:S 27 Feb 2020 14:36:52.205 * MASTER <-> REPLICA sync: Finished with success
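
On the master side, the rejection line already names all three IDs involved. A small Python sketch, based only on the message format seen in the master log above, that splits them out (`parse_mismatch` is a name picked for this example):

```python
import re
from typing import Optional

# Matches the master's "Replication ID mismatch" message and captures the
# ID the replica asked for plus the master's two replication IDs.
MISMATCH_RE = re.compile(
    r"Replication ID mismatch \(Replica asked for '([0-9a-f]{40})', "
    r"my replication IDs are '([0-9a-f]{40})' and '([0-9a-f]{40})'\)"
)


def parse_mismatch(line: str) -> Optional[dict]:
    """Return the requested ID and the master's primary/secondary IDs, if present."""
    m = MISMATCH_RE.search(line)
    if m is None:
        return None
    asked, replid, replid2 = m.groups()
    return {"asked": asked, "replid": replid, "replid2": replid2}


line = (
    "14057:M 27 Feb 2020 14:36:52.093 * Partial resynchronization not accepted: "
    "Replication ID mismatch (Replica asked for "
    "'9cc68990cc49b683592e4855cc17b28c90c4e745', my replication IDs are "
    "'3ccfadfd0b68aee269d5a25fc83325c66b469bb0' and "
    "'4c568d8cf2f6952d472fe7aa8cbc3dcaa589d135')"
)
ids = parse_mismatch(line)
print(ids)
```

In this log the requested ID matches neither of the master's IDs and the requested offset is 1, the same pattern as in the original 4.0.2 report.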

Comment From: mgenov

The situation seems similar with Redis 6.2.6 (RDB + AOF) in a setup with a total of 6 nodes (3 masters and 3 slaves) when one of the nodes goes down.

1:M 09 Jan 2022 18:50:23.108 * Reading RDB preamble from AOF file...
1:M 09 Jan 2022 18:50:23.108 * Loading RDB produced by version 6.2.6
1:M 09 Jan 2022 18:50:23.108 * RDB age 3989692 seconds
1:M 09 Jan 2022 18:50:23.108 * RDB memory usage when created 2.68 Mb
1:M 09 Jan 2022 18:50:23.108 * RDB has an AOF tail
1:M 09 Jan 2022 18:50:23.111 # Done loading RDB, keys loaded: 5, keys expired: 0.
1:M 09 Jan 2022 18:50:23.111 * Reading the remaining AOF tail...
1:M 09 Jan 2022 18:56:05.383 # <search> Skip background reindex scan, redis version contains loaded event.
1:M 09 Jan 2022 18:56:05.383 * DB loaded from append only file: 342.297 seconds
1:M 09 Jan 2022 18:56:05.384 * Ready to accept connections
1:M 09 Jan 2022 18:56:05.396 * 10 changes in 300 seconds. Saving...
1:M 09 Jan 2022 18:56:05.422 * Background saving started by pid 1447
1:M 09 Jan 2022 18:56:05.605 * Replica 10.72.2.8:6379 asks for synchronization
1:M 09 Jan 2022 18:56:05.605 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for 'ccc13ca1aceda44f21fc99456dd5fc393ad70fe3', my replication IDs are '6ac5304a744c7693b69e8b97d7574871dcefc0dc' and '0000000000000000000000000000000000000000')
1:M 09 Jan 2022 18:56:05.606 * Replication backlog created, my new replication IDs are '1199ad39c4fb6cb2a4cdfbfdb0dbc3de15c5d18b' and '0000000000000000000000000000000000000000'
1:M 09 Jan 2022 18:56:05.606 * Can't attach the replica to the current BGSAVE. Waiting for next BGSAVE for SYNC

Comment From: garry-t

7.2.4 https://github.com/redis/redis/issues/13483