I'm running a Redis Cluster 5.0.7 with 216 nodes (108 masters), and I observe that sometimes an old master fails to become a replica of its former slave after being offline for some time.

In the logs I see messages indicating that the node identified that it needs to become a replica, but it stays a master with 0 slots according to cluster nodes and redis-cli --cluster check. Here are what I think are the related log entries:

I 2020-01-09T23:12:56.622861229Z 1:M 09 Jan 2020 23:12:56.622 * Node configuration loaded, I'm d031745a16b055d158d9b6b56c6a2da0ccc7b567
I 2020-01-09T23:12:56.623263851Z 1:M 09 Jan 2020 23:12:56.623 * Running mode=cluster, port=6379.
I 2020-01-09T23:12:56.623411131Z 1:M 09 Jan 2020 23:12:56.623 # Server initialized
I 2020-01-09T23:12:56.623453155Z 1:M 09 Jan 2020 23:12:56.623 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
I 2020-01-09T23:12:56.624491118Z 1:M 09 Jan 2020 23:12:56.624 * DB loaded from disk: 0.001 seconds
I 2020-01-09T23:12:56.624508182Z 1:M 09 Jan 2020 23:12:56.624 * Ready to accept connections
I 2020-01-09T23:12:56.669562061Z 1:M 09 Jan 2020 23:12:56.669 # Address updated for node 8ff429b06dcb6dcdf39a9823df3f7b87ce59a184, now 10.8.3.194:6379
I 2020-01-09T23:12:56.670212013Z 1:M 09 Jan 2020 23:12:56.670 # Configuration change detected. Reconfiguring myself as a replica of 6a14699068ea03d89181b3af5d0b5c8a03d3e9ca
I 2020-01-09T23:12:56.670224129Z 1:S 09 Jan 2020 23:12:56.670 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
I 2020-01-09T23:12:56.670613600Z 1:S 09 Jan 2020 23:12:56.670 # Address updated for node d031745a16b055d158d9b6b56c6a2da0ccc7b567, now 10.8.3.3:6379
I 2020-01-09T23:12:56.670625436Z 1:S 09 Jan 2020 23:12:56.670 # Discarding UPDATE message about myself.
I 2020-01-09T23:12:57.628407779Z 1:S 09 Jan 2020 23:12:57.628 # Address updated for node c4443eb2039a5be6861e959fac7ef1ad9b105ed3, now 10.8.4.66:6379
I 2020-01-09T23:12:57.659487756Z 1:S 09 Jan 2020 23:12:57.659 * Connecting to MASTER 10.8.1.67:6379
I 2020-01-09T23:12:57.659536728Z 1:S 09 Jan 2020 23:12:57.659 * MASTER <-> REPLICA sync started
I 2020-01-09T23:12:57.659764121Z 1:S 09 Jan 2020 23:12:57.659 * Non blocking connect for SYNC fired the event.
I 2020-01-09T23:12:57.660107819Z 1:S 09 Jan 2020 23:12:57.660 * Master replied to PING, replication can continue...
I 2020-01-09T23:12:57.660906098Z 1:S 09 Jan 2020 23:12:57.660 * Trying a partial resynchronization (request 9dcea3c22182070c6a8f86c057a444ee393b5675:1).
I 2020-01-09T23:12:57.661684539Z 1:S 09 Jan 2020 23:12:57.661 * Full resync from master: 6ffb118b20a592e56c0b59fb332709c5d5426700:99064
I 2020-01-09T23:12:57.661700728Z 1:S 09 Jan 2020 23:12:57.661 * Discarding previously cached master state.
I 2020-01-09T23:12:57.727687056Z 1:S 09 Jan 2020 23:12:57.727 # Address updated for node 5a12b3d76a8d4eb594bcec8661f750f1760ce158, now 10.8.4.67:6379
I 2020-01-09T23:12:57.844343292Z 1:S 09 Jan 2020 23:12:57.844 # Address updated for node d5ccd24e60bed5ca898c4bb3a1784174e1e4f1ad, now 10.8.4.68:6379
I 2020-01-09T23:12:57.859815431Z 1:S 09 Jan 2020 23:12:57.859 * MASTER <-> REPLICA sync: receiving 904 bytes from master
I 2020-01-09T23:12:57.859869274Z 1:S 09 Jan 2020 23:12:57.859 * MASTER <-> REPLICA sync: Flushing old data
I 2020-01-09T23:12:57.859952504Z 1:S 09 Jan 2020 23:12:57.859 * MASTER <-> REPLICA sync: Loading DB in memory
I 2020-01-09T23:12:57.859966696Z 1:S 09 Jan 2020 23:12:57.859 * MASTER <-> REPLICA sync: Finished with success
I 2020-01-09T23:12:58.662747113Z 1:S 09 Jan 2020 23:12:58.662 # Cluster state changed: ok

Then there is a bunch of messages like:
I 2020-01-09T23:12:59.828828563Z 1:S 09 Jan 2020 23:12:59.828 # Address updated for node 70b782bfd767fe7a9b8de030771faffa9e7700ec, now 10.8.3.195:6379
I 2020-01-09T23:13:00.233214032Z 1:S 09 Jan 2020 23:13:00.233 # Address updated for node 9a405de42d498d3dd4fc329f43e2e88d5c39241d, now 10.8.3.67:6379 
I 2020-01-09T23:13:00.253728326Z 1:S 09 Jan 2020 23:13:00.253 # Address updated for node c3ae46eda35ac232ee3e39bb7e25c600edc3c28d, now 10.8.3.68:6379
and:
I 2020-01-09T23:13:12.839461468Z 1:S 09 Jan 2020 23:13:12.839 * FAIL message received from c4443eb2039a5be6861e959fac7ef1ad9b105ed3 about cf839f0a7e245f77fb97f7a39ad6eb28603d319f
I 2020-01-09T23:13:12.847391628Z 1:S 09 Jan 2020 23:13:12.847 * FAIL message received from c4443eb2039a5be6861e959fac7ef1ad9b105ed3 about 22f15e7233436d532f2a71d4bca58d4da376cb8a
I 2020-01-09T23:13:13.037341679Z 1:S 09 Jan 2020 23:13:13.037 # Address updated for node 25d3c24a94868ce24ecdb16a8263522793ebeb01, now 10.8.3.201:6379
I 2020-01-09T23:13:13.155890956Z 1:S 09 Jan 2020 23:13:13.155 * Clear FAIL state for node 25d3c24a94868ce24ecdb16a8263522793ebeb01: replica is reachable again.
I 2020-01-09T23:13:14.382931651Z 1:S 09 Jan 2020 23:13:14.382 # Address updated for node 8cae7115a1d2778dbbb81ec92959db69b1174463, now 10.8.4.71:6379

In the cluster nodes output I see:

d031745a16b055d158d9b6b56c6a2da0ccc7b567 10.8.3.3:6379@16379 master - 0 1578611660000 262 connected
6a14699068ea03d89181b3af5d0b5c8a03d3e9ca 10.8.1.67:6379@16379 master - 0 1578611661000 263 connected 13062-13212

At the same time, redis-cli --cluster check shows [OK] All nodes agree about slots configuration.
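For reference, this is roughly how I check the state (any reachable cluster node works as the target; here I'm using the problematic node's address and ID from the logs above):

redis-cli -h 10.8.3.3 -p 6379 cluster nodes | grep d031745a16b055d158d9b6b56c6a2da0ccc7b567
redis-cli --cluster check 10.8.3.3:6379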

I was able to reproduce it several times, but not every time. To reproduce, I kill several nodes at the same time (36 masters and 36 replicas of other masters), give the cluster some time to do the failovers and stabilize, then restore the missing nodes and wait for the cluster to stabilize again. In one of the runs, 5 out of the 36 restored masters became empty masters with 0 slots.
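A reproduction run looks roughly like this; stop-node/start-node and the node list are placeholders for however my setup kills and restores instances:

# kill 36 masters and 36 replicas of other masters at the same time
for n in $NODES_TO_KILL; do stop-node "$n"; done
# give the cluster time to fail over and stabilize (from ~5 minutes up to about an hour in my tests)
sleep 300
# restore the killed nodes and wait for the cluster to stabilize again
for n in $NODES_TO_KILL; do start-node "$n"; done
sleep 300
# check whether any restored master came back as an empty master with 0 slots
redis-cli --cluster check 10.8.3.3:6379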

It may or may not be related to how long the nodes stay down. A couple of times I let the cluster run for about an hour with those nodes missing, and it came back broken. Another couple of times I restored the missing nodes within 5 minutes, and all the old masters became slaves as expected.

Also, in my setup each node sends a cluster meet command to one of the alive nodes right after startup. This should not be necessary, since the node already knows the other nodes, but we do it to handle the situation where the entire cluster goes down and all nodes come back with new IPs.
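Concretely, right after a node starts we run something equivalent to the following; the addresses are placeholders, and the target is simply one node that is known to be alive:

# re-introduce the restarted node by sending CLUSTER MEET to a known-alive node,
# so the cluster still converges even if every node came back with a new IP
redis-cli -h <alive-node-ip> -p 6379 cluster meet <restarted-node-ip> 6379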

This post might not contain enough info to reproduce the issue, but I hope it is at least helpful for finding out whether other Redis users see the same problem. I'm happy to provide more info if necessary.