Describe the bug
The redis cluster config file seems to be picking up some kind of random data, which is causing cluster failures. It looks as if some kind of key or value is getting written into the file pointed to by cluster-config-file.
$ redis-cli -p 6381 cluster nodes
972b2f816c7d72ccb6449b8b0c63dbbf4bfd0ca8 10.137.7.40:6380@16380 slave 381ec2e402c00e0d62acdbeb3255258e60d7e877 0 1695917430000 23 connected
b01a16187efd2f6ce2fb83aff0ae9c9fe5af12da 10.137.7.41:6381@16381 master - 0 1695917432794 26 connected 9830-13106
57d52c56fcda7d882570e747896368e39607074e 10.137.7.42:6380@16380 slave b01a16187efd2f6ce2fb83aff0ae9c9fe5af12da 0 1695917428248 26 connected
fc0e0d36823eb9e1c1b0cf75715288b8278abec3 10.137.7.42:6382@16382 slave 4001c9f978be36ff6cea881f60816ef89f80d707 0 1695917434815 24 connected
der-open-rtl.p :1500@1612 fail - 1695913377873 0 0 disconnected
4001c9f978be36ff6cea881f60816ef89f80d707 10.137.7.41:6380@16380 master - 0 1695917430000 24 connected 3277-6553
a83f2651608ebe737b4e2443edf2da25d724ce34 10.137.7.42:6381@16381 slave,fail cb7dec4880391a902d130975bc54e4b42c2e2ee9 1695889520061 1695889514000 25 disconnected
b21407b2ccef07f85df7af2e1eb0769aa2af4ded 10.137.7.40:6383@16383 slave b01a16187efd2f6ce2fb83aff0ae9c9fe5af12da 0 1695917432000 26 connected
9210ac7ae4195b572e7912b3e4abd71e4916b817 10.137.7.41:6382@16382 slave b01a16187efd2f6ce2fb83aff0ae9c9fe5af12da 0 1695917434000 26 connected
:0@0 fail - 1695666112498 0 0 disconnected
54971a279b052a4c9f52904b1e43196a4ddfc535 10.137.7.42:6379@16379 slave,fail cb7dec4880391a902d130975bc54e4b42c2e2ee9 1695676525848 1695676519796 25 disconnected
2f8b076406756e6 :0@0 fail - 1695675622118 0 0 disconnected
74 :0@0 fail - 1695898003140 0 0 disconnected
40207d183037d46913f1aa34f0648ee9a85fe266 10.137.7.42:6383@16383 slave 70fcb6340d6676ff501054b232239b3d00026180 0 1695917432000 11 connected
b648d2cd8dc6d35f457b5288a296df800c9ee3e8 10.137.7.40:6379@16379 slave 4001c9f978be36ff6cea881f60816ef89f80d707 0 1695917430000 24 connected
70fcb6340d6676ff501054b232239b3d00026180 10.137.7.41:6379@16379 master - 0 1695917431580 11 connected 6554-9829
6e31b547940234936b6ef0163bf39f21159f52b8 :0@0 master,fail,noaddr - 1695532668218 1695532473444 22 disconnected
2f8b076406756e6d518d1ab4ea4e609e48f9caf5 10.137.7.40:6382@16382 slave cb7dec4880391a902d130975bc54e4b42c2e2ee9 0 1695917431782 25 connected
381ec2e402c00e0d62acdbeb3255258e60d7e877 10.137.7.41:6383@16383 master - 0 1695917432000 23 connected 13107-16383
.png
$12
803746208768
$12
drop-yes.p
803746208769
$12
drop-yes.png
$12
80374628e80:12344@14135 master,fail?,fail - 1695916197677 0 0 disconnected
2
803746337032
$12
drop-yes.gif
:0@0 fail - 1695902042811 0 0 disconnected
der-open-rtl.png
$12
803746080771
$19 -open-rtl.png
$12
803746080772
$12
drop-ad:25646@26473 master,fail?,fail,nofailover - 1695907661695 0 0 disconnected
cb7dec4880391a902d130975bc54e4b42c2e2ee9 10.137.7.40:6381@16381 myself,master - 0 1695917430000 25 connected 0-3276
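To confirm the garbage is in the on-disk cluster config file itself and not only in the runtime view, a quick check is to ask the node where that file lives and then inspect it directly (the path below is just an example for this host):
redis-cli -p 6381 CONFIG GET cluster-config-file
redis-cli -p 6381 CONFIG GET dir
# then inspect the file the node actually loads, e.g.:
cat /var/lib/redis/nodes-6381.conf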
To reproduce
Not sure how those entries ended up in the configuration. It is only happening in one of our clusters; we have lots of clusters with the same setup and it is not happening anywhere else.
Expected behavior
Clean node and epoch details in the redis cluster config file.
Comment From: roshkhatri
Hi, can you tell us which engine version you are using here?
The data looks fairly explicit, like -open-rtl.png, der-open-rtl.png, etc. Is there a script that touches the nodes.conf file?
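If nothing obvious writes to it, one way to catch the writer (just a sketch; the path is an example and auditd needs root) is to put an audit watch on the file:
# watch writes and attribute changes on the cluster config file (example path)
auditctl -w /var/lib/redis/nodes-6381.conf -p wa -k redis_nodes_conf
# later, list the processes that touched it
ausearch -k redis_nodes_conf -i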
Comment From: udayjalagam
Hello @roshkhatri, sorry, for some reason I missed your comments. We are using Redis version 6.2.6. Yes, those entries do look like some kind of keys or values. We don't have any scripts or programs that directly touch nodes.conf.
The same cluster got into the same situation again. This time it doesn't have readable strings like last time; it has some random characters instead. Here is some output below with the random characters:
9e06a88e2d9c80662ca228c12a60d3f7537c1ddb 10.137.7.40:6380@16380 slave,fail 96533608db6c8f3cc9d85e5f239535e2e6dadac1 1697826966687 1697826960149 52 disconnected
c2ae0d44:file:e82c136cfe994 5fc873ba2e0 :0@0 fail - 1697827884749 0 0 disconnected
e1e690e243c560a81abc03dd65c92aabc9ad291c 10.137.7.42:6383@16383 slave 381ec2e402c00e0d62acdbeb3255258e60d7e877 0 1698246500000 38 connected
4001c9f978be36ff6cea881f60816ef89f80d707 10.137.7.41:6380@16380 master - 0 1698246500543 27 connected 3278-6553
:0@0 fail - 1697774228891 0 0 disconnected
70fcb6340d6676ff501054b232239b3d00026180 10.137.7.41:6379@16379 master - 0 1698246499532 37 connected 6554-9831
63465f6084092ccc5c853ae1e2f6c3a8cbf56b0d 10.137.7.40:6379@16379 myself,slave 70fcb6340d6676ff501054b232239b3d00026180 0 1698246498000 37 connected
aa4ecf6a691be4b2107b441d121415ea513d1c7d 10.137.7.40:6383@16383 slave 381ec2e402c00e0d62acdbeb3255258e60d7e877 0 1698246498000 38 connected
E4J@@G
)
*PQ/g ;?~XG:0@0 fail - 1697827884749 0 0 disconnected
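One way to see exactly what those bytes are (the filename below is just an example for this node) is to hexdump the config file and look for non-printable data:
hexdump -C /var/lib/redis/nodes-6380.conf | less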
Comment From: udayjalagam
I also see that some kind of corruption seems to be happening in the RDB dump, and it is also breaking replication, since the replica can't accept a key the master has accepted. I'm thinking some data is getting corrupted after being written to the master's memory; otherwise I'm not sure how the master could hold a data set that the replica can't take.
At first:
3752750:S 20 Oct 2023 05:02:04.567 * MASTER <-> REPLICA sync: Master accepted a Partial Resynchronization.
4136345:C 20 Oct 2023 05:02:13.048 * DB saved on disk
4136345:C 20 Oct 2023 05:02:13.789 * RDB: 93 MB of memory used by copy-on-write
3752750:S 20 Oct 2023 05:02:14.419 * Background saving terminated with success
3752750:S 20 Oct 2023 05:16:00.544 # Connection with master lost.
3752750:S 20 Oct 2023 05:16:00.544 * Caching the disconnected master state.
3752750:S 20 Oct 2023 05:16:00.544 * Reconnecting to MASTER 10.137.7.41:6380
3752750:S 20 Oct 2023 05:16:00.544 * MASTER <-> REPLICA sync started
3752750:S 20 Oct 2023 05:16:00.544 * Non blocking connect for SYNC fired the event.
3752750:S 20 Oct 2023 05:16:00.545 * Master replied to PING, replication can continue...
3752750:S 20 Oct 2023 05:16:00.545 * Trying a partial resynchronization (request 1b81ff5c96fdfed858056d0aedf38cb28f58aaec:34874862577).
3752750:S 20 Oct 2023 05:16:00.545 * Successful partial resynchronization with master.
3752750:S 20 Oct 2023 05:16:00.545 * MASTER <-> REPLICA sync: Master accepted a Partial Resynchronization.
3752750:S 20 Oct 2023 06:00:43.985 # Bad message length or signature received from Cluster bus.
3752750:S 20 Oct 2023 06:00:44.085 # Bad message length or signature received from Cluster bus.
3752750:S 20 Oct 2023 06:00:44.186 # Bad message length or signature received from Cluster bus.
3752750:S 20 Oct 2023 06:00:44.287 # Bad message length or signature received from Cluster bus.
Then, before crashing:
3752849:S 20 Oct 2023 04:09:33.224 * Loading RDB produced by version 6.2.6
3752849:S 20 Oct 2023 04:09:33.224 * RDB age 309 seconds
3752849:S 20 Oct 2023 04:09:33.224 * RDB memory usage when created 8046.56 Mb
3752849:S 20 Oct 2023 04:12:37.758 # Internal error in RDB reading offset 4903141281, function at rdb.c:1992 -> Hash ziplist integrity check failed.
3752849:S 20 Oct 2023 04:13:31.680 # Terminating server after rdb file reading failure.
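A check worth running here (the filename below is the default; ours may differ) is to validate the saved RDB offline with the checker that ships with Redis:
# offline integrity check of the dump the replica failed to load
redis-check-rdb /var/lib/redis/dump.rdb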
Comment From: udayjalagam
Digging a little more, it looks like the first set of events was a few replicas losing replication with the error Disconnecting timedout replica (streaming sync):, and after many retries the slave finally gave up once it hit Internal error in RDB reading offset 4903141281, function at rdb.c:1992 -> Hash ziplist integrity check failed.
Not sure what caused the initial disconnect or what is corrupting the RDB.
3809241:M 20 Oct 2023 02:31:19.646 * Background saving terminated with success
3809241:M 20 Oct 2023 03:30:12.047 # Disconnecting timedout replica (streaming sync): 10.137.7.40:6381
3809241:M 20 Oct 2023 03:30:12.048 # Connection with replica 10.137.7.40:6381 lost.
3809241:M 20 Oct 2023 03:30:12.050 * Replica 10.137.7.40:6381 asks for synchronization
3809241:M 20 Oct 2023 03:30:12.050 * Partial resynchronization request from 10.137.7.40:6381 accepted. Sending 0 bytes of backlog starting from offset 46806371790.
3809241:M 20 Oct 2023 03:31:20.004 * 1000 changes in 3600 seconds. Saving...
3809241:M 20 Oct 2023 03:31:20.460 * Background saving started by pid 14736
14736:C 20 Oct 2023 03:34:09.750 * DB saved on disk
14736:C 20 Oct 2023 03:34:10.057 * RDB: 86 MB of memory used by copy-on-write
3809241:M 20 Oct 2023 03:34:10.430 * Background saving terminated with success
3809241:M 20 Oct 2023 03:40:05.841 # Disconnecting timedout replica (streaming sync): 10.137.7.40:6381
3809241:M 20 Oct 2023 03:40:05.841 # Connection with replica 10.137.7.40:6381 lost.
3809241:M 20 Oct 2023 03:40:05.843 * Replica 10.137.7.40:6381 asks for synchronization
3809241:M 20 Oct 2023 03:40:05.843 * Partial resynchronization request from 10.137.7.40:6381 accepted. Sending 110 bytes of backlog starting from offset 46814169084.
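For context on the Disconnecting timedout replica (streaming sync) message: as I understand it, the master drops a replica that has not acknowledged within repl-timeout, and the replica output buffer limit is the other common reason a master disconnects a replica. A minimal redis.conf sketch of the directives involved (values shown are the Redis defaults, not necessarily what this cluster runs):
repl-timeout 60
client-output-buffer-limit replica 256mb 64mb 60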