Describe the bug

Look at the following log. I don't know why such a strange node ID is being generated.

172:S 08 Sep 2021 21:59:52.869 * MASTER <-> REPLICA sync: Finished with success
172:S 08 Sep 2021 22:10:16.009 * Marking node f82b06fc2e0007cd62d66a as failing (quorum reached).
172:S 08 Sep 2021 22:10:16.009 * Marking node as failing (quorum reached).
172:S 08 Sep 2021 22:17:51.019 * FAIL message received from 0195281d4892b6036155b28d195871270415fac6 about Bh
172:S 08 Sep 2021 22:17:51.019 * FAIL message received from 0195281d4892b6036155b28d195871270415fac6 about Bh
172:S 08 Sep 2021 22:17:51.019 * FAIL message received from 0195281d4892b6036155b28d195871270415fac6 about Bh
172:S 08 Sep 2021 22:17:51.019 * FAIL message received from 0195281d4892b6036155b28d195871270415fac6 about Bh
172:S 08 Sep 2021 22:17:51.019 * FAIL message received from 0195281d4892b6036155b28d195871270415fac6 about Bh
172:S 08 Sep 2021 23:14:26.014 * FAIL message received from 44367bc169f1860f4460467483d929917a083158 about Bh

Version: 6.2.4

I created a cluster of 160 masters and 160 replicas. These strange node IDs will appear in a moment.
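
(For reference, a minimal sketch of how a cluster like this might be created; the host list below is a hypothetical placeholder for the 320 real addresses.)

# Hypothetical host:port list; expand to all 320 real addresses.
HOSTS=(10.0.0.1:7001 10.0.0.1:7002 10.0.0.2:7001 10.0.0.2:7002)
/apps/svr/redis-6.2.4/bin/redis-cli --cluster create "${HOSTS[@]}" --cluster-replicas 1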

Comment From: madolson

@wonderful1984 Do you have any more information here?

These strange node IDs will appear in a moment.

Do you mean that they go away without any manual intervention? This very clearly looks like some type of memory corruption, but I'm having a hard time understanding what could be causing it. The ID could be corrupted in the config file, it could be corrupted on the source node, or it could be corrupted by one of the nodes in the cluster, which then gossips the information.
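
One quick check for this kind of corruption is that a well-formed cluster node ID is exactly 40 hex characters. A minimal sketch, assuming the port layout used elsewhere in this thread:

# Flag any CLUSTER NODES entry whose first field is not a 40-char hex node ID.
# Port 7001 is taken from the commands later in this thread; adjust as needed.
# (gawk syntax; older awks may need --re-interval for the {40} repetition.)
/apps/svr/redis-6.2.4/bin/redis-cli -p 7001 cluster nodes \
  | awk '$1 !~ /^[0-9a-f]{40}$/ { print "suspect entry: " $0 }'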

Comment From: wonderful1984

It may be that during K8s operations I shut down Docker immediately after issuing FORGET across the cluster, which may cause this problem in some cases.

After the FORGET command is executed, how long should I wait before it is safe to shut down Redis?

Comment From: madolson

I wouldn't expect it to be related to K8s shutting things down. Calling FORGET will remove the node from the gossiping part of the cluster. If you want to be safe, you should call SHUTDOWN on the node you are removing before removing it from the cluster. Otherwise it will continue reaching out and sending messages. A sketch of that ordering follows below.
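
A minimal sketch of that ordering, with hypothetical addresses and ports (10.0.0.5:7006 stands in for the node being removed):

# Record the departing node's ID, stop it first so it can no longer gossip,
# then tell every remaining node to forget it.
NODE_ID=$(redis-cli -h 10.0.0.5 -p 7006 cluster myid)
redis-cli -h 10.0.0.5 -p 7006 shutdown nosave
for port in 7001 7002 7003 7004 7005; do
  redis-cli -p "$port" cluster forget "$NODE_ID"
done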

Comment From: wonderful1984

This problem has happened again.

/apps/svr/redis-6.2.4/bin/redis-cli -p 7001 cluster nodes|grep -a fail
3020353553fcbac96217db5d787fe95db2e65e83 :0@0 fail - 1636583875773 0 0 disconnected
Bmb6c6a2088db443caa86a9cb3111ed92b :7006@17006 slave,fail - 1636193808796 0 0 disconnected
Bj :7003@17003 slave,fail - 1636583875773 0 0 disconnected
f2f57482a417b11514b1fb :0@0 fail - 1636266020191 0 0 disconnected
Bo56b23ede7b0c2acb87ee97add7df8d38 :7004@17004 master,fail - 1636275378242 0 0 disconnected
7b08e97e3f81c3a4a11e8662933a0a47a25:7006@17006 master,fail - 1636030199046 0 0 connected
f2f57482a417b11514b1fb3b3b09760714c50f1c :0@0 fail - 1636275378242 0 0 disconnected
Bj4e357f0d684d50892bce947c31493c1e :0@0 fail - 1636275378242 0 0 disconnected
Bi2a3161cc8a7f3bd7923489de 40ce9d81:0@0 fail - 1636568919130 0 0 disconnected
Bmdfb48e240e448d0be7bac0372d777d28 :0@0 fail - 1636583875557 0 0 disconnected
5fca021154772088bf6278cf949f5561d38cb9c4 8083a451:0@0 fail - 1636568919136 0 0 disconnected
Bi6b13746879446ea83cd939a7 :7001@17001 slave,fail - 1636568919136 0 0 disconnected
:0@0 fail - 1636020990094 0 0 disconnected

178373:M 11 Nov 2021 02:28:40.081 * FAIL message received from 91c3730471874e9a33b4850c47cd95f9acfc5ec2 about 5fca021154772088bf6278cf949f5561d38cb9c4
178373:M 11 Nov 2021 02:28:40.082 * FAIL message received from 91c3730471874e9a33b4850c47cd95f9acfc5ec2 about
178373:M 11 Nov 2021 02:28:40.082 * FAIL message received from 91c3730471874e9a33b4850c47cd95f9acfc5ec2 about
178373:M 11 Nov 2021 03:36:32.213 * FAIL message received from 37347d0fe75b369778d60d4d5afaf4879af34dc4 about 00f9691e8a6d5bd26505e6c0c4feb43ce9e128d8

pstack log

Thread 11 (Thread 0x2aace1f29700 (LWP 5481)):
#0  0x00002aacda63e6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000049ba55 in bioProcessBackgroundJobs ()
#2  0x00002aacda63adc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00002aacda94673d in clone () from /lib64/libc.so.6
Thread 10 (Thread 0x2aace232a700 (LWP 5482)):
#0  0x00002aacda63e6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000049ba55 in bioProcessBackgroundJobs ()
#2  0x00002aacda63adc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00002aacda94673d in clone () from /lib64/libc.so.6
Thread 9 (Thread 0x2aace272b700 (LWP 5483)):
#0  0x00002aacda63e6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000049ba55 in bioProcessBackgroundJobs ()
#2  0x00002aacda63adc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00002aacda94673d in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x2aace292c700 (LWP 5484)):
#0  0x00002aacda6411bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00002aacda63cd02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00002aacda63cc08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000452218 in IOThreadMain ()
#4  0x00002aacda63adc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00002aacda94673d in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x2aace2b2d700 (LWP 5485)):
#0  0x00002aacda6411bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00002aacda63cd02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00002aacda63cc08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000452218 in IOThreadMain ()
#4  0x00002aacda63adc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00002aacda94673d in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x2aace2d2e700 (LWP 5486)):
#0  0x00002aacda6411bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00002aacda63cd02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00002aacda63cc08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000452218 in IOThreadMain ()
#4  0x00002aacda63adc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00002aacda94673d in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x2aace2f2f700 (LWP 5487)):
#0  0x00002aacda63ea82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000514e91 in background_thread_entry ()
#2  0x00002aacda63adc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00002aacda94673d in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x2aad0de00700 (LWP 241555)):
#0  0x00002aacda63e6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000514882 in background_thread_entry ()
#2  0x00002aacda63adc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00002aacda94673d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x2aad0e600700 (LWP 241556)):
#0  0x00002aacda63e6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000514882 in background_thread_entry ()
#2  0x00002aacda63adc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00002aacda94673d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x2aad0ee00700 (LWP 241557)):
#0  0x00002aacda63e6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000514882 in background_thread_entry ()
#2  0x00002aacda63adc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00002aacda94673d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2aacd9d31cc0 (LWP 5480)):
#0  0x00002aacda93bdfd in poll () from /lib64/libc.so.6
#1  0x00002aad0c423d64 in __libc_res_nsend () from /lib64/libresolv.so.2
#2  0x00002aad0c421c5e in __libc_res_nquery () from /lib64/libresolv.so.2
#3  0x00002aad0c422bf5 in __libc_res_nsearch () from /lib64/libresolv.so.2
#4  0x00002aad0c215be3 in _nss_dns_gethostbyname4_r () from /lib64/libnss_dns.so.2
#5  0x00002aacda92c378 in gaih_inet () from /lib64/libc.so.6
#6  0x00002aacda92fa3d in getaddrinfo () from /lib64/libc.so.6
#7  0x0000000000431d9c in anetTcpGenericConnect ()
#8  0x00000000004d9183 in connSocketConnect ()
#9  0x0000000000490cf4 in clusterCron ()
#10 0x000000000043950d in serverCron ()
#11 0x000000000043136d in aeProcessEvents ()
#12 0x000000000043176d in aeMain ()
#13 0x000000000042df99 in main ()
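
Note that Thread 1 (the main thread) is blocked in getaddrinfo() under clusterCron(), i.e. the corrupted address field is being resolved as a DNS name from inside the event loop. A minimal sketch for confirming the main thread keeps blocking there, assuming a single redis-server process per host:

# Sample the main thread's stack a few times; if it repeatedly sits in
# getaddrinfo()/__libc_res_nsend(), the event loop is stuck on DNS resolution.
for i in 1 2 3; do
  pstack "$(pidof redis-server)" | grep -A 14 'Thread 1 '
  sleep 1
done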

Comment From: wonderful1984

Why is the node ID empty?

178373:M 11 Nov 2021 02:28:40.081 * FAIL message received from 91c3730471874e9a33b4850c47cd95f9acfc5ec2 about 5fca021154772088bf6278cf949f5561d38cb9c4
178373:M 11 Nov 2021 02:28:40.082 * FAIL message received from 91c3730471874e9a33b4850c47cd95f9acfc5ec2 about
178373:M 11 Nov 2021 02:28:40.082 * FAIL message received from 91c3730471874e9a33b4850c47cd95f9acfc5ec2 about

Node ID: 91c3730471874e9a33b4850c47cd95f9acfc5ec2

208922:M 11 Nov 2021 02:28:40.079 * Marking node 5fca021154772088bf6278cf949f5561d38cb9c4 as failing (quorum reached).
208922:M 11 Nov 2021 02:28:40.080 * Marking node as failing (quorum reached).
208922:M 11 Nov 2021 02:28:40.080 * Marking node as failing (quorum reached).

/apps/svr/redis-6.2.4/bin/redis-cli -p 7007 cluster nodes|grep -a 5fca021154772088bf6278cf949f5561d38cb9c4
5fca021154772088bf6278cf949f5561d38cb9c4 8083a451:0@0 fail - 1636568919141 0 0 disconnected

Where did 8083a451 come from? It causes a high load on the DNS server.

[bad udp cksum 0xc0cc -> 0x9fbe!] 29578+ AAAA? 8083a451. (26)
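
A minimal sketch of how such queries can be caught on the wire; the interface and filter are assumptions:

# Watch DNS traffic for lookups of the garbage "hostname" seen above (run as root).
tcpdump -i any -nn udp port 53 2>/dev/null | grep -a '8083a451'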

Comment From: wonderful1984

Server

redis_version:6.2.4
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:c3a4f2649228a210
redis_mode:cluster
os:Linux 3.10.0-862.9.1.el7.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:c11-builtin
gcc_version:8.4.0
process_id:122171
process_supervised:no
run_id:404c162b650a0de86b8101613bd44aba09613e47
tcp_port:7007
server_time_usec:1636708873956401
uptime_in_seconds:853254
uptime_in_days:9
hz:10
configured_hz:10
lru_clock:9318921