I deployed a Redis cluster about half a year ago: 50 nodes on 25 physical machines, all 50 nodes are masters (no replication), and no RDB files are used. One of the 25 physical machines was rebooted the other day due to a system failure. After the system recovered, the Redis instances on it came up normally, but one of the 50 cluster nodes got into a strange state. The output of CLUSTER NODES looks like this:

ba94f2cd60995ed430ff7b4785c53ce9672e85da 10.203.3.242:6379@16379 master - 0 1605515275538 725 connected 5886-6212 16368
88c8c310ba41d89f8714dffd2f5f540cfb8fbcf3 10.203.42.106:6380@16380 master - 0 1605515276336 747 connected 13080-13406
a9747ab184019d7df125d535ab754a772159a638 10.132.101.199:6380@16380 master - 0 1605515274000 746 connected 12426-12752
367e662653e6651b0f9f8b2b9086d3abbb4b1cd4 10.132.103.18:6380@16380 master - 0 1605515277336 758 connected 15696-16022
4fb60249208171883007711ef03ab968ef7c8914 10.203.45.74:6379@16379 master - 0 1605515274034 754 connected 6213-6539 16369
8626a21e94888a7f2a48e18f30c1bff1f5c22141 10.132.103.140:6379@16379 master - 0 1605515275000 718 connected 1635-1961 16355
53f0dfc62f12ada581af93144813659497d755d8 10.132.68.210:6380@16380 master - 0 1605515274534 741 connected 10137-10463 16381
acaccbcb9def926fc97af02aa11e0f065973e2c4 10.203.27.88:6380@16380 master - 0 1605515274534 752 connected 15042-15368 16352
f4fc6c952c02e3cc30ed1870fdc1f2896b3c9378 10.203.3.242:6380@16380 master - 0 1605515277035 750 connected 14061-14387
382fee2c306cba55898b229a9442139f75443a12 10.132.102.83:6379@16379 master - 0 1605515276000 739 connected 2616-2942 16358
5bce3198d3c5996aac761540c374d36d66672a96 10.203.51.169:6380@16380 master - 0 1605515273533 759 connected 8502-8828 16376
1562420edc16a6e802aeb86456aaf38111eeac60 10.132.77.19:6379@16379 master - 0 1605515275000 761 connected 7194-7520 16372
52a674ec9292dc73f9c5e540b40b24b2d2550510 10.203.27.87:6380@16380 master - 0 1605515274000 763 connected 8829-9155 16377
5c6914cdacaf06c7f407f5887d25c9f1f87a6621 10.132.69.201:6380@16380 master - 0 1605515278537 764 connected 3924-4250 16362
a27da3de1f1365389d4c91aa263f923e7f2324d3 10.203.3.243:6379@16379 master - 0 1605515275000 734 connected 0-326 16350
ff60f8eb009c594b8254d7f3cb9509c1b89b2a25 10.132.101.199:6379@16379 master - 0 1605515277000 721 connected 4251-4577 16363
6d4623931fab74cd4a405bf2dd5eeaf3bbe0e40d 10.132.77.81:6379@16379 master - 0 1605515274000 736 connected 5232-5558 16366
db98efe22e10edac461c3b7a2f7a11f611df7f6e 10.203.45.74:6380@16380 master - 0 1605515274000 751 connected 14388-14714
3a8bd3c2701445727c90280805dbac229c2ab463 10.203.3.243:6380@16380 master - 0 1605515272000 737 connected 8175-8501 16375
58a2bf2713d1520174a29de856aec4e1aea5a114 10.132.102.199:6380@16380 master - 0 1605515277000 742 connected 11118-11444
28aea9aa261f724f9dd3461e2084b75ed60d21df 10.203.45.107:6379@16379 master - 0 1605515277537 719 connected 2289-2615 16357
**ba6dc3929ad6b575c091108564d237fa2f765168 10.132.69.139:6379@16379 master,fail - 1605495090788 1605495090788 733 connected 3597-3923 16361**
71a8d2b55b8f07cd95347f11a7eecdc4013510e5 10.132.69.202:6379@16379 master - 0 1605515278037 717 connected 1308-1634 16354
032c386bd81b6e88af94a80aa9b1e6276fde90f6 10.132.68.210:6379@16379 master - 0 1605515275000 762 connected 1962-2288 16356
4b6f7cd3169168d8f86ef6ef56626039c608fd39 10.203.27.87:6379@16379 master - 0 1605515277000 730 connected 7521-7847 16373
e4d2cc463fe7efc5de8771d8942f241113c0e974 10.132.69.23:6379@16379 master - 0 1605515275000 724 connected 5559-5885 16367
497b9c77e216c9cd70ebe26d9a6676cff6f383ab 10.132.69.201:6379@16379 master - 0 1605515275333 732 connected 12099-12425
c0e7f436197bb815ca19e663cd35908830cebc06 10.132.103.18:6379@16379 master - 0 1605515278537 731 connected 654-980
bc9084c94cb3b65deb0f7683675b3bc5960e1ee3 10.203.51.169:6379@16379 master - 0 1605515276000 716 connected 327-653 16351
9fd7de52343728a571c06ad933bdcc8c7f4282e1 10.132.77.81:6380@16380 master - 0 1605515278537 748 connected 13407-13733
509479b987cfb0ed1f6fcbfc0b7629bbb109b8ee 10.203.42.106:6379@16379 master - 0 1605515276536 723 connected 4905-5231 16365
243563bca9524b8f1ac38fa785356f521dcc9841 10.132.77.19:6380@16380 master - 0 1605515272000 756 connected 15369-15695
1d183e5fbe7be13a61cb01a9f232b998e1b91391 10.132.67.87:6379@16379 master - 0 1605515275000 735 connected 981-1307 16353
083fdbdb78bf7e14e2312437b00b86ef07df718f 10.203.45.107:6380@16380 master - 0 1605515272000 745 connected 10464-10790 16382
fa309048ddbb0d25f810c84fb4f3bdc3b8aa29d7 10.203.27.88:6379@16379 master - 0 1605515272533 728 connected 6867-7193 16371
dce49094233d4756ed507e1688d7c26ca863a766 10.132.69.139:6380@16380 master - 0 1605515275000 744 connected 11772-12098
c379d99be6fc25af184556562532223996752e58 10.203.44.150:6379@16379 master - 0 1605515274000 729 connected 7848-8174 16374
1a048d140c6eb856d8458cedce978a9f215862fb 10.132.102.83:6380@16380 master - 0 1605515272533 765 connected 10791-11117 16383
75173d39b4390d9690864f3bc0869cf1eff30918 10.203.44.150:6380@16380 master - 0 1605515278037 753 connected 16023-16349
52adedeb5678f23b39906d0de168246b7647f569 10.203.1.199:6380@16380 master - 0 1605515277000 757 connected 12753-13079
045cf2bea5c5acbe9b84d5ffd14703769ef453be 10.203.42.108:6380@16380 master - 0 1605515277000 760 connected 14715-15041
d5c88671f8fec144761a27ee14fd881613b9a226 10.132.69.88:6379@16379 master - 0 1605515275000 720 connected 3270-3596 16360
bf21be06d2d7e00a1af2af25e6e60a9d3f9157ee 10.203.42.108:6379@16379 master - 0 1605515276000 727 connected 6540-6866 16370
78f7680c2b112f514958770928e5644ba6a9fb29 10.132.103.140:6380@16380 master - 0 1605515275000 740 connected 9810-10136 16380
670afa6067176cf86ec2c4ab851fd3d62b50bc6e 10.203.1.199:6379@16379 master - 0 1605515278538 722 connected 4578-4904 16364
87956a56e294688b0e50fd6fea6259ac547baddc 10.132.102.199:6379@16379 master - 0 1605515275533 726 connected 2943-3269 16359
8fbbb565f6f20a25885a347dbed6f25cf341acbd 10.132.69.23:6380@16380 master - 0 1605515278337 755 connected 13734-14060
e2c0bad0c525e75f9ec43eaab56e4278ffbf5591 10.132.69.88:6380@16380 master - 0 1605515275000 743 connected 11445-11771
ab26ad7a3d805dc18ccccd010f493ae347af696d 10.132.67.87:6380@16380 myself,master - 0 1605515271000 738 connected 9156-9482 16378
1b5494a4460f9b149ec4bb473f7104bcd7a53117 10.132.69.202:6380@16380 master - 0 1605515276000 749 connected 9483-9809 16379

The line in bold shows a node that is connected yet still flagged `master,fail`. The newly started master nodes are 10.132.69.139:6379 and 10.132.69.139:6380, and this strange result appears only when the CLUSTER NODES command is issued after connecting to 10.132.67.87:6380. I tried sending GET/SET/INFO commands to the newly started nodes, and they all responded normally with correct results.

Now I want to know: is this harmless to the cluster, and what should I do to put the node back into the right state?

Comment From: madolson

I don't think it is strictly a concern for you, since redirects will still work. I'm going to take a deep dive into it, though.

Comment From: madolson

We only clear the fail state when the node is responding to us, and for some reason those two nodes aren't talking to each other. If you take a look at the log line:

ba6dc3929ad6b575c091108564d237fa2f765168 10.132.69.139:6379@16379 master,fail - 1605495090788 1605495090788 733 connected 3597-3923 16361

Ping sent time -> 1605495090788
Pong response time -> 1605495090788

That timestamp is about 6 hours older than the next closest node's. It also means we sent a ping, and the other node never responded.

Comment From: madolson

My best guess, for right now, is that we both received a pong and sent a ping in the same ms, and that ping got dropped. In that case, we would never detect that dropped ping.

Comment From: zhangwei17

Sorry for the long silence, and thanks for your analysis. It seems the strange node sent a ping and has been waiting for the pong forever? What should I do to make it send a new ping so that it can get the right detection result?

Comment From: madolson

Thanks for following up about this, I forgot to circle back. I think the gap is here: https://github.com/redis/redis/blob/7de6451818175c41ed5cda5d54d7cb9ebb1a81ad/src/cluster.c#L3633. This is the detector for stale connections, but it didn't work here because it failed to detect that the connection was hanging. Changing the condition to `node->pong_received <= node->ping_sent && /* still waiting pong */` would probably have caught this dropped packet and killed the connection. I can submit a quick PR tomorrow.

Comment From: trevor211

> Thanks for following up about this, I forgot to circle back. I think the gap is here: https://github.com/redis/redis/blob/unstable/src/cluster.c#L3641. This is the detector for stale connections, but it didn't work since it failed to detect the connection was hanging. So probably node->pong_received <= node->ping_sent && /* still waiting pong */ would have caught this dropped packet and killed the connection. I can submit a fast PR tomorrow

Maybe you can expand the URL to its canonical (commit-pinned) form, since the code may change in the future.