Cluster scale: 512 nodes; each master has three slaves connected.
One node in the cluster is stuck in the handshake state, and the node itself has gone down. In the CLUSTER NODES output its node ID keeps changing, but it cannot join the cluster.
How can I remove this handshake node from the cluster?
Comment From: zhengfc
The node ID changed because you started Redis from a different directory and your redis.conf defines a relative dir; I think you set ./ as the directory. Either always start Redis from the same directory, or change dir in redis.conf to an absolute path.
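A quick way to check for this is sketched below. It assumes the path to the config file is passed as the first argument and that the dir value contains no spaces; the function name `conf_dir_is_absolute` is made up for illustration and is not part of Redis.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether the last `dir` directive in a
# redis.conf-style file is an absolute path.
conf_dir_is_absolute() {
    local conf_file=$1 dir
    # Keep the last `dir` line (later directives override earlier ones)
    # and strip optional double quotes around the value.
    dir=$(awk '$1 == "dir" { d = $2 } END { gsub(/"/, "", d); print d }' "$conf_file")
    case $dir in
        /*) echo "absolute: $dir"; return 0 ;;
        "") echo "no dir directive"; return 1 ;;
        *)  echo "relative: $dir"; return 1 ;;
    esac
}
```

With a relative dir, nodes.conf (which stores the node ID) is looked up relative to wherever the process was started, so starting from another directory produces a fresh node ID.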
Comment From: yongman
This happened when a node went down and I told the cluster to forget it, but then the node came back up, tried to join the cluster again, and showed up in the handshake state. At the same time the node went down again, so the handshake could not complete?
Comment From: antirez
Exactly, as @YongMan said, this happens in every case where, basically: 1. A node will try to connect to a cluster, assuming the cluster knows about it 2. But actually: the cluster does not know about it.
This happens both when you CLUSTER FORGET the node, or simply when you reconfigure a whole cluster from scratch but don't reconfigure some single node, which will keep trying to connect with the old configuration. You should identify this physical Redis instance and shut it down, or configure it correctly using CLUSTER MEET to let it join the cluster properly.
Comment From: yongman
@antirez Yes, this happened in a production environment. The dead node's host is down and its IP address is unique. That is to say: 1. the dead node is already in handshake status in CLUSTER NODES; 2. the node cannot be restarted to complete the handshake; 3. if a different cluster reuses the dead node's address after the machine is repaired, it will cause confusion; 4. how can this state be fixed?
Comment From: antirez
@YongMan I'm not sure this is your case, because nodes in handshake state are automatically deleted by Redis Cluster if they don't change state within the node-timeout amount of milliseconds. So in your environment there is likely still a node that advertises itself constantly. You can also use CLUSTER FORGET in order to remove a stale node, but if it appears again, as I said, it's because the node is still alive and is connecting to the other nodes again and again.
TLDR: I believe your hypothesis that your external node is just in memory of the other nodes is likely wrong. Probably it is running and trying to connect to the other nodes constantly.
Comment From: yongman
@antirez The handshake node appeared again. I am sure the node is completely down, every node has 'node-timeout' configured, and I sent 'cluster forget' to all the remaining nodes. But when I run 'cluster nodes' on other nodes, it sometimes returns nodes in handshake state. It seems that gossip spreads the handshake node across the cluster?
Comment From: yongman
@antirez It was my fault.
When I sent the cluster forget command to all the other nodes, it did not succeed on every node (maybe because of network jitter). The node that missed the command then kept advertising the dead node via gossip.
The solution is to send cluster forget again to the node that missed it.
I'll close this. Thanks
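To find which nodes missed the command, one option is to ask each node for its own view and list the entries it still marks as handshake or failed. A minimal sketch, assuming the node ID is field 1 and the flags are field 3 of the CLUSTER NODES output; the function name `stale_node_ids` is hypothetical:

```shell
#!/usr/bin/env bash
# Read CLUSTER NODES output on stdin and print the IDs of entries whose
# comma-separated flags (field 3) contain "handshake" or "fail".
# Entries flagged only "fail?" (suspected, not confirmed) are not matched.
stale_node_ids() {
    awk '$3 ~ /(^|,)(handshake|fail)(,|$)/ { print $1 }'
}
```

Usage would look like `redis-cli -h "$host" -p "$port" cluster nodes | stale_node_ids`, repeated for every node; any node that prints an ID still needs a CLUSTER FORGET.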
Comment From: luweijie007
@YongMan How did you fix this problem? I want to use "cluster forget nodeid" to clear the node info, but the node ID changes every second.
Comment From: antirez
The node ID changes because the node is still not part of the cluster, just in the handshake stage. Shut down the instance that is not part of the cluster but that is trying to connect.
Comment From: luweijie007
@antirez In my situation the Redis instances have already been shut down, but the node information cannot be cleared. I tried removing the shut-down nodes' entries from nodes.conf, but that did not work. I want to use "cluster forget", but I cannot get the node ID. How can I clear the information about these shut-down Redis nodes?
Comment From: antirez
@luweijie007 I think you think that's the case, but I believe you are misreading the situation and there is actually an instance which is not logically part of the cluster, but is trying to connect. Please post the CLUSTER NODES output.
Comment From: luweijie007
@antirez Thanks for your time. Here is the CLUSTER NODES output, taken twice:
1> 10.15.107.179:7001> cluster nodes
e820337f04fe2146ca02cdc7eec2cc828534a20e 10.15.107.150:7001 handshake - 1452240292359 0 0 disconnected
ac640f7ee5977e5034d8cbc99ec941f6dfed32f2 10.15.107.149:7000 master - 0 1452240292662 73 connected 5461-10922
86f02180767de571744d99114634f241459d531e 10.15.107.179:7001 myself,slave e89c21ed2ba1666b8727260306f38bb9c2d79d84 0 0 67 connected
e89c21ed2ba1666b8727260306f38bb9c2d79d84 10.15.107.180:7001 master - 0 1452240293164 75 connected 0-5460
7ba14e2e32d0603bc19e901e93c3fe08a2e81a83 10.15.107.179:7000 master - 0 1452240293164 74 connected 10923-16383
987a68009edcc4c1d3a05b704c9757be79456bdb 10.15.107.150:7000 handshake - 1452240291457 0 0 disconnected
8a536d822b5ae1c4344834f326fd3a9d14999193 10.15.107.180:7000 slave ac640f7ee5977e5034d8cbc99ec941f6dfed32f2 0 1452240293667 73 connected
380f31fb0c6b6422b06190b79220b8540b9e29c6 10.15.107.149:7001 slave 7ba14e2e32d0603bc19e901e93c3fe08a2e81a83 0 1452240294169 74 connected
2> 10.15.107.179:7001> cluster nodes
81f1dde416b662c472a1196d956f80fae5896ea3 10.15.107.150:7000 handshake - 1452240336820 0 0 disconnected
ac640f7ee5977e5034d8cbc99ec941f6dfed32f2 10.15.107.149:7000 master - 0 1452240336820 73 connected 5461-10922
86f02180767de571744d99114634f241459d531e 10.15.107.179:7001 myself,slave e89c21ed2ba1666b8727260306f38bb9c2d79d84 0 0 67 connected
e89c21ed2ba1666b8727260306f38bb9c2d79d84 10.15.107.180:7001 master - 0 1452240336821 75 connected 0-5460
7ba14e2e32d0603bc19e901e93c3fe08a2e81a83 10.15.107.179:7000 master - 0 1452240335817 74 connected 10923-16383
8a536d822b5ae1c4344834f326fd3a9d14999193 10.15.107.180:7000 slave ac640f7ee5977e5034d8cbc99ec941f6dfed32f2 0 1452240335317 73 connected
380f31fb0c6b6422b06190b79220b8540b9e29c6 10.15.107.149:7001 slave 7ba14e2e32d0603bc19e901e93c3fe08a2e81a83 0 1452240337322 74 connected
5e24e8ec4abcaa1ec2e962990cf824f158730554 10.15.107.150:7001 handshake - 1452240335016 0 0 disconnected
I have also checked the following: 1> there is no Redis instance running on 10.15.107.150; 2> ports 7000, 17000, 7001 and 17001 are not in use on 10.15.107.150. Thanks for your help.
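Because the node ID of a handshake entry is regenerated each time, it can help to compare snapshots by address instead of by ID. The sketch below does that for the handshake entries from the two outputs above; the function name `handshake_addrs` is hypothetical, and it assumes the address is field 2 and the flags are field 3.

```shell
#!/usr/bin/env bash
# Print the sorted, de-duplicated addresses (field 2) of every entry
# whose flags (field 3) include "handshake".
handshake_addrs() {
    awk '$3 ~ /(^|,)handshake(,|$)/ { print $2 }' | sort -u
}

# Handshake entries copied from the two CLUSTER NODES outputs above.
snapshot1='e820337f04fe2146ca02cdc7eec2cc828534a20e 10.15.107.150:7001 handshake - 1452240292359 0 0 disconnected
987a68009edcc4c1d3a05b704c9757be79456bdb 10.15.107.150:7000 handshake - 1452240291457 0 0 disconnected'
snapshot2='81f1dde416b662c472a1196d956f80fae5896ea3 10.15.107.150:7000 handshake - 1452240336820 0 0 disconnected
5e24e8ec4abcaa1ec2e962990cf824f158730554 10.15.107.150:7001 handshake - 1452240335016 0 0 disconnected'

addrs1=$(printf '%s\n' "$snapshot1" | handshake_addrs)
addrs2=$(printf '%s\n' "$snapshot2" | handshake_addrs)

if [ "$addrs1" = "$addrs2" ]; then
    echo "same handshake addresses in both snapshots:"
    echo "$addrs1"
fi
```

Both snapshots list the same two addresses on 10.15.107.150 even though all four node IDs differ, which is why grepping for the handshake flag is more reliable here than tracking an ID.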
Comment From: antirez
Hello, unfortunately what I see is that the handshake node was created with a timestamp which is just a few minutes before you posted this.
There are only two ways this can happen: 1. You failed to send CLUSTER FORGET to all the nodes in the cluster, so there are nodes that still have a clue about this other node and will inform the other nodes via gossip. Make sure to send CLUSTER FORGET to every single node in the cluster. 2. Alternatively, there is an instance running on 10.15.107.150, but you said there is not.
Maybe it's "1"?
Full doc for CLUSTER FORGET can be found here: http://redis.io/commands/cluster-forget
Comment From: luweijie007
@antirez You are right, it was point 1. I finally deleted the shut-down nodes by sending CLUSTER FORGET to every single node in the cluster through a shell script. I'm posting the script here in case someone needs it:
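echo "usage: host port"
# Collect the addresses of all nodes that are not in handshake state.
nodes_addrs=$(redis-cli -h $1 -p $2 cluster nodes|grep -v handshake| awk '{print $2}')
for addr in ${nodes_addrs[@]}
do
    # The two expansions below are missing the "*" wildcard; a later
    # comment in this thread gives the working forms ${addr%:*} and ${addr#*:}.
    host=${addr%:}
    port=${addr#:}
    # Ask this node which entries it still lists as handshake or failed.
    del_nodeids=$(redis-cli -h $host -p $port cluster nodes|grep -E 'handshake|fail'| awk '{print $1}')
    for nodeid in ${del_nodeids[@]}
    do
        echo $host $port $nodeid
        redis-cli -h $host -p $port cluster forget $nodeid
    done
done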
thanks again for your help
Comment From: antirez
Glad you solved it! Cheers.
Comment From: kinghrothgar
So that script almost worked for me. The POSIX splitting of the host and port didn't work in your version for me though. Here's what I used:
#echo "usage: host port"
nodes_addrs=$(redis-cli -h $1 -p $2 cluster nodes|grep -v handshake| awk '{print $2}')
echo $nodes_addrs
for addr in ${nodes_addrs[@]}; do
    host=${addr%:*}
    port=${addr#*:}
    del_nodeids=$(redis-cli -h $host -p $port cluster nodes|grep -E 'handshake|fail'| awk '{print $1}')
    for nodeid in ${del_nodeids[@]}; do
        echo $host $port $nodeid
        redis-cli -h $host -p $port cluster forget $nodeid
    done
done
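One caveat with the host/port splitting: on newer Redis versions (4.0 and later, as far as I know) the address field of CLUSTER NODES looks like `ip:port@cport`, optionally followed by `,hostname`, so `${addr#*:}` would leave the `@cport` part attached to the port. A minimal sketch of a splitter that tolerates both formats; the function name `split_addr` is hypothetical:

```shell
#!/usr/bin/env bash
# Split a CLUSTER NODES address field into "host port".
# Accepts both "ip:port" and "ip:port@cport[,hostname]".
split_addr() {
    local addr=${1%%@*}   # drop "@cport" and anything after it, if present
    # Host is everything before the last colon, port is everything after it.
    printf '%s %s\n' "${addr%:*}" "${addr##*:}"
}
```

Inside the loop this could be used as `read -r host port < <(split_addr "$addr")` in place of the two parameter expansions.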
Comment From: carlvine500
Nice script! It resolved my problem.
Comment From: haorenfsa
I know it's closed, but I must add this comment, which I believe describes the usual case:
If you search for the CLUSTER nodes blacklist in the Redis code, you will find that a node 'N' that has been forgotten by one node will be re-added if any of the other nodes fails to forget node 'N' within 1 minute.
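In practice this means the whole CLUSTER FORGET sweep should succeed on every node inside that one-minute window. The sketch below times the sweep and checks each reply; the names `forget_everywhere`, `REDIS_CLI` and `BAN_WINDOW` are made up for illustration, and the 60-second value is taken from the one-minute figure above.

```shell
#!/usr/bin/env bash
# Assumption: REDIS_CLI may be overridden (for example for a dry run);
# it defaults to the real redis-cli.
REDIS_CLI=${REDIS_CLI:-redis-cli}
BAN_WINDOW=60   # seconds a forgotten node stays banned on each node

# forget_everywhere NODE_ID HOST:PORT...
# Sends CLUSTER FORGET for NODE_ID to every listed node and returns
# non-zero if any node did not reply OK or the sweep took too long.
forget_everywhere() {
    local nodeid=$1 start=$SECONDS addr reply failed=0
    shift
    for addr in "$@"; do
        reply=$("$REDIS_CLI" -h "${addr%:*}" -p "${addr##*:}" cluster forget "$nodeid") || true
        if [ "$reply" != "OK" ]; then
            echo "forget failed on $addr: $reply" >&2
            failed=1
        fi
    done
    if (( SECONDS - start >= BAN_WINDOW )); then
        echo "sweep took ${BAN_WINDOW}s or longer; run it again" >&2
        failed=1
    fi
    return "$failed"
}
```

A call such as `forget_everywhere "$nodeid" 10.15.107.149:7000 10.15.107.179:7000` that returns non-zero is a signal to repeat the sweep, since a single node that missed the command can gossip the entry back once the ban expires.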
Comment From: bitsnacker
This is closed. But, if you are still facing this in 2024, I wrote a piece around the fix.
https://bitsnacker.com/posts/redis-remove-a-node-in-handshake-state/