Redis [BUG] Nodes are too slow to join the cluster

Describe the bug

Hi, I work on Tendis. While porting commit redis#7330 ("fix clusters mixing accidentally by gossip") into Tendis, I found that it may make a new node take longer to become known to every node.

That commit solves the problem of two different clusters getting mixed together. The principle is that, when a node processes the gossip section of a PING packet and finds an entry for a node it does not recognize, it only adds that node to its own node list, without starting a handshake.

Suppose node A sends a MEET message to the current node B, and the message contains gossip about a node C that B does not recognize.

Before the commit, B sends a handshake (which is actually a MEET-type message) directly to C. C receives the MEET from B and quickly learns about B.

After the commit, B adds C to its local node list and sends C a PING at the next scheduled opportunity. C receives the PING from B but ignores it because it does not know B, so it cannot add B immediately. C then has to wait until node A sends it a PING that carries node B in the gossip section before it can add B locally. Whether a given PING from A carries B is also random (each PING carries one-tenth of the nodes, at least 3), so several rounds of PING may pass before B is added.

From this analysis, the more nodes the cluster has, the more rounds of PING may be needed before all nodes hold consistent cluster information, and the longer it takes.
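The before/after behaviour described above can be sketched as a small decision function. This is a simplified model of the description in this issue, not Redis source code: the enum and both function names are hypothetical, and "sender accepted" stands for "the sender is already known, or the packet is a MEET".

```c
#include <assert.h>
#include <stdbool.h>

/* What a receiver does with a gossip entry for a node it does not know.
 * All names here are hypothetical and only model the text of this issue. */
typedef enum {
    GOSSIP_IGNORE,          /* sender not accepted: the gossip entry is skipped */
    GOSSIP_START_HANDSHAKE, /* before the commit: contact the node with a MEET-type handshake */
    GOSSIP_ADD_ONLY         /* after the commit: add it to the node list, PING it later */
} gossip_action;

gossip_action on_unknown_gossip_node(bool sender_accepted, bool after_commit) {
    if (!sender_accepted) return GOSSIP_IGNORE;
    return after_commit ? GOSSIP_ADD_ONLY : GOSSIP_START_HANDSHAKE;
}

/* Does the contacted node learn about us right away?
 * A MEET is accepted from anyone; a PING from a stranger is ignored. */
bool target_learns_sender_immediately(gossip_action action) {
    return action == GOSSIP_START_HANDSHAKE;
}
```

In this model the slowdown comes from the second function: once the action changes from a handshake to "add only", the contacted node no longer learns about the sender from the first packet it receives.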

I would like to know if you addressed the issue and if so how!

Comment From: ShooterIT

At this time, it is necessary to wait until node A sends a ping message to node B with node B in

don't follow you from here.

For https://github.com/redis/redis/pull/7330, I think it does this: if a PING message comes from a trusted node, we only add the nodes listed in the message to our trusted nodes instead of trying to MEET them. Of course, if the PING message comes from an unknown node, we just ignore it.

For gossip, it generally takes more time for all nodes to reach consensus when there are many more nodes.

Comment From: SamuelSze1

After the commit, nodes do not accept cluster messages from unrecognized nodes. Node C knows node A but not node B, so node C accepts PING data from node A but ignores the PING from node B. Node C can only get node B's info from a PING sent by node A, and only then adds node B to its node list. This takes much longer to reach consensus than before. Moreover, whether a PING from node A carries node B is random (1/10 of the nodes, at least 3), so it may take multiple rounds of PING before node B is added. A round of PING takes 7.5s, so the cost goes from the millisecond level to the second level. I think this is too long for a node to join the cluster.

Comment From: ShooterIT

cost 7.5s

How big is your cluster? and how did you evaluate this time?

Comment From: SamuelSze1

7.5s is the default interval for sending a PING (half of cluster_node_timeout, which defaults to 15000ms). In practice every node also sends PINGs to random nodes, and a PING does not always carry the new node (1/10 of the nodes, at least 3, chosen at random), so 7.5s is not exactly right. The correct time is hard to calculate, so we use 7.5s.
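As a rough way to reason about the "1/10 of the nodes, at least 3" rule, here is a minimal sketch. It assumes the gossip entries are drawn uniformly, without repeats, from every node except the sender and the receiver, and that the count is capped by the number of such candidates; both are assumptions of this model. The function names are hypothetical.

```c
#include <assert.h>

/* Gossip entries carried by one PING in a cluster of `nodes` nodes:
 * one tenth of the nodes, at least 3, capped by the number of candidates
 * (assumed here to be every node except sender and receiver). */
int gossip_entries(int nodes) {
    assert(nodes >= 3);
    int candidates = nodes - 2;
    int wanted = nodes / 10;
    if (wanted < 3) wanted = 3;
    if (wanted > candidates) wanted = candidates;
    return wanted;
}

/* Chance that one PING mentions one specific node, under the
 * uniform-sampling assumption stated above. */
double chance_ping_mentions(int nodes) {
    return (double)gossip_entries(nodes) / (double)(nodes - 2);
}

/* Mean number of PINGs until that node is mentioned, treating PINGs as
 * independent trials (mean of a geometric distribution). */
double expected_pings_until_mentioned(int nodes) {
    return 1.0 / chance_ping_mentions(nodes);
}
```

Under these assumptions a 12-node cluster gives 3 gossip entries out of 10 candidates per PING, and the chance falls to 10 out of 98 for a 100-node cluster, which matches the observation that larger clusters need more rounds.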

In the two graphs below, a new node joins a cluster of three nodes, A, B and C, and the new node sends MEET to node C.

Before the commit, after the new node completes the handshake with node C, it gets the info for nodes A and B from the PONG or MEET from node C, adds them to its node list, and sends them MEET messages. This takes about 100ms (see clusterCron(), clusterNodeCronHandleReconnect(), clusterLinkConnectHandler()). When the new node sends MEET to nodes A and B, they start a handshake, which brings A and B to consensus and brings more node info to the new node, starting more handshakes. The new node's info spreads very fast, and all nodes can reach consensus within a few hundred ms.

After the commit, the new node does not send MEET to nodes A and B after it gets their info from node C; it sends a PING instead. Because the new node is not yet in the node lists of A and B, they do not process the gossip section when they get the PING from the new node, and they do not add the new node to their node lists. The only way for A and B to learn about the new node is a PING from a trusted node (node C). A PING is sent once every 7.5s (plus a PING to some random nodes every 1s), so this takes much longer, and a PING may not carry the new node's info. If the cluster is very large, it takes many more PINGs, because each PING has a lower chance of carrying the new node.

There is a mistake in this diagram: after the commit, node C should send PING to the new node, not MEET.

Here are some tests I ran myself: I added nodes to the cluster one by one and measured the time for every node to get the new node's info.

Before the commit: [image]
After the commit: [image]

There is a big difference after the commit. For a 12-server cluster, the time for MEET to finish increased from 600ms to 10s.

Comment From: ShooterIT

Thank you @SamuelSze1. It seems Redis has not resolved this issue. If you have an idea, you are welcome to share it. For now, maybe you can send a MEET message to every node of the cluster when adding a new node.

Comment From: SamuelSze1

Thank you for listening. In my opinion, the main problem is that a node will not accept a PING from a stranger even when the sender has already finished a handshake with another node in the cluster. So I think Redis could accept a PING from a stranger if the PING carries info about a node that is already known. In gossip, if node A sends a PING with node B's info, node A must have finished a handshake with node B, so node A should join node B's cluster and eventually be known by every node in it. So I think accepting a PING that carries known node info is OK. This should solve the problem in this issue while keeping the fix from https://github.com/redis/redis/pull/7330, without changing too much. But this is just a simple idea; it could have many potential problems and may not work in practice. I hope you can find a better solution that we can easily follow. Thank you so much.
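A minimal sketch of the acceptance rule proposed above, assuming node names are plain NUL-terminated strings and the known-node list is a simple array. The function names are hypothetical and this is not Redis code; it only illustrates "accept a PING from a stranger if its gossip section mentions a node we already know".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Linear search of the local node list (hypothetical helper). */
static bool is_known(const char *const *known, size_t n_known, const char *name) {
    for (size_t i = 0; i < n_known; i++)
        if (strcmp(known[i], name) == 0) return true;
    return false;
}

/* Accept the PING if the sender is known, or if at least one node in its
 * gossip section is known to us. */
bool accept_ping(const char *const *known, size_t n_known,
                 const char *sender,
                 const char *const *gossip, size_t n_gossip) {
    if (is_known(known, n_known, sender)) return true;
    for (size_t i = 0; i < n_gossip; i++)
        if (is_known(known, n_known, gossip[i])) return true;
    return false;
}
```

For example, with a known list of "node-b" and "node-c", a PING from the unknown "node-a" is accepted when its gossip section lists "node-b" and rejected when it lists only unknown names.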

Comment From: takenliu

I have an idea. Could you please tell me whether it is feasible? @ShooterIT

Here it is mentioned that if A knows B and B knows C, then A will establish a connection with C. That is to say, A should trust C, but A should trust C (node ID), not C (ip+port). When A establishes a connection with C, it should verify C's node ID. Therefore, when A sends a MEET packet to C, it includes C's node ID. However, when the administrator sends the "cluster meet C" command to A, the MEET packet that A sends to C should not include C's node ID. When C receives the MEET packet, if it contains a node ID, C compares it with its own node ID and only accepts the message if they are the same. If it does not contain a node ID, there is no need for comparison, and the message is accepted directly.

Specific modifications are as follows:

struct _clusterNode {
    char name[CLUSTER_NAMELEN]; 
    char report_name[CLUSTER_NAMELEN];  // Add a report_name to save the nodename carried by the ping packet, which is used when sending the meet packet.
};

typedef struct {
    uint16_t ver;  // Change ver to 2.
    char sender[CLUSTER_NAMELEN];
    char receiver[CLUSTER_NAMELEN]; // Add a receiver's nodename for the receiver to verify.
} clusterMsg;
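The receiver-side check this proposal implies could look roughly like the sketch below. It assumes that an all-zero receiver field means "no node ID supplied" (the proposal does not say how an absent ID is encoded) and uses a local NAMELEN constant as a stand-in for CLUSTER_NAMELEN. The function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NAMELEN 40 /* stand-in for CLUSTER_NAMELEN */

/* Decide whether a MEET packet should be accepted.
 * Assumption: an all-zero `receiver` field means the sender did not supply
 * a node ID (for example a MEET triggered by the administrator). */
bool meet_is_acceptable(const char myself[NAMELEN], const char receiver[NAMELEN]) {
    static const char empty[NAMELEN] = {0};
    if (memcmp(receiver, empty, NAMELEN) == 0) return true; /* nothing to verify */
    return memcmp(receiver, myself, NAMELEN) == 0;          /* IDs must match */
}
```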

Comment From: ShooterIT

In gossip, if node A send a ping msg with node B info, that means node A must finish handshake with node B, node A should join the cluster of node B and finally known by every node in cluster.

@SamuelSze1 interesting idea. You mean that if A knows B and C knows B, then A can ping C, right? That seems right: when A pings C, A must have gotten C's info from other nodes. That may accelerate consistency, since the two nodes do not need to know each other's info. But I am not sure whether there are potential issues.

Comment From: ShooterIT

@takenliu thank you. Changing the gossip protocol version is a big topic because it breaks compatibility; currently we don't tend to do that.

Comment From: SamuelSze1

Yes, this solution may have potential issues, and it cannot completely solve the problem. @takenliu's method could solve the problem completely but needs a protocol change. I think I have found a better way to solve this. The main problem now is that we need to verify whether the node belongs to the same cluster. @takenliu's method sends the receiver's node name so the receiver can verify both node names. But I think verifying the node name does not require a protocol change.

Currently, a node sends a PING when it learns about a new node. The receiver does not process that PING if it does not know the sender, but it still returns a PONG to the sender, and the PONG header must carry the receiver's node name. So the sender can check the receiver's node name stored locally against the receiver's node name in the PONG.

Therefore, we could add a PREHANDSHAKE state for verifying the node name through this PING/PONG exchange. If the sender finds that the two node names match, it can send MEET to the receiver and start the handshake as before. Here is a draft of the idea:

1. A gets node B's info, stores it locally, and sets B to the PREHANDSHAKE state.
2. A sends a PING to node B.
3. B does not process this PING, but still returns a PONG to A.
4. B puts its node name in the header of the PONG; A uses it to check whether its local node name for B is correct.
5. If it is correct, A clears B's PREHANDSHAKE state, renames node B to a random ID, and sets it to the HANDSHAKE and MEET states.
6. A sends MEET to B and starts a handshake, same as before.

This approach is also compatible with previous versions and needs no protocol change.
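The PREHANDSHAKE draft can be sketched as a tiny state machine on A's side. Everything here is hypothetical: the type and function names, the NAMELEN stand-in for CLUSTER_NAMELEN, and in particular the NODE_REJECTED state, since the draft does not say what happens when the names differ. The rename to a random ID in step 5 is left out to keep the sketch short.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NAMELEN 40 /* stand-in for CLUSTER_NAMELEN */

typedef enum { NODE_PREHANDSHAKE, NODE_HANDSHAKE, NODE_REJECTED } node_state;

typedef struct {
    char name[NAMELEN]; /* node name learned from gossip */
    node_state state;
    bool send_meet;     /* true once a MEET should be sent (step 6) */
} peer;

/* Step 1: store the gossiped node and mark it PREHANDSHAKE. */
void peer_init_from_gossip(peer *p, const char name[NAMELEN]) {
    memcpy(p->name, name, NAMELEN);
    p->state = NODE_PREHANDSHAKE;
    p->send_meet = false;
}

/* Steps 4-5: compare the name in the PONG header with the local one. */
void peer_on_pong(peer *p, const char pong_sender[NAMELEN]) {
    if (p->state != NODE_PREHANDSHAKE) return;
    if (memcmp(p->name, pong_sender, NAMELEN) == 0) {
        p->state = NODE_HANDSHAKE;
        p->send_meet = true;
    } else {
        p->state = NODE_REJECTED; /* assumption: mismatches are dropped */
    }
}
```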