The problem/use-case that the feature addresses
When a new node joins the cluster bus and becomes a replica of a primary, the cluster bus immediately advertises the new node to clients as a replica of that primary. If a client queries the new node before it has caught up on replication, it gets either an empty result or, while the new node is still loading data, a rejection.
Description of the feature
The server should not advertise the new replica's address to clients until the replica has caught up on replication.
Alternatives you've considered
The client should ping the new node to confirm it is healthy before it starts sending traffic to it.
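A minimal sketch of what such a client-side check could look like, assuming the client inspects the replica's INFO output. The helper names `infoFieldEquals` and `replicaLooksReady` are hypothetical, and the choice of fields (`role`, `loading`, `master_link_status`, `master_sync_in_progress`) is an assumption about what "healthy" should mean here, not part of the proposal:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if `info` contains a line that is exactly "key:value".
 * INFO output uses CRLF line endings; a bare LF or end of string is
 * accepted as a line terminator too. */
static bool infoFieldEquals(const char *info, const char *key, const char *value) {
    size_t klen = strlen(key), vlen = strlen(value);
    const char *line = info;
    while (line != NULL && *line != '\0') {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':' &&
            strncmp(line + klen + 1, value, vlen) == 0) {
            char end = line[klen + 1 + vlen];
            if (end == '\r' || end == '\n' || end == '\0') return true;
        }
        const char *nl = strchr(line, '\n');
        line = nl ? nl + 1 : NULL;
    }
    return false;
}

/* Hypothetical gate: treat a replica as ready for reads only when it is
 * not loading, its link to the primary is up, and no sync is running. */
static bool replicaLooksReady(const char *info) {
    return infoFieldEquals(info, "role", "slave") &&
           infoFieldEquals(info, "loading", "0") &&
           infoFieldEquals(info, "master_link_status", "up") &&
           infoFieldEquals(info, "master_sync_in_progress", "0");
}
```

A client would pass the raw INFO reply text to `replicaLooksReady()` and only add the node to its read pool once it returns true. The drawback is that every client library has to implement this check itself.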
Additional information
N/A
Comment From: madolson
We talked about this a couple of months ago, and there was some weak consensus that we could add a new field to the CLUSTER SLOTS/CLUSTER NODES output indicating whether the node is still loading or is ready to take traffic. We would need to gossip this new information.
Comment From: eladbern
One possible solution for this without changing cluster bus is to hide empty replicas from CLUSTER SLOTS command. Empty means having a reported replication offset of zero. This relies on the replication offset reported by nodes in cluster PING/PONG messages. Hiding empty replicas from CLUSTER SLOTS (while still showing them in CLUSTER NODES) is consistent with hiding failed nodes since from a data-path POV, an empty replica is not very useful.
Suggest something like:
/* Returns an indication of whether the replica node is fully available
 * and should be listed in the CLUSTER SLOTS response.
 * Returns C_OK for available nodes, C_ERR for nodes that have
 * not finished their initial sync, are in a failed state, or are
 * otherwise considered unavailable to serve read commands. */
static inline int isReplicaAvailable(clusterNode *node) {
    if (nodeFailed(node)) {
        return C_ERR;
    }
    long long repl_offset = node->repl_offset;
    if (node->flags & CLUSTER_NODE_MYSELF) {
        /* Nodes do not update their own information
         * in the cluster node list. */
        repl_offset = replicationGetSlaveOffset();
    }
    if (repl_offset == 0) {
        /* Offset zero means the replica has not replicated anything yet. */
        return C_ERR;
    }
    return C_OK;
}
Then use that when generating the cluster slots response:
void clusterReplyMultiBulkSlots(client *c) {
    ...
    ...
    ...
    /* Remaining nodes in reply are replicas for slot range */
    for (i = 0; i < node->numslaves; i++) {
        /* This loop is copy/pasted from clusterGenNodeDescription()
         * with modifications for per-slot node aggregation */
        if (isReplicaAvailable(node->slaves[i]) != C_OK) continue;
        addReplyArrayLen(c, 3);
        addReplyBulkCString(c, node->slaves[i]->ip);
        addReplyLongLong(c, node->slaves[i]->port);
        addReplyBulkCBuffer(c, node->slaves[i]->name, CLUSTER_NAMELEN);
    }
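To try the filtering rule outside the server, here is a standalone sketch using simplified stand-in types. `demoNode`, `DEMO_NODE_FAIL`, `demoReplicaAvailable` and `demoFilterReplicas` are hypothetical names for illustration only and are not the real cluster structures:

```c
#include <stddef.h>

/* Simplified stand-ins for the server's types; not the real definitions. */
#define DEMO_NODE_FAIL (1 << 0)

typedef struct demoNode {
    int flags;             /* DEMO_NODE_FAIL when the node is marked failed */
    long long repl_offset; /* replication offset as last reported via gossip */
} demoNode;

/* Same rule as proposed above: a replica is hidden when it is failed
 * or when its reported replication offset is still zero. */
static int demoReplicaAvailable(const demoNode *n) {
    if (n->flags & DEMO_NODE_FAIL) return 0;
    return n->repl_offset != 0;
}

/* Stores pointers to the available replicas in `out` (which must have
 * room for `n` entries) and returns how many were kept. */
static size_t demoFilterReplicas(const demoNode *replicas, size_t n,
                                 const demoNode **out) {
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (demoReplicaAvailable(&replicas[i])) out[kept++] = &replicas[i];
    }
    return kept;
}
```

Given three replicas where one is empty, one is failed and one has a nonzero offset, only the last is kept. The returned count matters because a reply that declares its own length has to count only the replicas actually emitted.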