Should we add an additional data structure to reduce clusterGenNodesDescription() complexity?

In our scale-up redis cluster scenario, we parallel migrate slot so quickly (may has few keys) that set slot qps is too high, which leads to high cpu usage(almost to 100%).

And in our another scenario, we has a large redis cluster which has more than 400 nodes. And the cluster nodes commands may cost 10~15ms, it is unimaginable in redis.

Both of two scenario caused by clusterGenNodesDescription function. It has a high complexity(O(num_of_master * 16384)) when calculate node slots info.

Comment From: madolson

Do you mean clusterGenNodeDescription()? (The one without the s) I've thought that it would be possible to cache the value of the description and then invalidate it if anything changes. That should resolve your case where most of the time most of the nodes aren't changing. This would help larger clusters scale better.

Comment From: ShooterIT

Interested in this issue. In our cluster system not redis cluster, i also occurred similar issue, our config server can't afford high QPS to get cluster topology by many proxies at the same time. I use cache to solve this problem. But before solution, we should get some perf info or flame graph by https://github.com/brendangregg/FlameGraph

Comment From: gripleaf

Do you mean clusterGenNodeDescription()? (The one without the s) I've thought that it would be possible to cache the value of the description and then invalidate it if anything changes. That should resolve your case where most of the time most of the nodes aren't changing. This would help larger clusters scale better.

yep, clusterGenNodesDescription always call clusterGenNodeDescription.

cache the value of the description and then invalidate it seems a good idea in some cases, but our proxies would not ask for cluster topology frequency (every 1~2 minutes or detect topology changed). And when cluster topology change faster, the cache may not work well.

Below is the perf of a master node in migrate scenario(migrate from a 37/37 to another 37/37 without user request). Redis [NEW] clusterGenNodesDescription can run faster?

here is svg attachment: cluster-37-37.svg.tar.gz

Comment From: gripleaf

actually i create a data structure rangelist to record node slot info, like

/**
 * records a range [left, right] and the next link node
 **/
typedef struct rangelistNode {
    struct rangelistNode* next;
    int left;
    int right;
} rangelistNode;
/**
 * a sorted link list, every node in this list satisfy follow condition:
 *   - node->right < node->next->left
 *   - node->left >= node->right
 * ps. here use AVL would have a better performance
 **/
typedef struct rangelist { 
    rangelistNode* head;
} rangelist;

when use this data sturcture, the complexity of clusterNodeClearSlotBit and clusterNodeSetSlotBit may increase from O(1) to O(16384)(worst).

To verify my idea, I test it in the same migrate scenario like above. Bellow is the perf: Redis [NEW] clusterGenNodesDescription can run faster?

here is svg attachment: cluster-37-37-opt.svg.tar.gz

Comment From: ShooterIT

From your first graph, as we expect, clusterNodeSetSlotBit will be called "nodes number * 16384", if we have many modes, cluster nodes commands will be costly. Maybe when need to generate all nodes description, we can only generate slots topology once instead of generating for every nodes, so complexity will be from O(16384*nodes) to O(16384) .

For topology cache, and just identify the cache invalid when topology is changed, and only update when we want to get. Of course, we should verify it is worth to do.

Comment From: gripleaf

generate slot topology in advance is a good idea. code complexity is simplier than data structure.

below is my implemenation about generate slot topology before call clusterGenNodeDescription

sds clusterGenNodesDescription(int filter) {
    long long start = ustime();
    sds ci = sdsempty(), ni;
    dictIterator *di;
    dictEntry *de;

    // insert code begin
    clusterNode* n = NULL;
    int begin = -1;
    for (int i = 0; i <= CLUSTER_SLOTS; ++i) {
        if (n == NULL) {
            if (i == CLUSTER_SLOTS) continue;
            n = server.cluster->slots[i];
            begin = i;
            continue;
        }
        if (i == CLUSTER_SLOTS || server.cluster->slots[i] == NULL || n != server.cluster->slots[i]) {
            if (begin == i - 1) {
                n->slot_info = sdscatprintf(n->slot_info, " %d", begin);
            } else {
                n->slot_info = sdscatprintf(n->slot_info, " %d-%d", begin, i-1);
            }
            n = server.cluster->slots[i];
            begin = i;
        }
    }
    // insert code end

    di = dictGetSafeIterator(server.cluster->nodes);
    while((de = dictNext(di)) != NULL) {
        clusterNode *node = dictGetVal(de);
        // ...
    }
    // ...
    return ci;
}

And, i also test it in same scenario. In the test, compare to the data structure idea, generate slots in advance cost much time on clusterGenNodesDescription, but less time on clusterCommand(which is set slot command).
Bellow is perf: Redis [NEW] clusterGenNodesDescription can run faster?

here is svg attachment: cluster-37-37-opt-genslot.svg.tar.gz

Comment From: ShooterIT

Looks good @gripleaf FYI, from your test, we can notice thant sdscatprintf is costly. Actually sdscatfmt is better than sdscatprintf, you can try to replace sdscatprintf with sdscatfmt, also please notice that format specifiers of sdscatfmt are different from sdscatprintf

Comment From: gripleaf

Cool! After replace sdscatprintf with sdscatfmt, the clusterGenNodesDescription has a better performance. Bellow is perf: Redis [NEW] clusterGenNodesDescription can run faster?

Comment From: ShooterIT

Hi @gripleaf I think we can make a PR to optimize this problem. You can do it if you want. If you hope I make a PR, I will do it.

Comment From: gripleaf

I would like to make a PR, but for some reason i can't do it. I hope you make a PR please. Thank you so much.