Currently, Cluster clients can use either the older CLUSTER NODES command or the newer CLUSTER SLOTS command.

CLUSTER NODES is difficult to parse and extend, because it produces a textual, line-based reply rather than a RESP-friendly one. It also mixes topology information with administrative state information that is less relevant to clients.

CLUSTER SLOTS addresses these shortcomings and produces a native RESP reply. However, it groups node information per hash slot range, so in a fragmented cluster it is very inefficient.

The goal is to come up with a new command (or an enhancement to CLUSTER SLOTS, if possible) that is easy to parse, extensible, and efficient.

Comment From: yossigo

@madolson I'm putting this here as a placeholder, please feel free to add any thoughts you already have on this subject.

Comment From: hpatro

I'm taking a look at this.

Comment From: madolson

@yossigo My expectation is that the command would look something like:

> CLUSTER SHARDS (or TOPOLOGY)
Returns:
Array of shards:
    Map of shard attributes:
    Slots -> Even-length array of start/stop pairs (or an empty array if there are none)
        Nodes -> Array of nodes:
            Primary
            Replica 1
            Replica 2

Each node will have the following information, provided as a map rather than positionally, as slots are:

* id -> Node ID
* port -> Client port (should we handle TLS ports here?)
* endpoint -> Either IP, hostname, or NULL
* ip -> IP address or announced IP address
* hostname (if available) -> Announced hostname
* replication-offset (if a replica) -> The replication offset from the primary
* status -> One of ONLINE, LOADING, PENDING_FAIL, FAIL, NO_SLOTS

Example with 3 nodes: 2 of them are in the same shard, and one is by itself without slots.

1) 1) "slots"
   2) 1) (integer) 0
      2) (integer) 5460
   3) "nodes"
   4) 1) 1) "id"
         2) "09dbe9720cda62f7865eabc5fd8857c5d2678366"
         3) "port"
         4) (integer) 6379
         5) "endpoint"
         6) "host-1.redis.example.com"
         7) "ip"
         8) "127.0.0.1"
         9) "hostname"
         10) "host-1.redis.example.com"
         11) "status"
         12) "ONLINE"
      2) 1) "id"
         2) "821d8ca00d7ccf931ed3ffc7e3db0599d2271abf"
         3) "port"
         4) (integer) 6379
         5) "endpoint"
         6) "host-2.redis.example.com"
         7) "ip"
         8) "127.0.0.1"
         9) "hostname"
         10) "host-2.redis.example.com"
         11) "replication-offset"
         12) (integer) 14000
         13) "status"
         14) "ONLINE"
2) 1) "slots"
   2) (empty array)
   3) "nodes"
   4) 1) 1) "id"
         2) "044ec91f325b7595e76dbcb18cc688b6a5b434a1"
         3) "port"
         4) (integer) 6379
         5) "endpoint"
         6) "host-1.redis.example.com"
         7) "ip"
         8) "127.0.0.1"
         9) "hostname"
         10) "host-1.redis.example.com"
         11) "status"
         12) "NO_SLOTS"
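A client would fold each node's flat key/value reply into a map. A minimal parsing sketch in Python, assuming the reply arrives as nested lists (as a RESP2 client library would deliver it) and that the field names match the proposal above:

```python
def pairs_to_dict(flat):
    """Fold a flat [key, value, key, value, ...] reply into a dict."""
    return {flat[i]: flat[i + 1] for i in range(0, len(flat), 2)}


def parse_cluster_shards(reply):
    """Parse a CLUSTER SHARDS-style reply into a list of shard dicts,
    with 'slots' as (start, stop) tuples and 'nodes' as dicts."""
    shards = []
    for raw in reply:
        shard = pairs_to_dict(raw)
        flat_slots = shard.get("slots") or []
        shard["slots"] = [(flat_slots[i], flat_slots[i + 1])
                          for i in range(0, len(flat_slots), 2)]
        shard["nodes"] = [pairs_to_dict(n) for n in shard.get("nodes") or []]
        shards.append(shard)
    return shards
```

For the example reply above, the first shard would parse to `{'slots': [(0, 5460)], 'nodes': [{'id': ..., 'port': 6379, ...}, ...]}`.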

Comment From: yossigo

@madolson This looks good. The map fields can be explicit, so port and tls-port are advertised as-is (i.e. either or both), and clients will figure out which one to use. In this spirit, it may also be a good idea to add an explicit role field per node.

As for the slots, I want to point out the extreme case of thinly interleaved slots (a single slot per range), where this is still not optimal. We could improve by supporting both individual slots and slot ranges. Normally I wouldn't consider that, but since this work is already driven by the need to optimize extreme cases, maybe it makes sense.
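The range encoding and its worst case can be illustrated with a small sketch (pure Python, illustrative only): collapsing a sorted list of owned slots into (start, stop) pairs, which degenerates to one pair per slot when ownership is thinly interleaved.

```python
def compress_slots(slots):
    """Collapse a sorted list of owned slots into (start, stop) ranges.
    In the thinly-interleaved worst case every range holds one slot."""
    ranges = []
    for s in slots:
        if ranges and s == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], s)   # extend the current range
        else:
            ranges.append((s, s))             # open a new range
    return ranges
```

`compress_slots([0, 1, 2, 5460])` yields two ranges, while the interleaved `[0, 2, 4]` still needs three single-slot ranges (two integers each).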

Comment From: PingXie

@madolson, is there a plan to return the primary node's config epoch as well? That will be really useful for anyone who cares about the freshness/staleness of the cluster shard topology.

Piling on the extreme cases, the nodes section can grow quite large too in a large cluster. I was wondering if we could consider supporting a leaner version of "CLUSTER SHARDS" that returns just the IPs and ports? This leaner version could be opt-in via an optional argument to "CLUSTER SHARDS", say "[BASIC]" maybe?

Comment From: hpatro

> @madolson, is there a plan to return the primary node's config epoch as well? That will be really useful for anyone who cares about the freshness/staleness of the cluster shard topology.
>
> Piling on the extreme cases, the nodes section can grow quite large too in a large cluster. I was wondering if we could consider supporting a leaner version of "CLUSTER SHARDS" that returns just the IPs and ports? This leaner version could be opt-in via an optional argument to "CLUSTER SHARDS", say "[BASIC]" maybe?

@PingXie With the basic version, won't it be the same as CLUSTER SLOTS ?

Comment From: PingXie

Conceptually, yes (or close). There are two important differences, though, IMO. The "slot-range" idea (0-10, 20-30, etc., as @yossigo proposed) or the "slot-pair" idea (@madolson's proposal) would produce a more compact output than CLUSTER SLOTS. And the embedded config epoch would allow the shard topology to be "versioned". The end result is, hopefully, a much denser form of "CLUSTER SLOTS" that reduces both network and CPU load on both the producer (Redis) and consumer (application) sides.
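The "versioned topology" idea could be sketched client-side as follows. This is only an illustration: the `config-epoch` field name and the primary-first node ordering are assumptions, not part of the proposal above.

```python
class TopologyCache:
    """Track the last seen config epoch per shard so a client can
    skip reprocessing topology that has not changed (a sketch; the
    'config-epoch' field name and primary-first ordering are assumed)."""

    def __init__(self):
        self.epochs = {}  # primary node id -> last seen config epoch

    def is_stale(self, shard):
        """Return True (and remember the new epoch) if this shard's
        epoch advanced past what we cached."""
        primary_id = shard["nodes"][0]["id"]  # assume primary listed first
        epoch = shard.get("config-epoch", 0)
        if self.epochs.get(primary_id, -1) < epoch:
            self.epochs[primary_id] = epoch
            return True
        return False
```

A client would then refresh its routing table only for shards where `is_stale()` returns True.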

Comment From: madolson

I'm not sure there is much value in two different variants (less and more data). I think a better mechanism for reducing "bytes over the network" would be to figure out how to send only changes to the cluster config, as opposed to resending all of the fields. That is likely more related to https://github.com/redis/redis/issues/10150, though.

Comment From: PingXie

I like the push model idea in general. #10150 talks about the slot migration scenario. I wonder how it could be extended to replica state as well, i.e., replicas joining/leaving the cluster. That would be useful for use cases where replicas need to serve (read-only) traffic.

Comment From: zuiderkwast

I think a range of two integers is not too bad even if all ranges are single-slot ranges. It's only a small constant factor.

If M is the number of slot ranges and N is the number of nodes per shard, then CLUSTER SLOTS is O(M * N) while this new command will be O(M + N). This is the most important difference, I think.

Comment From: dmitrypol

I am thinking of a way to use CLUSTER SHARDS to monitor a Redis cluster instead of a combination of the INFO, NODES and SLOTS commands. When one of the nodes in a specific shard is down, it is important to know about it. What is REALLY important to know is when ALL primary/replica nodes in a specific shard are down. With this command a client can parse each shard's slots/nodes/health looking for failures. It would be nice to have a top-level attribute that tells the client whether the cluster is:

* completely healthy
* partially healthy (one node is down)
* failing (an entire shard is down)

Also, is there a way to get this info via pub/sub similar to Redis Sentinels?

Comment From: madolson

> I am thinking of a way to use CLUSTER SHARDS to monitor the Redis cluster instead of combination of INFO, NODES and SLOTS commands. When one of the nodes in specific shard is down it is important to know about it. What is REALLY important to know is when ALL primary/replica nodes in specific shard are down. With this command client can parse each slots/nodes/health looking for fail. It would be nice to have top level attribute that tells client if cluster is:

Seems like a different use case than what we are trying to solve here.

> Also, is there a way to get this info via pub/sub similar to Redis Sentinels?

We're looking into it: https://github.com/redis/redis/issues/10150

Comment From: dmitrypol

hi @madolson. Per our chat, it would be nice if cluster info had a concept of cluster_health:

* healthy - all nodes in all shards are fine
* partially_healthy - at least one node in each shard is up. Use cluster shards to alarm on which specific node is down (especially after X failures over Y minutes).
* unhealthy - all nodes in at least one shard are down. Use cluster shards and raise a high-severity alarm.
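That classification could be derived client-side from the proposed reply. A sketch in Python, assuming the shard data has already been parsed into dicts with the `nodes` and `status` fields named above, and treating `ONLINE` as the only "up" state:

```python
def cluster_health(shards):
    """Classify overall cluster health from parsed shard data (a sketch;
    'nodes'/'status' follow the proposed reply, 'ONLINE' means up)."""
    worst = "healthy"
    for shard in shards:
        up = sum(1 for node in shard["nodes"] if node.get("status") == "ONLINE")
        if up == 0:
            return "unhealthy"           # an entire shard is down
        if up < len(shard["nodes"]):
            worst = "partially_healthy"  # this shard lost at least one node
    return worst
```

A monitor would call this after each topology poll and escalate only on the `unhealthy` transition.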

Comment From: sjpotter

I'm looking at this now while trying to reimplement it for RedisRaft, and I don't understand the need for both ip as a value and endpoint.

Why can't IP be subsumed into endpoint? Why should we have to specify that it's a specific IP? Should ip be considered optional? (It is not listed as optional in the docs.)

Comment From: madolson

I believe ip should be considered optional based on the design. endpoint is the only field that is strictly required, and it is the recommended field for clients to use when connecting. I think the original thinking was that some clients might want to avoid a DNS resolution by using the IP, if provided, rather than the hostname provided by the endpoint.
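That client-side choice could look like the following sketch, assuming a node dict with the field names from the proposal above (`endpoint`, `ip`, `port`):

```python
def address_for(node, prefer_ip=False):
    """Pick the address a client should dial: 'endpoint' is the
    recommended field; optionally prefer 'ip' when present, e.g. to
    skip a DNS lookup (field names follow the proposal; a sketch)."""
    if prefer_ip and node.get("ip"):
        return node["ip"], node["port"]
    return node["endpoint"], node["port"]
```

So `address_for(node)` returns the hostname-style endpoint, while `address_for(node, prefer_ip=True)` falls back to the endpoint only when no ip was advertised.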