Describe the bug In a Redis cluster, the traffic delivered to each node during pub/sub exceeds 10x the published data.

Additional information Linux, Redis 7.0.10

This is being reported as a possible bug, found while developing against Redis Cluster PUB/SUB.

I published 10 kB of data across two cluster nodes (M/M or M/S). Nothing was subscribed, to keep the reproduction pure.

What I expected: 10 kB of data is delivered to each node other than the one where the PUBLISH was issued, and if a client were subscribed, it would receive 10 kB of data.

What actually happens: a subscribed client receives exactly 10 kB, but the cluster nodes exchange about 100 kB of data. (Checked with multiple network monitoring tools.)

How is pub/sub structured between cluster nodes? This doesn't seem to make sense to me.

In another test I published 3,000 kB of data per second and confirmed that about 30,000 kB/s is exchanged on each cluster node. If the cluster nodes are on the same server, the loopback interface alone carries 30,000 kB/s; if they are on different machines, TX and RX each carry 30,000 kB/s.

This seems like a serious problem, but is it really intended behavior? Since it gets in the way of my own development, I worked around it by writing a Redis module that uses a plain TCP stream instead of PUB/SUB, and in that case I confirmed that exactly 1x the data was TX/RX.

If you know anything about this, please comment.

Comment From: snz2

I found out that it is related to gossip traffic. Is gossip traffic supposed to be this heavy? If it's not a bug, why is it like this?

Comment From: vitarb

This comment might be relevant:

 * For now we do very little, just propagating [S]PUBLISH messages across the whole
 * cluster. In the future we'll try to get smarter and avoiding propagating those
 * messages to hosts without receives for a given channel.
 * Otherwise:
 * Publish this message across the slot (primary/replica).

https://github.com/redis/redis/blob/unstable/src/cluster.c#L3605-L3615

It looks like all publish messages are broadcasted to the entire cluster (using gossip). With that in mind, if you are using non-sharded pubsub, your PUBLISH traffic would be sent to all nodes in the cluster, even if there are no clients connected to those nodes listening for the updates.
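To put rough numbers on that, here is a back-of-the-envelope sketch; the fixed header size below is an assumption for illustration, not a value taken from cluster.h.

```c
/* Rough model of cluster-bus traffic for one non-sharded PUBLISH:
 * the message is wrapped in a cluster-bus envelope and sent to every
 * peer node, whether or not anyone is subscribed there. */
#include <stdio.h>

#define ASSUMED_BUS_HEADER 2048 /* assumption; see clusterMsg in cluster.h for the real size */

static size_t bus_bytes_per_publish(size_t channel_len, size_t payload_len,
                                    size_t cluster_nodes) {
    /* One cluster-bus message per peer node. */
    return (ASSUMED_BUS_HEADER + channel_len + payload_len) * (cluster_nodes - 1);
}

int main(void) {
    /* Same 10 kB payload, growing cluster: bus traffic scales with node count. */
    for (size_t n = 2; n <= 10; n += 4)
        printf("%zu nodes: %zu bytes on the bus per PUBLISH\n",
               n, bus_bytes_per_publish(4, 10 * 1024, n));
    return 0;
}
```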

Comment From: snz2

@vitarb

All right, I understand that the logic is simple for now. However, I did some actual measurements: the traffic is not (size of the original data * number of cluster nodes), it is closer to (original data * number of cluster nodes * 10).

This is the result of comparing the traffic against SET. Data: ['hmmm', 'hmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm'], same server, master on 6379 / replica on 6380.

set: 100,000 requests / 16 sec = 6,250 tps (this loopback traffic includes the replication data sent to the replica)

pub: 100,000 requests / 16 sec = 6,250 tps, with no subscribers

pub: 100,000 requests / 16 sec = 6,250 tps, master only (replica shut down), with no subscribers

Are you sure this isn't a problem?

Of course, if you build the cluster across separate servers and test it, the same amount of traffic appears as Ethernet RX/TX instead of loopback.

Comment From: hpatro

@snz2 Are we mixing up the traffic from the regular PING/PONG messages going around the cluster between the nodes via the cluster bus? Could you also mention the number of primaries and replicas in your setup?

The PUBLISH command is broadcast once to every connected node in the cluster (irrespective of subscriptions). The SPUBLISH command is broadcast to all the primaries/replicas for a given slot (irrespective of subscriptions).

Comment From: snz2

@hpatro I'm not convinced that the PING/PONG messages generate this much traffic, and my setup is mentioned in the report: one master and one replica, that's all.

Comment From: hpatro

@snz2 As the pub/sub message is transferred via the cluster bus to the other node(s), it carries a much higher metadata payload (overhead) than data transferred via the replication stream.

If the data payload size is considerably small compared to the metadata payload, the overhead seems really expensive. Sharded Pub/Sub (spublish/ssubscribe) would also have the same performance for a single shard setup (one primary/replica(s)).
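As a rough worked example, assuming a fixed metadata header on the order of 2 kB (an assumption; the real figure is whatever sizeof(clusterMsg) comes out to in the running build):

```c
/* Overhead ratio of one pub/sub message on the cluster bus,
 * assuming a fixed ~2 kB metadata header (an assumption for illustration). */
#include <stdio.h>

int main(void) {
    const double header = 2048.0;                /* assumed metadata bytes per bus message */
    const double payloads[] = { 55.0, 10240.0 }; /* tiny test value vs. 10 kB */

    for (int i = 0; i < 2; i++)
        printf("%5.0f-byte payload: ~%.1fx the bytes actually published\n",
               payloads[i], (header + payloads[i]) / payloads[i]);
    return 0;
}
```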

I've two ideas which could help us achieve better performance.

  1. Introduce cluster data link to transfer pub/sub related data across the cluster. This would have the same properties as we have for replication link. This will segregate the gossip data and actual data across different link(s).

  2. If the message is published on a primary for sharded pub/sub, the message will be propagated via replication link instead of the cluster bus link. This will reduce the payload overhead incurred during data publish on primary. Client(s) would need to be smart on establishing connection for data production and consumption. This solution won't be extendable to classic pub/sub though.

Comment From: snz2

@hpatro "HIGHER METADATA PAYLOD" I saw the code and WOW is really big. I understand more than 10 times the data I put as a sample. I thought it was a bug or improvement, but those data are being used well in the Redis? But I need a quick and concise feature rather than that verification data, so I can't use Pubsub. Thank you so much for giving the answer.

Comment From: vitarb

@snz2 I agree, the added cost seems too high; the unsigned char myslots[CLUSTER_SLOTS/8]; field alone is 2 kB in size. @hpatro is all this metadata really needed on every message? Or could we use a lighter-weight struct for pubsub message passing?
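For illustration, here is a simplified approximation of the fixed cluster message header as a sketch (the field list is not the exact cluster.h definition; the real point is the size of the slot bitmap, CLUSTER_SLOTS/8 = 2048 bytes):

```c
/* Simplified approximation of the fixed header carried by every cluster-bus
 * message, pub/sub included. The field list is illustrative, not the exact
 * cluster.h definition; the key point is the 2048-byte slot bitmap. */
#include <stdio.h>
#include <stdint.h>

#define CLUSTER_SLOTS   16384
#define CLUSTER_NAMELEN 40

typedef struct {
    char          sig[4];                     /* "RCmb" signature */
    uint32_t      totlen;                     /* total message length */
    uint16_t      ver;                        /* protocol version */
    uint16_t      type;                       /* PING, PONG, PUBLISH, ... */
    char          sender[CLUSTER_NAMELEN];    /* sender node ID */
    unsigned char myslots[CLUSTER_SLOTS / 8]; /* slot ownership bitmap */
    char          slaveof[CLUSTER_NAMELEN];   /* primary node ID, if replica */
    /* ... the real struct also carries IPs, ports, epochs, flags, ... */
} approx_cluster_header;

int main(void) {
    printf("slot bitmap alone:  %d bytes\n", CLUSTER_SLOTS / 8);
    printf("approximate header: %zu bytes\n", sizeof(approx_cluster_header));
    return 0;
}
```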

Comment From: hpatro

@vitarb AFAIK all the different kinds of cluster messages carry this payload currently, so introducing a new message format over the existing cluster links would be a breaking change. We would have to introduce some kind of branching to parse the message correctly. We could think more on this. What are your thoughts on the suggestions I've made above?

Comment From: snz2

> @snz2 I agree, the added cost seems too high; the unsigned char myslots[CLUSTER_SLOTS/8]; field alone is 2 kB in size. @hpatro is all this metadata really needed on every message? Or could we use a lighter-weight struct for pubsub message passing?

That matches how I personally implemented messaging between cluster nodes: the message stays small because each side only sends its own slot range(s). Sending all the slots, IPs, and other information the way the current Redis gossip does seems a bit inefficient. (The hint for this approach came from redis-cli.)

> @vitarb AFAIK all the different kinds of cluster messages carry this payload currently, so introducing a new message format over the existing cluster links would be a breaking change. We would have to introduce some kind of branching to parse the message correctly. We could think more on this. What are your thoughts on the suggestions I've made above?

As far as I know, Redis pub/sub is not affected by slots at all, so it is unfortunate that it depends on the gossip format and has to send 2 kB of slot information.

[{"ip":"10.0.0.1:6378","slots":[["0","16383"]]},{"ip":"10.0.0.1:6379","slots":null}] or with "id":"bceae737fc3d9462333c0b90c134054f63598018" This is the JSON I am currently using, but if you send it like this, it won't seem to send 2KB within the One Master Replica.

As far as I know, the current structure is something like this:

slot 1: 10.0.0.1, bceae737fc3d9462333c0b90c134054f63598018
2: 10.0.0.1, bceae737fc3d9462333c0b90c134054f63598018
3: 10.0.0.1, bceae737fc3d9462333c0b90c134054f63598018
...
16383: 10.0.0.1, bceae737fc3d9462333c0b90c134054f63598018
16384: 10.0.0.1, bceae737fc3d9462333c0b90c134054f63598018

This data is exchanged every time ....

Even if you encode it in binary, it is terrible to carry the same data repeated 16,384 times.

As @hpatro says, if the sender and receiver exchanged slot ranges instead of the full per-slot data, this should be possible without a big change to the overall design.
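A minimal sketch of what such a range-based encoding could look like (purely hypothetical, not anything that exists in Redis today):

```c
/* Hypothetical range-based slot encoding; nothing here comes from Redis.
 * It only illustrates sending (start, end) slot ranges instead of the
 * full 2048-byte bitmap on every message. */
#include <stdio.h>
#include <stdint.h>

typedef struct {
    uint16_t start; /* first slot in the range, 0..16383 */
    uint16_t end;   /* last slot in the range, inclusive */
} slot_range;

int main(void) {
    /* A node that owns the whole keyspace needs one 4-byte range
     * instead of a 2048-byte bitmap. */
    slot_range all = { 0, 16383 };
    printf("one range: %zu bytes vs. full bitmap: %d bytes\n",
           sizeof(all), 16384 / 8);
    return 0;
}
```

In the single master/replica case above this is a few bytes instead of 2 kB; a node with a fragmented slot map would need more ranges, but still far less than the full bitmap.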