Describe the bug
We have been observing strange behaviour with our AWS ElastiCache deployment. We have two Redis nodes: one master and one follower. Both instances are in the same AWS zone.
We publish a significant number of messages through the Redis Pub/Sub mechanism and have about a dozen subscribers. Some subscribers have occasionally been missing messages, and we have confirmed that every subscriber that missed messages was subscribed to the follower.
This was really hard to debug, but what we have observed is that cluster_stats_messages_publish_sent (on the master) and cluster_stats_messages_publish_received (on the follower) slowly diverge. At the same time, the number of ping/pong messages seems to diverge too.
This can be seen here
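For anyone who wants to check the same counters, here is a minimal sketch of how the divergence could be computed from the raw text that CLUSTER INFO returns on each node. It assumes the two-node setup described above, where everything the master sends should arrive at the single follower; the helper names parse_cluster_info and publish_gap are hypothetical, and the counters reset when a node restarts, so only the trend over time is meaningful.

```python
def parse_cluster_info(raw):
    """Turn the 'key:value' lines of a CLUSTER INFO reply into a dict."""
    fields = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def publish_gap(master_info, follower_info):
    """Publish messages sent by the master minus those received by the follower.

    Assumes exactly one master and one follower, as in this report.
    A missing counter is treated as 0.
    """
    sent = int(parse_cluster_info(master_info).get(
        "cluster_stats_messages_publish_sent", 0))
    received = int(parse_cluster_info(follower_info).get(
        "cluster_stats_messages_publish_received", 0))
    return sent - received
```

Sampling this gap periodically on both nodes and plotting it should show whether the difference keeps growing or only reflects messages still in flight.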
We have confirmed that network throughput is not saturated, and we even increased the instance size (to one with better network performance). No improvement was observed.
To reproduce
PUBLISH a significant number of messages, and SUBSCRIBE on the follower replica.
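A rough sketch of how this could be scripted so that loss becomes visible on the client side: the publisher puts a sequence number in every payload, and the subscriber reports which numbers never arrived. It assumes redis-py style objects (a client with publish() and an already-subscribed PubSub with get_message()); the function names and the idle timeout are illustrative choices, not part of our actual setup.

```python
def publish_numbered(client, channel, count):
    """Publish `count` messages whose payload is a sequence number."""
    for seq in range(count):
        client.publish(channel, str(seq))


def collect(pubsub, idle_timeout=5.0):
    """Read sequence numbers from a subscribed PubSub until it goes idle.

    Stops when get_message() returns None, i.e. nothing arrived within
    `idle_timeout` seconds. Subscribe confirmations are skipped.
    """
    received = []
    while True:
        msg = pubsub.get_message(timeout=idle_timeout)
        if msg is None:
            break
        if msg["type"] != "message":
            continue
        received.append(int(msg["data"]))
    return received


def missing_sequences(received, count):
    """Return the sorted sequence numbers in range(count) that never arrived."""
    return sorted(set(range(count)) - set(received))
```

In a real run the publishing client would be connected to the master and the PubSub object would come from a client connected to the follower, with the subscription established before publishing starts.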
Expected behavior
Either no messages are lost, or some operation fails, or the loss is easily observable from the client side so that the client can reconnect.
Comment From: madolson
@brandys11 This is by design of pubsub in Redis, in that it makes no guarantees about delivery. In a sense it's "at most" once, in that internally it makes sure that each node only receives a message at most one time. This may just not work well for your implementation.
If you want a more reliable mechanism for receiving messages, you might consider using streams instead, as it provides a mechanism to subscribe to a point in time of the stream.
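As a minimal sketch of the streams approach, assuming a redis-py style client: the producer appends entries with XADD, and each consumer remembers the ID of the last entry it processed and passes it to XREAD, so entries added while it was disconnected are still returned on the next read. The helper names, the "payload" field, and the default batch size are illustrative only.

```python
def publish_event(client, stream, payload):
    """Append one entry to the stream and return the ID Redis assigned."""
    return client.xadd(stream, {"payload": payload})


def read_since(client, stream, last_id="0-0", count=100, block_ms=None):
    """Return (entries, new_last_id) for entries with an ID after `last_id`.

    "0-0" means "from the beginning of the stream". The caller should
    persist new_last_id and pass it back in on the next call.
    """
    reply = client.xread({stream: last_id}, count=count, block=block_ms)
    entries = []
    for _name, items in reply or []:
        for entry_id, fields in items:
            entries.append((entry_id, fields))
            last_id = entry_id
    return entries, last_id
```

Unlike PUBLISH, XADD is a write that stores data, so the stream needs to be trimmed (for example with XTRIM or the maxlen argument of xadd) if it should not grow without bound.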
Comment From: hpatro
As @madolson mentioned, this is by design: pubsub doesn't support guaranteed delivery. Marking as state-to-be-closed.