Redis [BUG] xclaim claims the message but returns nil

Describe the bug

(I know 1 as min_idle_time is not the best idea, but it was chosen to get a clean example.)

The message is claimed, but it is not returned as the documentation specifies.

To reproduce

Sadly, I do not know :shrug:. We are running Redis on AWS ElastiCache. It works fine 99.9% of the time, but sometimes this happens and it becomes impossible to access the message.

Expected behavior

The message to be returned.

Additional information

I spent hours trying to understand this and I'm nearly desperate. I have no idea what to do, and I do not want to drop Redis :cry:

Comment From: madolson

Hi, I'm from ElastiCache. The only ElastiCache-specific thing that might have introduced some interesting behavior is related to full sync during replication. Do you know if there were any snapshots going on? Also, what Redis version are you running?

For non-ElastiCache questions: is that example a real example of this odd behavior, or just what it looked like? You mentioned you can't reproduce it, so I just wanted to understand.

Salvatore clearly saw something similar, https://github.com/redis/redis/commit/6ba50784b5ce2e4eae74da00536ebbc1f81984ae, but I don't know why his fix makes any sense. "Sometimes entries may not be emitted, producing broken protocol where the array length was greater than the emitted entries, blocking the client waiting for more data." So there are apparently valid cases, so maybe understanding your use case would help a bit.

Comment From: theophanevie

> Hi, I'm from ElastiCache. The only ElastiCache-specific thing that might have introduced some interesting behavior is related to full sync during replication. Do you know if there were any snapshots going on? Also, what Redis version are you running?

First of all, thanks a lot for taking the time to help us! We are running ElastiCache on version 6.0.5. I just checked, and no snapshots were running at the time the message was sent.

> For non-ElastiCache questions: is that example a real example of this odd behavior, or just what it looked like? You mentioned you can't reproduce it, so I just wanted to understand.

A consumer crashed, and while another one was successfully recovering the ~100,000 messages in pending state, 260 messages seem to be stuck. The screenshot provided is me trying to understand what is happening (via RedisInsight 1.9.0, so yes, these are real values).

If I assume that the message was claimed successfully even though XCLAIM returned nil, and I then XACK the message, the message is acknowledged correctly.
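The workaround above could be sketched roughly as follows. This is only a minimal sketch, not a confirmed fix: the helper names `split_claim_reply` and `ack_vanished` are made up for illustration, and `client` is assumed to behave like a redis-py client (`xrange(name, min, max)` and `xack(name, groupname, *ids)`). It also assumes that a nil or absent entry in the XCLAIM reply means the stream entry itself no longer exists, which may not be the only cause. Because an ID can also be absent from the reply simply because it was not idle long enough, the sketch checks with XRANGE that the entry is really gone before acknowledging it.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple


def _as_text(value: Any) -> Any:
    # redis-py returns bytes unless decode_responses=True; normalize IDs to str.
    return value.decode() if isinstance(value, bytes) else value


def split_claim_reply(
    requested_ids: Iterable[Any], reply: Optional[List[Any]]
) -> Tuple[Dict[str, Any], List[str]]:
    """Split an XCLAIM reply into usable entries and IDs that came back empty.

    `reply` is expected to be a list of (entry_id, fields) pairs in which
    nil entries may appear as None or as (None, None).
    """
    claimed: Dict[str, Any] = {}
    for item in reply or []:
        if not item or item[0] is None:
            continue  # nil entry: nothing usable was returned for this slot
        claimed[_as_text(item[0])] = item[1]
    missing = [i for i in map(_as_text, requested_ids) if i not in claimed]
    return claimed, missing


def ack_vanished(client: Any, stream: str, group: str, missing_ids: List[str]) -> List[str]:
    """XACK only those missing IDs whose entry is no longer in the stream."""
    # XRANGE with min == max returns an empty list when the entry does not exist.
    vanished = [i for i in missing_ids if not client.xrange(stream, min=i, max=i)]
    if vanished:
        client.xack(stream, group, *vanished)
    return vanished
```

In a recovery loop, `split_claim_reply` would be called with the IDs passed to XCLAIM and the reply it produced. The `claimed` entries are processed as usual, and `ack_vanished` clears the remaining stuck IDs from the pending list only when the underlying entry cannot be read back.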

This is the second time this has happened in 5 weeks, but the first time it affected only about 9 messages. In both cases, all affected messages are consecutive in time.

> Salvatore clearly saw something similar, 6ba5078, but I don't know why his fix makes any sense. "Sometimes entries may not be emitted, producing broken protocol where the array length was greater than the emitted entries, blocking the client waiting for more data." So there are apparently valid cases, so maybe understanding your use case would help a bit.

We aren’t doing anything fancy: each time an object is modified in some database, we put its new representation in a stream. (Peak load is about 500 entries per second, each less than 100 Kib.) There is currently one consumer group reading this stream.

Comment From: guybe7

@theophanevie please read https://github.com/redis/redis/issues/7021#issuecomment-1027939792

Comment From: theophanevie

Thanks a lot!