Redis [BUG] XREADGROUP with no results does not reset XINFO CONSUMERS idle timer

Describe the bug

The idle field of the XINFO CONSUMERS return value is documented as:

idle: the number of milliseconds that have passed since the consumer last interacted with the server

This number does not reset when the XREADGROUP command for that consumer returns an empty result.

To reproduce

XGROUP CREATE my-stream my-group 0 MKSTREAM
XGROUP CREATECONSUMER my-stream my-group my-consumer

… some time passes …

XREADGROUP GROUP my-group my-consumer COUNT 10 BLOCK 2000 STREAMS my-stream >
XINFO CONSUMERS my-stream my-group

Expected behavior

The idle field in the XINFO CONSUMERS return value, based on the phrase "since the consumer last interacted with the server" in the documentation, seems like it should return the time since that XREADGROUP command finished, since XREADGROUP is an interaction from the consumer.

Additional information

When XREADGROUP returns a non-empty array, it does seem to reset the idle timer, but not when it returns an empty array.

I'm trying to add functionality to a streaming library to remove stale consumers (for example, stream processors that have crashed due to OOM or SEGV) but it does not appear to be possible to automate for a low-volume consumer since the last message and the last interaction appear to be set to the same time.

Comment From: sundb

@jgaskins Because length of my-stream' is 0, XREADGROUP command will block until times out, during this time the consumer is not actually read, because there is no need to read it.

Comment From: jgaskins

during this time the consumer is not actually read

@sundb The documentation says "since the consumer last interacted with the server".

Comment From: jgaskins

To clarify that, it implies that any time I send a command to Redis on behalf of a consumer in a group in a stream, this timer will be reset.

Comment From: sundb

@jgaskins It does seem to be ambiguous. Just like when XREADGROUP returns an error, the consumer's idle is not updated. ping @itamarhaber

Comment From: jgaskins

@sundb There's certainly a discrepancy between the documentation and observed behavior, but the wording of the documentation isn't ambiguous.

If the intent of this timestamp is to be the last time this consumer received a message, that's very different than it marking the last interaction with the server.

Just like when XREADGROUP returns an error, the consumer's idle is not updated.

I could see an argument for this since state mutations needs to be atomic, similarly to how an error in a DEL command won't change state. I don't agree with it in this case ("last interaction" doesn't imply "last state change"), but I can at least see the argument for it.

The Redis docs are great, but when there's a discrepancy between the docs and observed behavior, it's important to know which one is the intended behavior. I work in open-source, too, so I know things get a bit out of sync sometimes. If the documented behavior is the intended behavior, it's a bug. If the observed behavior is intended, I'm happy to post a PR to update the docs.

Comment From: tfn-smartthings

Hi. Has there been any more thoughts around this ticket? Like @jgaskins we were looking to automate some cleanup of stale consumers by using the XINFO CONSUMERS command, but based on the outcome of this ticket we might need to rethink how we do this.

Comment From: guybe7

@jgaskins @tfn-smartthings indeed it seems that "idle" actually means "milliseconds since data was read by this consumer" rather than the documented definition.

i do think that it's also important to expose "milliseconds since last interaction"

we have two options: 1. re-purpose "idle" in 8.0 (breaking change). only if we don't think exposing "milliseconds since data was read by this consumer" is valuable. 2. add another field "not-seen-since" (or another name). this can already be done for 7.2.

@itamarhaber @oranagra thoughts? i prefer (2) obviously

Comment From: oranagra

since i suppose both can be useful, and i also suppose that the current behavior is more useful than the documented one, am i right? i'd vote to update the documentation and add the new field for the other purpose.

in any case, i usually prefer to update the docs rather than introduce a breaking change (unless the implemented behavior is a plain but that's not at all useful).

am i missing something? i didn't read the whole discussion...

Comment From: tfn-smartthings

From my perspective, I see value in both data points, especially when used in tandem with each other. For example, if we notice that time-since-last-interaction is very recent, but time-since-last-data-read is very large, it could be used as an indication that the number of consumers is more than needed, and thus could be scaled down to find a better balance.

Probably not a first-order use case, but some food for thought.

Comment From: guybe7

@tfn-smartthings do you want to PR redis/redis-docs/both?

Comment From: jGleitz

To add to the discussion:

My team also ran into this issue. Like @jgaskins, we implemented code that should clean up after dead consumers. We relied on the public documentation and had a hard time figuring out why our code doesn’t work.

So for us,

add another field "not-seen-since" (or another name). this can already be done for 7.2.

would be great, since that would address our use case. And I think that use case is general enough to be worth addressing.

Comment From: jgaskins

indeed it seems that "idle" actually means "milliseconds since data was read by this consumer" rather than the documented definition.

@guybe7 Is there a source for this information that I'm overlooking?

re-purpose "idle" in 8.0 (breaking change). only if we don't think exposing "milliseconds since data was read by this consumer" is valuable.

add another field "not-seen-since" (or another name). this can already be done for 7.2.

If the current undocumented functionality is useful to folks and is being used in the wild, it's likely pragmatic to keep it as-is and add a new parameter for the currently documented functionality.

i also suppose that the current behavior is more useful than the documented one, am i right?

@oranagra I don't believe "more" vs "less" useful is the right framing. I'm curious about actual (non-hypothetical) use cases for the current undocumented behavior, but some folks need the documented behavior of IDLE. There doesn't seem to be a way to automate cleanup of dead consumers without it (I spent multiple days trying to find a workaround), leading to multiple operational issues.

Comment From: kelsaka

add another field "not-seen-since" (or another name). this can already be done for 7.2. would be great, since that would address our use case. And I think that use case is general enough to be worth addressing.

Great suggestion! This would also be a very good solution in my project. :)

Comment From: guybe7

update: i went in a slightly different direction: 1. i "fixed" the current code so that seen-time/idle actually refers to interaction attempts (breaking change) 2. i added active-time/inactive to refer to successful interaction (what seen-time/idle used to be)

at first, i tried to avoid changing the behavior of seen-time/idle but then i realized that, in this case, the odds are the people read the docs and implemented their code based on the docs (which didn't match the behavior). for the most part that would work fine, except that this issue here was found.

i was working under the assumption that people relied on the docs, and for the most part, it could have worked well enough. so instead of fixing the docs, as i would usually do, i fixed the code to match the docs in this particular case

Comment From: AliEAKarimi

update: i went in a slightly different direction:

i "fixed" the current code so that seen-time/idle actually refers to interaction attempts (breaking change)

i added active-time/inactive to refer to successful interaction (what seen-time/idle used to be)

at first, i tried to avoid changing the behavior of seen-time/idle but then i realized that, in this case, the odds are the people read the docs and implemented their code based on the docs (which didn't match the behavior). for the most part that would work fine, except that this issue here was found.

i was working under the assumption that people relied on the docs, and for the most part, it could have worked well enough. so instead of fixing the docs, as i would usually do, i fixed the code to match the docs in this particular case

It seems that the problem still exists and idle becomes zero when the consumer succeeds in picking up the message from the stream through the xreadgroup command, and it does not become zero when the queue is empty. I tested with versions 7.2.1 and 7.0.10. But he was right until a few days ago and recently his behavior has changed.

Comment From: AliEAKarimi

update: i went in a slightly different direction:

i "fixed" the current code so that seen-time/idle actually refers to interaction attempts (breaking change)

i added active-time/inactive to refer to successful interaction (what seen-time/idle used to be)

at first, i tried to avoid changing the behavior of seen-time/idle but then i realized that, in this case, the odds are the people read the docs and implemented their code based on the docs (which didn't match the behavior). for the most part that would work fine, except that this issue here was found. i was working under the assumption that people relied on the docs, and for the most part, it could have worked well enough. so instead of fixing the docs, as i would usually do, i fixed the code to match the docs in this particular case

It seems that the problem still exists and idle becomes zero when the consumer succeeds in picking up the message from the stream through the xreadgroup command, and it does not become zero when the queue is empty. I tested with versions 7.2.1 and 7.0.10. But he was right until a few days ago and recently his behavior has changed.

@guybe7