Describe the bug
XINFO Lag field is reported wrong in some cases.
To reproduce and expected behavior
Tested against Redis 7.2.4 (d2c8a4b9/0) 64 bit with redis-cli 7.2.4 (git:d2c8a4b9)
I will report two different cases
1) CASE 1
127.0.0.1:6379> XGROUP CREATE planets processing $ MKSTREAM
OK
127.0.0.1:6379> XADD planets 0-1 name Mercury
"0-1"
127.0.0.1:6379> XADD planets 0-2 name Venus
"0-2"
127.0.0.1:6379> XREADGROUP GROUP processing alice STREAMS planets >
1) 1) "planets"
2) 1) 1) "0-1"
2) 1) "name"
2) "Mercury"
2) 1) "0-2"
2) 1) "name"
2) "Venus"
127.0.0.1:6379> XADD planets 0-3 name Earth
"0-3"
127.0.0.1:6379> XADD planets 0-4 name Jupiter
"0-4"
127.0.0.1:6379> XINFO GROUPS planets
1) 1) "name"
2) "processing"
3) "consumers"
4) (integer) 1
5) "pending"
6) (integer) 2
7) "last-delivered-id"
8) "0-2"
9) "entries-read"
10) (integer) 2
11) "lag"
12) (integer) 2
127.0.0.1:6379> XDEL planets 0-1 (Added this after first report, after realizing that this step is necessary for bug to occur)
(integer) 1
127.0.0.1:6379> XDEL planets 0-2
(integer) 1
127.0.0.1:6379> XINFO GROUPS planets
1) 1) "name"
2) "processing"
3) "consumers"
4) (integer) 1
5) "pending"
6) (integer) 2
7) "last-delivered-id"
8) "0-2"
9) "entries-read"
10) (integer) 2
11) "lag"
12) (integer) 2
127.0.0.1:6379> XDEL planets 0-3
(integer) 1
127.0.0.1:6379> XINFO GROUPS planets
1) 1) "name"
2) "processing"
3) "consumers"
4) (integer) 1
5) "pending"
6) (integer) 2
7) "last-delivered-id"
8) "0-2"
9) "entries-read"
10) (integer) 2
11) "lag"
12) (integer) 2 <=== SHOULD BE NIL
```
According to documentation at https://redis.io/docs/latest/commands/xinfo-groups/
```
There are two special cases in which this mechanism is unable to report the lag:
1) ......
2) One or more entries between the group's last-delivered-id and the stream's last-generated-id were deleted (with [XDEL](https://redis.io/docs/latest/commands/xdel/) or a trimming operation).
In both cases, the group's read counter is considered invalid, and the returned value is set to NULL to signal that the lag isn't currently available.
We have deleted an item between last-delivered-id and the stream's last-generated-id. It should report that lag is not available. If we continue to read all items in the stream until the end, there seems to be another related problem.
```
127.0.0.1:6379> XREADGROUP GROUP processing alice STREAMS planets >
1) 1) "planets"
2) 1) 1) "0-4"
2) 1) "name"
2) "Jupiter"
127.0.0.1:6379> XINFO GROUPS planets
1) 1) "name"
2) "processing"
3) "consumers"
4) (integer) 1
5) "pending"
6) (integer) 3
7) "last-delivered-id"
8) "0-4"
9) "entries-read"
10) (integer) 3 <=== SHOULD BE 4. entries-added is 4.
11) "lag"
12) (integer) 1 <=== SHOULD BE 0
Again according to here:
Once the consumer group delivers the last message in the stream to its members, it will be set with the correct logical read counter, and tracking its lag can be resumed.
The lag should be 0. We have read everything there is. There are no undelivered messages that we can read. And following the logic of `lag = entries-added - entries-read` and also doc `The counter attempts to reflect the number of entries that the group should have read to get to its current last-delivered-id`, the entries-read should be 4.
2) Another case, the only difference is that we only delete 0-3 and not 0-2. This time the lag is reported as Nil correctly at first. But it wrong again if we read all the messages in the stream till the end afterwards.
127.0.0.1:6379> XGROUP CREATE planets processing $ MKSTREAM OK 127.0.0.1:6379> XADD planets 0-1 name Mercury "0-1" 127.0.0.1:6379> XADD planets 0-2 name Venus "0-2" 127.0.0.1:6379> XREADGROUP GROUP processing alice STREAMS planets > 1) 1) "planets" 2) 1) 1) "0-1" 2) 1) "name" 2) "Mercury" 2) 1) "0-2" 2) 1) "name" 2) "Venus" 127.0.0.1:6379> XADD planets 0-3 name Earth "0-3" 127.0.0.1:6379> XADD planets 0-4 name Jupiter "0-4" 127.0.0.1:6379> XDEL planets 0-3 (integer) 1 127.0.0.1:6379> XINFO GROUPS planets 1) 1) "name" 2) "processing" 3) "consumers" 4) (integer) 1 5) "pending" 6) (integer) 2 7) "last-delivered-id" 8) "0-2" 9) "entries-read" 10) (integer) 2 11) "lag" 12) (nil) 127.0.0.1:6379> XREADGROUP GROUP processing alice STREAMS planets > 1) 1) "planets" 2) 1) 1) "0-4" 2) 1) "name" 2) "Jupiter" 127.0.0.1:6379> XINFO GROUPS planets 1) 1) "name" 2) "processing" 3) "consumers" 4) (integer) 1 5) "pending" 6) (integer) 3 7) "last-delivered-id" 8) "0-4" 9) "entries-read" 10) (integer) 3 <=== SHOULD BE 4. entries-added is 4. 11) "lag" 12) (integer) 1 <=== SHOULD BE 0
That is all of course, if I understand the document correctly.
**Comment From: sundb**
@sancar thanks, this will be fixed in #13338, do you want a try?
**Comment From: sancar**
Sure @sundb , I will try.
**Comment From: sundb**
@sancar sorry, i realize that the fix in https://github.com/redis/redis/pull/13338 is totally wrong.
in your case, the max deleted id is before the first entry of the stream, so the lag is calculated by using `s->entries_added - s->length`.
please wait for the opinion from @itamarhaber.
**Comment From: sancar**
My tests seem to be all passing now.
Can you also clarify when `entries-read` should be nil ? Redis seems to be returning `Nil` for it sometimes but documentation is not clear on that.
**Comment From: sundb**
@sancar actually, the root cause is the last id of consume group is before the max deleted id, and the first id of stream is after the max deleted id at the same time.
althought this fix works, i'm still concerned about if it may bring some side effects.
**Comment From: artikell**
I think there are inherent issues with this design. entries_read depends on the value of max_deleted_entry_id, which indicates that entries read is related to the order of xread and xdel. For example, the following example:
test {Consumer group lag with with tombstone V1} {
r DEL x
r XGROUP CREATE x processing $ MKSTREAM
r XGROUP CREATE x processing1 $ MKSTREAM
r XADD x 0-1 name Mercury
r XADD x 0-2 name Venus
r XADD x 0-3 name Earth
r XADD x 0-4 name Jupiter
r XREADGROUP GROUP processing alice COUNT 2 STREAMS x >
r XDEL x 0-3
set reply [r XINFO STREAM x FULL]
set group [lindex [dict get $reply groups] 0]
assert_equal [dict get $group entries-read] 2
}
test {Consumer group lag with with tombstone V2} {
r DEL x
r XGROUP CREATE x processing $ MKSTREAM
r XGROUP CREATE x processing1 $ MKSTREAM
r XADD x 0-1 name Mercury
r XADD x 0-2 name Venus
r XADD x 0-3 name Earth
r XADD x 0-4 name Jupiter
r XDEL x 0-3
r XREADGROUP GROUP processing alice COUNT 2 STREAMS x >
set reply [r XINFO STREAM x FULL]
set group [lindex [dict get $reply groups] 0]
assert_equal [dict get $group entries-read] 2
}
The second case will fail. I think this is not in line with expectations. Here, we cannot generate data for lag and entries_read without relying on max_deleted_entry_id.
What do you think? @sundb
**Comment From: sundb**
@artikell please note the last line in the document:
In both cases, the group's read counter is considered invalid, and the returned value is set to NULL to signal that the lag isn't currently available.
``
for stream, he has forgotten0-1and0-2after runningr XREADGROUP GROUP processing alice COUNT 2 STREAMS x >, so we can't calculate theentries_read` reply on our memory.
in short, the stream group can only calculate the entries-read at the current position.
Comment From: artikell
@artikell please note the last line in the document:
In both cases, the group's read counter is considered invalid, and the returned value is set to NULL to signal that the lag isn't currently available.for stream, he has forgotten
0-1and0-2after runningr XREADGROUP GROUP processing alice COUNT 2 STREAMS x >, so we can't calculate theentries_readreply on our memory. in short, the stream group can only calculate the entries-read at the current position.
I understand this issue, but should Redis ensure consistency between entries read and lag. When the lag is invalid, entries read is displayed as valid. This is very confusing. In Xinfo, the result of entries read depends on the execution order of xread, while lag does not.
Comment From: sundb
@artikell thanks, i've been thinking about this too.
there is actually a way to ensure that entries_read always valid, it always reply on xreadgroup command.
i'm busy elsewhere now, need some time.
perhaps you'd like to fix it based on my PR: https://github.com/redis/redis/pull/13338.