Redis Breaking API changes in RESP3 - Nineya|java/go/python

So... I'm in the process of implementing RESP3 client changes in SE.Redis; @slorello89 kindly commented on breaking API changes, which ... confused me, but after investigating: they're right; so far I have identified only two:

ZRANGE (result pairs become jaggified)
XREAD (drops a level of nesting between the array and map representation)

I do not know for sure that this list is exhaustive.

In both cases, the shape of the returned data changes in a non-trivial (i.e. calling-code-breaking) way based on whether the connection is RESP2 vs RESP3. What makes it worse is:

it is completely undocumented on the command reference
it is completely undocumented on the protocol specification
if we look at the ZRANGE change, this major API change was not even called out or discussed on the PR
oh, and it isn't even just "on RESP3" - it is is "on RESP3 and some server version > 6.2.6" (I haven't found the exact version, sorry)
(I haven't tracked the XREAD change; please feel free to do so; nor can I reasonably find every such change, unless I stumble over them in my test rigs)

I don't want to beat about the bush; this is bad. Very bad. Like: jaw-droppingly hostile towards consumers; I think there have been multiple significant failings here - and please understand: I get it, these things happen - the point of this post is this cannot keep happening.

In order:

The returned data shape is PART OF THE API; these major API breaks SHOULD NOT HAVE HAPPENED
Or if they did happen, it should have been with an opt-in argument
Or at least, in a "major"
And it must be clearly documented in the command specification
And it should have been properly discussed in the PR when the change was made (ideally with the result "it doesn't happen")

From a client library angle, I now need to know a lot of variations (server version, protocol, etc) - and they aren't documented.

I'm looking for some help from Redis folks here; hopefully some positive outcomes; IMO:

all such API breaks should be clearly documented on the relevant command pages, including protocol and server version matrix
all such API breaks should be collated somewhere, including protocol and server version matrix
please please please don't make any more of these! (changing from an interleaved "array" to "map": fine, no problem! but changing the nesting layout or jaggifying an array? nononononononono)

(if 1/2 sound like a lot of work: yes! yes, it is - and right now you're pushing that work to all the users and library maintainers, probably to be found the hard way)

Help me Redis-Kenobi; you're my only hope!

Comment From: uglide

Hi @mgravell

I appreciate your honesty in sharing your feelings and perspective on RESP3 changes. It's really frustrating experience for any Redis client developer to rely only on integration tests to detect such breaking changes between RESP2 and RESP3 responses.

I think the intention with RESP3 was to apply all fixes and typical data transformations that other clients do for RESP2 responses. That's why in this particular case with ZRANGE WITHSCORES, Redis now returns an array of arrays to represent data close to how it looks in the client API (e.g Lettuce and StackExchange.Redis).

Also, the team made a great effort to expose an up-to-date reference on reply schemas for all commands through the COMMAND command.

Still, the above doesn't justify the absence of a RESP2 to the RESP3 migration guide for command responses that should serve as a guideline to any client developer.

I'm working on such a documentation page to systemize all "lessons learned" during the RESP3 implementation in official Redis clients: redis-py, node-redis, jedis, etc., and will share a draft soon.

Comment From: mgravell

That's great. Of particular note / importance is emphasizing that some of these are not just protocol dependent, but server version dependent, so: any such guidance also need to emphasize "you need to check the shape; it could look like X or Y"

Comment From: mgravell

I think I must have gotten very lucky; the fix I put in after ZRANGE, in SE.Redis, caught almost all of these (we have a generalized handler for interleaved data). There's a few more in here I still need to look at, but: mostly seems to be that one scenario. Personally I would have advocated for "let's add a new prefix token that distinguishes unordered pairs from ordered pairs" rather than go jagged, but: what's done is done.

Also, the team made a great effort to expose an up-to-date reference on reply schemas for all commands through the COMMAND command.

That's great, but in reality I think most people use the docs, and AFAIK those don't reflect these changes

Comment From: oranagra

@mgravell I'm not certain i understand what you mean by jagged response in ZRANGE. AFAIR both of these commands were changed from flat array to nested one when RESP3 was introduced in 6.0, so that makes it an opt-in flag, and as Igor mentioned AFAIK the whole reason for this protocol change was to make client libraries simpler.

since you explicitly mentioned 6.2.6 i assume i'm missing something, so please give me more details, and i'll track down these changes.

you're of course right that it should documented in each command (not in the protocol spec), and as Igor mentioned we're in the the process if making it happen (by adding metadata to the commands in redis itself), see #10273. it should one day also make it to the command docs pages as well.

this major API change was not even called out or discussed on the PR

AFAIK This PR is just an optimization, the change was in 6.0 (when RESP3 was introduced), or maybe i'm missing your point. Note that we did do some breaking changes after 6.0 in order to fix RESP3 inconsistency issues (see #8981) Regrettably, these were done in a patch level release, IIRC we wanted to change the broken reply ASAP hoping that this way the change will cause less breakage.

All these breaking changes (at least the ones that were done on purpose) are clearly documented in the release notes, so if we want to note them in the command docs (and i agree we should), it should be rather easy to find them by going over these notes.

Comment From: mgravell

AFAIK This PR is just an optimization, the change was in 6.0 (when RESP3 was introduced)

If that's the case, then: my bad on that misunderstanding - that was me trying to dig through history to find the "when", and that looked like the most likely change origin

are clearly documented in the release notes

Can you help me find them? The raw notes only mention ZRANGE (or "nested") once, and that is in 6.0.15 when it notes that an additional command (ZPOPMAX/MIN) has been modified to use jagged arrays. The github releases have 6.2-rc1 as the first release with actual content; I've been through the various pages, and the same 6.0.15 note (also duplicated in 6.2.5) is the only thing that mentions ZRANGE (I'm just using ZRANGE as an example here). If there's a third release notes location that cites these things, please point me at it so I can work through whatever it contains.

Comment From: oranagra

I'm referring to the 00-RELEASENOTES files in the various release branches, until recently we didn't add that to the GH releases. and also these release notes only became very detailed starting 6.0.6 (i can't take responsibility on what's before that). for grepping, i suggest to look for "breaking" or "behavior".

If you're looking for the change that made ZRANGE and XREAD nested, you won't find it, since that's part of the RESP3 feature of 6.0 (not a breaking change by my definition, since it's opt-in). but again, maybe i'm missing something, so let's first check (maybe empirically) if there is a change in these two commands between 6.0.0 and 7.2, and then set the right context for this discussion.

i.e. we can have (or should have) two discussions: * one about the fact that the change between RESP2 and RESP3 isn't as trivial as just changing array to map, and about improving the documentation (i think #10273 does exactly that, but it's not yet complete) * the other is about breaking changes in minor releases, avoiding them, and better documenting them retroactively.

so let's start by focusing on the first one, and after we conclude that, move to the second.

Comment From: guybe7

for the record: 470c2838 and c2e5be04 are the commits that changed ZRANGE (in unstable since 6.0-rc1)

Comment From: oranagra

these two commits are part of 6.0.0 GA, so that's not a compatibility issue

[edit] ohh, you mean that as a proof that this format existed since RESP3 was introduced and not later on.. well, i'm still unsure if i'm not missing what @mgravell complains about, and if indeed there were changes done later that i'm not aware of.

Comment From: mgravell

I'm referring to the 00-RELEASENOTES files in the various release branches

I believe the "raw notes" link above is the same content as that, but: here's the direct 6.0 link - I'm not seeing these "clearly documented" callouts; if I'm missing it, let me know a phrase to search for.

i suggest to look for "breaking" or "behavior"

I can see nothing directly relevant, except perhaps the aforementioned ZPOPMIN/ZPOPMAX, and certainly nothing that cites XREAD or ZRANGE in these breaking changes (or the many others impacted). Again, if I'm just not seeing it: please let me know a relevant passage to find.

and if indeed there were changes done later that i'm not aware of

it looks like ZPOPMIN/ZPOPMAX changes were added mid-release, but again: the main point is: I still can't find clear messaging to have expected the wider change in any of the release notes, protocol specs, or command documentation; I think I've handled all the breaking cases now, and that's fine for things handled inside the library, but: it does mean I have to be cautious about auto-enabling RESP3 in the library because I can't tell if a caller is using ad-hoc mechanisms to call impacted commands, in which case their code could break unexpectedly, which is a shame.

Comment From: mgravell

these two commits are part of 6.0.0 GA, so that's not a compatibility issue

My view is that changing the "major" shape (by "major" I mean the structural change, rather than just the metadata change) of an API response is a compatibility issue regardless of there being a protocol delta. Combine that with the command docs saying it returns something that it simply doesn't, at least in some scenarios.

Comment From: oranagra

I'm not seeing these "clearly documented"

and certainly nothing that cites XREAD or ZRANGE in these breaking changes.

You're not seeing them because they're not breaking changes (at least not in my eyes). Again unless i'm missing something, in which case please give an example of a reply that change and i can locate the change.

Your original complaint said that things got broken between server versions (or even in minor versions), and that there's no opt-in flag (sending HELLO 3 is an opt in flag), so if this whole discussion is about an unclear mapping between RESP2 and RESP3, please say so, and we can focus on that (that's not a breaking change IMHO).

Comment From: mgravell

I'm trying to have a discussion, not a "complaint", but: that's perhaps a semantic issue.

You're not seeing them because they're not breaking changes (at least not in my eyes).

OK, I acknowledge that we disagree on this. I consider them breaking changes.

However, the point I was responding to was that you previously said they "are clearly documented in the release notes"; now you're saying that they aren't mentioned at all, because they're "not breaking changes".

Sorry if I'm being pedantic here, but I genuinely don't see how a command suddenly returning a completely different shape of data - and a shape of data that disagrees with the documentation - can be anything other than a breaking change. And a completely undocumented breaking change.

things got broken between server versions (or even in minor versions)

If I was wrong about the time when ZRANGE changed shape, then fine; sorry for confusing things. Let's not get distracted by that (even if ZPOPMIN/MAX might qualify); the issue IMO is that the commands suddenly change shape to an undocumented one

sending HELLO 3 is an opt in flag

Well, yes and no. Because most consumers don't know about RESP at all - they have a library to deal with things like that, including handshakes. But either way, an opt in flag still needs to be clearly signposted as to what that flag changes, and right now: it isn't. There is no documentation anywhere that I can find that says "oh, and by the way, {these} commands will suddenly start returning a completely different and undocumented shape".

so if this whole discussion is about an unclear mapping between RESP2 and RESP3,

No, the mapping between RESP2 and RESP3 is perfectly clear. The problem is that structural changes outside of RESP2 vs RESP3 were made, at the same time, in an undocumented way. To emphasize: the jagged response from ZRANGE: perfectly expressable in RESP2; this is not a RESP2 vs RESP3 thing.

Ultimately, I think RESP3 was seen as a "line in the sand" opportunity to make some API revisions; personally I would have prioritised compatibility over everything else (and FWIW I don't see the jagged response as an improvement - it takes more bytes and involves a lot more parse processing; genuinely, I see the nested approach here as a misstep, but one that has happened), but: I genuinely do understand the motivation and temptation. The problem is that you don't consider these breaks, under the heading "well, they're already making a big toggle". OK, well if that is the intended philosophy (and let's face it: what's done is done): it needs to be clearly communicated and documented, so at a minimum, that is the biggest failure here. I still haven't found a list of what those breaks are. Have I caught them all? I don't know; maybe my test suite isn't as comprehensive as would be ideal.

Comment From: mgravell

The secondary part of what I'm trying to achieve by having this discussion, is to try to steer the server folks to think a bit more in terms of client compatibility. Redis historically has a pretty enviable record of compatibility and not making hard breaks. We've slammed into the wall on this occasion, and because I genuinely care about the redis ecosystem, I'm trying to help highlight "this combined set of decisions: was a problem that threw client compatibility under a bus; let's please work collectively to try to avoid this kind of outcome in the future".

As I tried to say earlier, I think there's 2 ideal outcomes here:

the set of breaking changes (and yes, I'm going to keep calling them that) needs to be collated and communicated (and updated on the command documentation pages)
more consideration is applied in future to avoid this kind of API break

Comment From: oranagra

Sorry if I'm being pedantic here, but I genuinely don't see how a command suddenly returning a completely different shape of data - and a shape of data that disagrees with the documentation - can be anything other than a breaking change. And a completely undocumented breaking change.

it's not suddenly, it's only as a result of the client requesting that change. i agree that the documentation is lacking (about this feature, not "a change", and not a sudden one).

regarding ZPOPMIN we did clearly document it in the release notes, but i agree we should also document it in the command docs, and also avoid such breaking changes in the future.

sending HELLO 3 is an opt in flag

Well, yes and no. Because most consumers don't know about RESP at all - they have a library to deal with things like that, including handshakes.

i'm a little bit out of my domain, but i'd say that the client library can permit using the RESP3 feature only after it can handle it, and all the changes that it requires. i'd also like to remind us that all of these RESP3 decisions (other than the ZPOPMIN one) were taken by Salvatore, so my arguments are only here retroactively.

so if this whole discussion is about an unclear mapping between RESP2 and RESP3,

No, the mapping between RESP2 and RESP3 is perfectly clear. The problem is that structural changes outside of RESP2 vs RESP3 were made, at the same time, in an undocumented way. To emphasize: the jagged response from ZRANGE: perfectly expressable in RESP2; this is not a RESP2 vs RESP3 thing.

i can argue that in this case the term "mapping between RESP2 and RESP3" is not just about the difference in protocol itself, but also about how redis implements commands, and indeed it causes redis to change a few other things that are not trivial, and should have been documented. Again, AFAIK the whole purpose for RESP3 was so that in many cases the client won't need to understand the command in order to forward the response to the app, the respond should stand for itself, and in that case it must have nested arrays rather than a flat one. But anyway, i wasn't "in charge" when all of this happened.

We are in agreement that the documentation is in desperate need of improvement, but not about the words commands suddenly change shape.

Comment From: oranagra

Taking a step backwards, i think the reason we disagree on ZRANGE response being a breakage is that when i think of RESP3 i don't think of it solely as a new low level protocol, but also a mode change in redis (meant for command replies to stand for themselves), and i acknowledge that a client must not enable it before being prepared to handle everything that changed (not just the low level protocol).

I certainly agree we need to improve the documentation, and i also agree we should do better to avoid breaking changes in the future, and think more about clients (i think we already doing better than we used to do, and we're aiming for more)

Aside from that discussion, i'll try to help mapping all the non-trivial changes in the immediate timing, so that you'll have a path forward.

Comment From: mgravell

But anyway, i wasn't "in charge" when all of this happened.

Please understand me; I am absolutely 100% not trying to make this about any one individually; I'm not trying to fingerpoint; what's done is done, I get that. My sole aim as a: to get the veil lifted on what did change silently, and b: to try to steer towards not repeating this. It sounds like you're on the same page here on both of these things, so: great!

Comment From: mgravell

but also a mode change in redis (meant for command replies to stand for themselves)

I have not seen any documentation that advises people of this (beyond the metadata details in the protocol specification)

and i acknowledge that a client must not enable it before being prepared to handle everything that changed (not just the low level protocol).

I have not seen any documentation that tells them what "everything that changed" is, so I'm not sure how anyone is expected to do that

Maybe these conversations happened internally, but they have simply not been adequately communicated

Comment From: oranagra

maybe you're right (i might be aware of some internal conversion, or maybe i extrapolated things to conclude others), but i imagine that's not the case, and everything i know about RESP3 was publicly posted. in any case, we can certainly try to improve that.

What i can easily find, is that Salvatore "blogged" that reasoning into the spec itself:

At the same time, RESP has flaws. One of the most obvious is the fact that it has not enough semantic power to allow the client to implicitly understand what kind of conversion is appropriate. For instance the Redis commands LRANGE, SMEMBERS and HGETALL will all reply an array, called in RESP v2 terms a multi bulk reply. However the three commands actually return an array, a set and a map. Currently, from the point of view of the client, the conversion of the reply to the programming language type is command-dependent. Clients need to remember what command was called in order to turn the Redis reply into a reply object of some kind. What is worse is that clients need to know about each command, or alternatively provide a lower level interface letting users select the appropriate conversion.

I.e. What i understand is that the purpose is for the client library to be able to handle the reply without intimate knowledge of the command, and that means to me that ZRANGE WITHSCORE should return a nested data structure. p.s. in Redis 7.0 we went further with that by extending the COMMAND (DOCS and INF) command in an attempt to make module commands first-class citizens and help client libraries rely less on hard coded logic and mode about auto discovery.

@itamarhaber maybe you know of another place where the philosophy behind RESP3 was described? @uglide do you know who to pull into this discussion in order to drive some work on improving the docs?

Comment From: uglide

@mgravell Please use this draft as a guideline for changed reply schemas in RESP3 https://docs.google.com/document/d/1ONG0yq7PiSop62GndTm4qcFKDmbLJzSG9KU9XUa3hhE/edit?usp=sharing

Comment From: mgravell

Re Salvatore's blog: that example isn't a structural change, though; the array vs set is just *6 vs ~6; the array vs map is just *6 vs %6; but the data is the same fundamental shape and parsing the values works identically, to the observer. It is a semantic/metadata change, sure, but we can usually overlook such niceties.

and that means to me that ZRANGE WITHSCORE should return a nested data structure.

Personally I think that was a mistake; redis already had a familiar interleaved pattern. I understand that you didn't want to describe this as a "map" due to the ordering concern. With hindsight, I would have simply proposed a spec change (which happened any when "map" was flipped between "ordered" and "unordered", which is a thing that happened) that added a new token that works the same as % but with ordering; maybe just "ordered pairs". Oh well, that ship has sailed.

by extending the COMMAND

Since that is only available in Redis 7, most general clients can't rely on that approach, because they don't know that they'll be using Redis 7 (people migrate slowly). And adding a big "fetch all the commands, parse all the semantics, now build some strategies" at runtime would represent a huge handshake cost (per reconnect), and somewhat defeat the purpose of having well-defined commands. IMO.

Comment From: oranagra

Re Salvatore's blog: that example isn't a structural change, though

ignore the example and look at the justification: Clients need to remember what command was called in order to turn the Redis reply into a reply object of some kind

Anyway, the evidence are that he wrote that paragraph (not an internal conversation), and also implemented ZRANGE the way he did. it's futile to keep arguing about it, what we can do now is see what we can do to improve the docs.

the reason i mentiones COMMAND is just to indicate that we're also working in that direction, which is to let client libraries use less hard coded logic, and more automatic one. time will tell if it'll really happen...

Comment From: oranagra

i'm gonna close this issue now, stating that in light of the above comment HELLO 3 sets redis into a different response mode, not just changing a low level protocol, and in that light these changes in the replies are not sudden breaking changes.

considering that seems to have been the intention back in 6.0, and in any case it is not possible to change now, this long discussion isn't productive.

i urge people in this discussion to open productive suggestion issues or PRs in the redis-docs repo. i'll try to assist with pushing that forward myself too if i can.

Thank you for raising attention to this.

Comment From: NickCraver

@oranagra With respect, this should remain open until the docs are addressed. I think there's too much focus here on what was breaking/broken/when/whatever. I personally agree the situation isn't ideal, but also that the history matters much less than having correct docs for these. At the moment, by the documentation, turning on RESP3 yields responses that aren't in the docs. That's what is, right now, what's broken and hard to validate/build for.

Let's agree the fixes are in the docs, and the current issue is the docs don't match, but that should be open and tracked. As library authors, we need this. Please note that users are allow to call "Execute" in our library to run a raw command, so to them the shape can and does matter too (and especially in the LUA case, which is a primary reason we can't automatically turn on RESP3 any more like we planned).

Can we please leave this open, or make a documentation issue? The draft doc is great and I'm glad it's open, but we need some notice on when this is done so we can validate, and I imagine that'd be useful to any library author or anyone dealing with the API. An issue with updates is the proper place for that.

Comment From: oranagra

i agree it should be tracked, but it should be tracked in the redis-doc repo not here. also, the issue we open there should be focused on listing what should be done going forward (what's missing in the docs), and not about a debate on how things were done in the past. with the current title (breaking changes), description (about 6.2 breakage that's untrue), and lots of debates about whats a breaking change, and what was the intent of the author, this (long to read) issue isn't really helpful for setting the tasks going forward.

please open an issue in redis-doc, with either a list of suggested actions, or a list of things that are unclear, and add a reference to this one. if you rather that i'll open it, i can. i just feel it may be better if you guys do it.

Comment From: mgravell

and especially in the LUA case, which is a primary reason we can't automatically turn on RESP3 any more like we planned

@NickCraver actually I don't think Lua is impacted, unless there's an additional opt-in flag you need to invoke inside Lua; I believe my Lua tests on RESP3 resulted in the original RESP2 data types. I'm not 100% confident on this, and I'm happy to take clarification / context. Execute is definitely impacted, though.

Comment From: mgravell

@NickCraver update (I already had this in my tests, but had forgotten): Lua is not impacted unless the script includes redis.setresp(3)