Hello,
It seems that, since a while, the "+" character in a raw query parameter input is not being properly detected and encoded by UriComponents.
I've managed to track it down to this being the root cause: https://github.com/spring-projects/spring-framework/commit/f2e293aadf98cb64ba7ef3044b59f469efd21503
Before:
QUERY_PARAM {
@Override
public boolean isAllowed(int c) {
if ('=' == c || '+' == c || '&' == c) {
return false;
}
else {
return isPchar(c) || '/' == c || '?' == c;
}
}
}
After:
QUERY_PARAM {
@Override
public boolean isAllowed(int c) {
if ('=' == c || '&' == c) {
return false;
}
else {
return isPchar(c) || '/' == c || '?' == c;
}
}
}
My best guess so far is that this change was targetting already encoded query params, as can be seen by the Test class added in the commit, but inadvertidly fails to consider raw inputs -- in particular, for my case, ISO date-time.
h3. Example
UriComponentsBuilder.fromHttpUrl(path)
.queryParam("time", java.time.OffsetDateTime.now())
.build()
.encode(StandardCharsets.UTF_8) // I am not sure if this explicit "encode" is necessary or not as it fails regardless
.toUriString();
With the above, the "raw" time will be something similar to 2023-10-04T17:30:49.4301344+02:00
.
Since the +
character in an URL is just a space, this will then have the final value of 2023-10-04T17:30:49.4301344 02:00
, which the server is unable to understand.
Note that if you are naturally in a -
timezone, then it would work fine -- just input the date as a String then, if you would like to verify this.
The current behaviour is something I would expect when using build(encoded: true)
, which is, in my humble opinion, what the Test in the above mentioned commit is actually validating.
UriComponentsBuilder.fromHttpUrl(path)
(...)
.build(true)
.toUriString();
Sorry in advance if something is not clear here. Please let me know in case you would like any further details and I'll gladly provide them.
Comment From: bclozel
The Spring Framework code you're pointing at is indeed linked to #19394. +
is indeed allowed in the query parameter.
For this case, you should know that only URI template variables are strongly encoded; the other parts of the URI template are simply checked for allowed characters. Here, "+" is allowed so it's not encoded.
I've tested this successfully:
URI uri = UriComponentsBuilder.fromHttpUrl("https://example.org/test")
.queryParam("time", "{time}")
.encode()
.buildAndExpand("2023-10-04T17:30:49.4301344+02:00")
.toUri();
assertThat(uri).asString()
.isEqualTo("https://example.org/test?time=2023-10-04T17%3A30%3A49.4301344%2B02%3A00");
URI encoding can be tricky, and we have a dedicated section in the reference docs for that.
Comment From: blaghed
Hi @bclozel ,
Thank you very much for the quick reply, though I am sad that it was closed just as quick. I am not sure on the proper process to report a bug, so I apologize if this is going against the grain, but kindly allow me to attempt to refute your argument as I do not agree with it.
First: yes, I can indeed use buildAndExpand
to work around the issue, which is what I will be doing now. However, I consider that to be just a workaround, not a fix.
Consider the following:
URI uri = UriComponentsBuilder.fromHttpUrl("https://example.org/test")
.queryParam("var", "1 + 1 = 2")
.encode()
.build()
.toUri();
assertThat(uri).asString()
.isEqualTo("https://example.org/test?var=1%20+%201%20%3D%202");
The ending result of https://example.org/test?var=1%20+%201%20%3D%202"
is incorrect because, in a URL, +
is a reserved character that signifies "space", same as using %20
. Yes, it is a valid character, but it is also an incorrect one to not encode in such situations where you attempt to auto-detect if encoding is needed.
As it stands, the server will receive var: 1 1 = 2
instead of var: 1 + 1 = 2
.
Now, if you want to say that "+" is intended to be a space, then it is my perspective that such a value is already encoded, and as such should be declared like this:
URI uri = UriComponentsBuilder.fromHttpUrl("https://example.org/test")
.queryParam("var", "1+%2b+1+%3D+2")
.encode()
.build(true)
.toUri();
assertThat(uri).asString()
.isEqualTo("https://example.org/test?var=1+%2b+1+%3D+2");
So, by all of this I simply mean that, logically, either all of the query parameters should be considered to be entirely encoded, or not encoded at all, instead of just partially-encoded as it works now.
It is of course completely ok for you to disagree with this perspective, but I would be very thankful if you could explain the reasoning behind it so I can get in better sync with you as a user of your framework.
Comment From: bclozel
First: yes, I can indeed use buildAndExpand to work around the issue, which is what I will be doing now. However, I consider that to be just a workaround, not a fix.
This is not a workaround, but rather a way to signal the intent behind the query param value you're giving to the uri builder.
The ending result of https://example.org/test?var=1%20+%201%20%3D%202" is incorrect because, in a URL, + is a reserved character that signifies "space", same as using %20. Yes, it is a valid character, but it is also an incorrect one to not encode in such situations where you attempt to auto-detect if encoding is needed.
We don't attempt to auto-detect if encoding is needed, we expect you to know that depending on how you're providing the information. Without that, there is no way for us to know if the +
char in the query param value is already encoded or not. When providing the query param value directly expanded in .queryParam
, we must assume that it's been already encoded according to the RFC rules for this part of the URL.
Decoding the value you've provided "1+%2b+1+%3D+2"
yields "1+++1+=+2"
. This is mixing encoded and unencoded chars - there's just no way for a framework to guess which one is which. If we undo the change you're pointing at, existing code sending .queryParam("message", "hello+spring")
will be encoded to "hello%2Bspring"
and then decoded to "hello+spring"
. How are applications supposed to send a +
character then? Should we apply the same reasoning to /
and ?
then?
Please have a look at all issues linked to #19394. This is not the first time we're having this discussion. We're trying to avoid as much as possible complexity but we can't hide the fact that this is how URLs work.
Comment From: blaghed
Hi again @bclozel ,
Thank you very much for taking the time to explain further, it is very much appreciated. I do not mean to be contentious, but kindly consider the following counter-arguments.
We don't attempt to auto-detect if encoding is needed
Yes, perhaps "auto-detect" is a bit of a stretch on my side, apologies.
What I meant is that the "intent" to have the params encoded or not is declared via the .build(true|false)
portion of the URI construction.
After your mention of the additional characters, I took the liberty of experimenting with some other characters that are explicitely defined on RFC 3986, and got this result:
URI uri = UriComponentsBuilder.fromHttpUrl("https://example.org/test")
.queryParam("var", "+$!&:/?")
.encode()
.build()
.toUri();
assertThat(uri).asString()
.isEqualTo("https://example.org/test?var=+$!%26:/?");
With this we see that &
was encoded, even though it is in the exact same category as +
, !
and $
on the RFC, but these were not encoded.
My theory for this, and please correct me if I am mistaken, is that Spring realized that &
has the additional special effect of representing a break between query params, while !
and $
can be safely left in a raw state.
+
, in my humble opinion, should receive the same preferential treatment as we see with characters &
and =
, as it also has the special meaning of "blank space".
Decoding the value you've provided "1+%2b+1+%3D+2" yields "1+++1+=+2".
This is absolutely correct, if you look only at how your own decoder works.
However, how a URL works is that +
is a space, and therefore the actual decoded value (and this is what a server will receive) is 1 1 = 2
.
Comment From: bclozel
My theory for this, and please correct me if I am mistaken, is that Spring realized that & has the additional special effect of representing a break between query params, while ! and $ can be safely left in a raw state.
Now we're simply following the spec.
Comment From: blaghed
Hi @bclozel ,
I have to admit that I do not see which part of the spec you are referring to that states this is the appropriate behaviour, but fair enough that I may simply be failing to grasp it.
As you stated, it was already discussed in another thread when the change was originally contemplated. Since I disagree with the direction that discussion took, as I feel it failed to properly address this concern, I was attempting to re-discuss it here with a fresh perspective. Additionally, the current behaviour has been here for a while, and the law of inertia is strong in our hearts ;)
I apologize that I could not make my arguments in a convincing enough way, perhaps my perspective that "this is too clearly a bug" is making me run afoul and causing me to fail in properly conveying the message that would help you consider it from this side.
Regardless, thank you very much for the replies. Perhaps in the future, should you feel differently, then this can be tabled again. Cheers!
Comment From: bclozel
Thanks for the discussion!
I meant this:
query = *( pchar / "/" / "?" )
This lists characters that are allowed in the query parameters. The code you were pointing is strictly following that.
Comment From: blaghed
Sorry, but I still do not understand that point.
query = ( pchar / "/" / "?" ) pchar = unreserved / pct-encoded / sub-delims / ":" / "@" sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "" / "+" / "," / ";" / "="
However, &
and =
are encoded, while +
is not. Am I simply miss-reading how they are grouping the chars?
Comment From: bclozel
However, & and = are encoded, while + is not. Am I simply miss-reading how they are grouping the chars?
&
and =
are allowed in the query, but they have special meaning don't they? If we don't encode those, how could the server parse the query?
For example:
* with https://example.org?project=spring%26framework&version=6
query params are {"project":"spring&framework","version":"6"}
* with https://example.org?project=spring&framework&version=6
query params are {"project":"spring","framework"="","version":"6"}
Comment From: blaghed
& and = are allowed in the query, but they have special meaning don't they? If we don't encode those, how could the server parse the query?
Yes, exactly! And just like those, +
also has a special meaning, so I am trying to defend that it should be handled the same.
I believe there may have been some misinterpretation of the RFC at some point that led to +
being removed from the special status it had on Spring.
My own interpretation is that the RFC lists the characters that are allowed for the query component of a URL, and that by that it means http://example.org?<here>
, not http://example.org?<here>=<here>&<here>=<here>
.
So, in that list are included characters that have special handling, like &
, =
and +
.
Of course, all this means is that the characters are accepted, but the meaning behind them still has to be taken into consideration, including the treatment of +
as a blank space.
Sorry if that sounded too-high-energy, just a bit excited that we managed to discuss so far.
Comment From: bclozel
I believe there may have been some misinterpretation of the RFC at some point that led to + being removed from the special status it had on Spring.
This was a bug that was fixed in #19394, not special status. Sorry, but I don't think this discussion is going anywhere as there is no way to bring the behavior you'd want without bringing that issue back.
Comment From: blaghed
Hi,
Sorry, didn't meant to lose you there, just got a bit ahead of myself.
Discussing https://github.com/spring-projects/spring-framework/issues/19394 in particular then.
Please note that I respect that a fix was attempted for it, and value the time invested in doing so. My argumentation is simply that the fix is, unfortunately/accidentally/understandably, incorrect.
Additionally, when the point regarding +
was brought up in the discussion, the answer to justify doing this was not substantiated.
I don't mean to injure anyone with this, I've certainly made much worse coding errors so it is all very relatable. Please read the remainder of my argumentation as I try to explain my conclusion.
Notice the original report here:
'+' is valid for a space but with the following code:
java String httpUrl = "http://localhost:8080/test/print?value=%EA%B0%80+%EB%82%98"; URI uri = UriComponentsBuilder.fromHttpUrl(httpUrl).build(true).toUri(); System.out.println(uri);
I got the following error: ```java java.lang.IllegalArgumentException: Invalid character '+' for QUERY_PARAM in "%EA%B0%80+%EB%82%98"at org.springframework.web.util.HierarchicalUriComponents.verifyUriComponent(HierarchicalUriComponents.java:313) at org.springframework.web.util.HierarchicalUriComponents.verify(HierarchicalUriComponents.java:281) at org.springframework.web.util.HierarchicalUriComponents.
(HierarchicalUriComponents.java:90) at org.springframework.web.util.UriComponentsBuilder.build(UriComponentsBuilder.java:336) at learningtest.org.springframework.web.UriComponentBuilderTests.testFromHttpUrlBuildEncoded(UriComponentBuilderTests.java:19) ```
What is being reported here is that the verification of the URL is incorrect.
Furthermore, notice that they have build(true)
on their example, which is stating that it is already encoded, so +
should indeed be treated and accepted as an encoded "space" in this situation, and that verification should have passed.
So, is there a bug? Yes, absolutely!
As for the fix, what was decided upon was that +
will simply be dropped from consideration a special character entirely.
Here is where I believe it went wrong. The issue was not in the URL itself, but rather in the validation process that should have accepted +
as part of an encoded input.
So, at this point, +
can only ever be treated as a encoded space, even when the code explicitely states that the params are not yet encoded via build(false)
.
Then, during the discussion, someone accurately brought up +
as a concern, which was refuted by linking to RFC 3986.
However, this RFC doesn't actually specify a syntax on URI's "query" component, only the characters that are allowed.
For example, according RFC 3986, http://example.org?bar123
is just as valid as http://example.org?foo=1&bar=2
. The meaning we associate to &
, =
and +
are actually not a part of this RFC at all.
So, it is a fallacy to use this RFC as a justification to remove the special handling for any of these 3 characters, because it is not this RFC that makes them special at all.
Now would be a good time to link you the actual specification for the "&key=value" form we most commonly use, if only I could find it 🐑
However, I believe you can agree with me that, in Spring's "query" approach, you follow that specification, and as such should give the +
character the same treatment as you do &
, since it is part of your chosen special syntax.
To accomplish this, you already have the information flag present telling you if it was encoded at input or not, so the code should already clearly know if when I give input value var: 1+1
, do I mean var: 1%2b1
or do I mean var: 1 1
.
Comment From: blaghed
Hi @bclozel , I believe it is possible to solve this without bringing in any regression, but it may be a bit of an effort as some re-write is needed on validation side. I don't mind sinking in time on a PR to try and address it, but since this was closed so quickly I am a bit reluctant to do so as it seems the status-quo is fine for most. Let me know if you consider this to be an issue, and I'll try to provide a PR suggestion.
Otherwise, thanks anyway for the discussion. Cheers!
Comment From: ae-govau
FWIW I just spent the last few hours tracking this down. Interestingly we noted this after our update from Spring 2 -> 3 - even though it seems this odd behaviour has been around a while so I'm not sure why it has just hit us. We've changed our code to use the Apache URIBuilder which is pretty much a drop-in replacement for bit we want and seems to encode correctly.
Having read much of discussion above I'll throw a few dot-points in...
RFC 3986 section 2.2 lists '+' as a reserved character and goes on to state:
URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component.
This component of the URI is the query component which is in section 3.4, however as I think someone asked above, where is a spec for what a query string looks like? e.g. how are we sure that &
and =
are allowed by the URI scheme to represent data in that component?
The best reference I can find for this is not an RFC, but rather it's in the HTML spec, which kind of makes sense - as it came from the <form>
tag submission. https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1 states:
application/x-www-form-urlencoded
This is the default content type. Forms submitted with this content type must be encoded as follows:Control names and values are escaped. Space characters are replaced by
+', and then reserved characters are escaped as described in [[RFC1738]](https://www.w3.org/TR/html401/references.html#ref-RFC1738), section 2.2: Non-alphanumeric characters are replaced by
%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e.,%0D%0A'). The control names/values are listed in the order they appear in the document. The name is separated from the value by
=' and name/value pairs are separated from each other by `&'.
Note RFC1738 is an older (obsoleted) version of RFC3986.
(an even older reference from 1993 is here, although it doesn't contemplate +
encoding: https://www.w3.org/MarkUp/HTMLPlus/htmlplus_42.html)
So I think perhaps that answers the question as to why &
and =
are allowed unescaped within that overall query component (however of course must be encoded within a value).
We've switched to different library for this section, but wanted to chime in with support for the position that it is unhelpful to not encode +
since every query parser that I'm aware of will read that back in as a space, and there's not much point serialising data that will be deserialised incorrectly.