Affects: Any with UrlParser
The recently added org.springframework.web.util.UrlParser
is not spec compliant outside of the limited scope of modern browsers.
The living URL document at whatwg is incompatible with the IETF URI spec, Java itself, the Servlet spec, and various other non-browser use cases.
Users that want to use the new UrlParser
should not be using it for non-browser use cases (eg: REST clients).
Comment From: bclozel
Thanks for this feedback @joakime
Could you elaborate on the limitations of using this parser for REST clients? What kind of valid URLs wouldn't parsed correctly with this parser in the context of REST clients?
Thanks!
Comment From: joakime
Take for example a URL with no heir-part, or no authority.
http:///path.png
The UrlParser sees that as a host with path.png
Here's what java.net.URI does ...
$ jshell
| Welcome to JShell -- Version 17.0.11
| For an introduction type: /help intro
jshell> var uu = new URI("http:///path.png")
uu ==> http:///path.png
jshell> uu.getHost()
$2 ==> null
jshell> uu.getPath()
$3 ==> "/path.png"
Here's what java.net.URL does ...
$ jshell
| Welcome to JShell -- Version 17.0.11
| For an introduction type: /help intro
jshell> var uu = new URL("http:///path.png")
uu ==> http:/path.png
jshell> uu.getHost()
$2 ==> ""
jshell> uu.getPath()
$3 ==> "/path.png"
Those built-in parsers, along with the existing URL / URI parsers in various Servlet libraries follow the spec in RFC3986, which parses that as a URL with no authority, just a path.
WhatWG is good, if you want to follow browser behaviors, but not good outside of that limited scope.
There are more examples than just this, but just know that WhatWG isn't a great choice for the general internet behaviors, non-browser clients, http hardware, security tooling, caching servers, load balancers, etc ...
Comment From: joakime
There are many decisions in WhatWG that are designed to "clean up" bad behaviors from users, typos and whatnot (eg: eliminate duplicates, normalize away extra slashes, eliminate whitespace, etc)
Those are great choices for a browser dealing with HTML and Javascript, but not appropriate outside of a browser.
Comment From: rstoyanchev
Thanks for reaching out. This is an area we've been we've thinking about after CVE--2024-22243, CVE--2024-22259, and CVE--2024-22262 all of which relate to misaligned parsing choices between server and client.
For the specific example, the RFC syntax defines several choices for hier-part
. When there is an authority it starts with //
, but otherwise the path cannot begin with //
. So a URL without a host should be http:/path.png
, which is also confirmed by the algorithm in 5.3. So as far as I can see, http:///path.png
does not follow the syntax, and the question is how to parse it. There are many, many other similar ambiguous cases that arise in parsing, and the RFC has very little advice on how to parse. There is some in the section on comparing for equivalence, but nothing from the perspective of user typed or malicious input.
The CVE's highlighted this for us, and it's how we came across the WhatWG parsing algorithm backed by an extensive set of test cases, and providing an opportunity to align with browsers where security issues originate most of the time.
The extensive test cases shows just how massive the issue with parsing malicious input is, and the WhatWG algorithm provides a basis for alignment. There is little chance for clients and intermediaries to agree otherwise, and I'm sure they don't in many cases.
I'm interested in your feedback, but so far this looks like a good direction to us.
For documentation, I think it would be a challenge to advise to use UrlParser for browser clients only. It's not always easy to know the source for a URL that's being parsed, and even if it was known, it's still not possible to know what parsing choices the client would make, and how to align with that. We could mention that the algorithm is based on WhatWG and that it parses leniently many URLs that deviate from the spec and create ambiguity.
Comment From: joakime
For the specific example, the RFC syntax defines several choices for
hier-part
. When there is an authority it starts with//
, but otherwise the path cannot begin with//
. So a URL without a host should behttp:/path.png
, which is also confirmed by the algorithm in 5.3. So as far as I can see,http:///path.png
does not follow the syntax, and the question is how to parse it. There are many, many other similar ambiguous cases that arise in parsing, and the RFC has very little advice on how to parse. There is some in the section on comparing for equivalence, but nothing from the perspective of malicious input.
The RFC is very clear on this specific example, it is not ambiguous, or even a vague URI. That is a URI with a scheme, no-authority, and a path.
Ambiguous URIs do exist, and are documented in RFC3986, most of those are due to path section anomalies. (Anomalies that WhatWG do not address, but most servers do)
The CVE's highlighted this for us, and it's how we came across the WhatWG parsing algorithm backed by an extensive set of test cases, and providing an opportunity to align with browsers where security issues originate most of the time.
Interestingly, this topic, the differences in URI between the RFC (used by all internet hardware, nearly all internet servers, all non-browser user-agents) and WhatWG (used only by browsers) has come up recently on the ietf-http-wg
- https://lists.w3.org/Archives/Public/ietf-http-wg/2024JulSep/0281.html - see thread "Invalid Characters in URLs"
So far, only the Browsers vendors have advocated for WhatWG, the rest are still on RFC3986. There is some discussion that the RFC should be updated, but the lenient behaviors of the WhatWG are not desired on a Server side. The WhatWG document, being a living document, is also highly undesired from a spec point of view.
Comment From: rstoyanchev
The RFC is very clear on this specific example, it is not ambiguous
I think it is clear that it is invalid. I gave specific reasons. Could you clarify why you think otherwise?
Thanks for the pointer to the to ietf discussion. I will check it out.
Comment From: joakime
For the specific example, the RFC syntax defines several choices for hier-part. When there is an authority it starts with //, but otherwise the path cannot begin with //. So a URL without a host should be http:/path.png, which is also confirmed by the algorithm in 5.3.
The example you linked to RFC3986: Section 5.3 is for recomposition of a URI from components, not parsing it.
The example URI is http:///path.png
, and lets check the ABNF for it. - https://datatracker.ietf.org/doc/html/rfc3986#appendix-A
The overall ANBF is ...
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
The scheme
ANBF is defined as ...
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
... which makes the example URI scheme http
Next is the raw ":"
in the ABNF, that's obvious from the example URI.
Next is the hier-part
which has the ANBF of ...
hier-part = "//" authority path-abempty
/ path-absolute
/ path-rootless
/ path-empty
authority = [ userinfo "@" ] host [ ":" port ]
userinfo = *( unreserved / pct-encoded / sub-delims / ":" )
host = IP-literal / IPv4address / reg-name
port = *DIGIT
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
path-abempty = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-rootless = segment-nz *( "/" segment )
path-empty = 0<pchar>
segment = *pchar
segment-nz = 1*pchar
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
The example URI of http:///path.png
is triggering the hier-part
simply because it starts with the characters "//"
.
We are obligated to parse the authority
followed by one of the path-abempty
or path-absolute
or path-rootless
or path-empty
.
We now parse the authority
by the ABNF rules, it has no [ userinfo "@" ]
and no host
and no [ ":" port ]
, hence it is empty. (The hier-part
cannot contain a /
character per ANBF rules)
Next, in our example, we have passed the authority (it was empty), and now we have to parse the remaining characters "/path.png"
as either path-abempty
or path-absolute
or path-rootless
or path-empty
(we don't know which it is until we start looking at the characters).
In what we have left of our example to parse we have ...
* path-abempty
is a match.
* path-absolute
is a better match.
* path-noscheme
is a terrible match, as we have parsed a scheme already
* path-empty
is not a match as we have at least 1 pchar
So we have a parsed URI path of "/path.png"
.
There's no "?"
character, so no query.
There's no "#"
character, so no fragment.
The hier-part
starting with "//"
is the key.
If there was no hier-part
, you would have a URI without an authority and without a path too.
Eg: http:?a=b
which is valid for use when resolving against a base URI (this could change the scheme, and set a query, on the base URI)
Comment From: markt-asf
A collection of related thoughts.
Browser handling of URIs has been inconsistent every time I've looked at it.
While I have seen statements that there are issues with RFC 3986 (and hence the need for another URI spec) I haven't yet seen a valid example of any such issue.
The Servlet spec references (and Tomcat follows) RFC 3986. Tomcat does have relaxedPathChars
and relaxedQueryChars
but those are very much "use at your own risk" and users and expected to understand the consequences for their infrastructure of using those options.
In hindsight, I'm not sure I advocated for the correct approach to path parameters with the Servlet spec as it creates some issues with reverse proxies. It might have been better to say path parameters are not removed and are treated like the rest of the segment for mapping purposes. But we are where we are.
I would expect to see http:///path.png
rather than http:/path.png
.
Getting back to the original point, I'd agree that some sort of warning that a URL parser doesn't follow RFC 3986 is a good idea.
Comment From: joakime
Some other thoughts to try to convey in the javadoc ...
WhatWG URL document is good if you are writing a user-agent (especially so if you are dealing with user provided URLs from things like the Location bar, user edited configuration, HTML, and Javascript).
RFC3986 is what you use if you want to use internet protocols that use URL/URIs. (eg: HTTP)
The WhatWG URL document is not a spec, and is subject to wild changes as the living document updates, what works today is not guaranteed to work in 2 years time. (This is the reality of the WhatWG URL document over its history. Even the test cases have changed dramatically, what was once allowed is no longer, and vice versa).
IMO the WhatWG URL document is trying to do too much. It should have been split up into separate documents.
- How to deal with user provided URLs (this is the lenient parsing behaviors)
- How to normalize / cleanup / produce a clean URL. (this includes scheme, authority, paths, query, required encoding, etc)
- How to resolve a clean URL against a base URL.
The RFC3986 addresses points 2 and 3, but not 1. That's because point 1 is exclusive for dealing with user provided URLs (as a browser, you deal with these from the location bar, badly created HTML, javascript mistakes, etc) Once you enter into the realm of internet protocols, you should be working with proper URL / URIs and RFC3986.
Comment From: markt-asf
Only a minor point but having looked more carefully at RFC 3986 I have reached the conclusion the http:/path
and http:///path
are equivalent.
Comment From: rstoyanchev
For http:///path
, section 3 talks about an authority being present or not, but I had overlooked that the syntax allows authority to be an empty string. I see now that it conforms to RFC 3986.
Interestingly, this topic, the differences in URI between the RFC (used by all internet hardware, nearly all internet servers, all non-browser user-agents) and WhatWG (used only by browsers) has come up recently on the ietf-http-wg
https://lists.w3.org/Archives/Public/ietf-http-wg/2024JulSep/0281.html - see thread "Invalid Characters in URLs"
This is very useful context. Greg's comment in particular matches the concerns in https://github.com/spring-projects/spring-framework/issues/33542#issuecomment-2376744425.
WhatWG URL document is good if you are writing a user-agent
This is not always easy to separate. For our own purposes, we parse only in our clients (RestClient, WebClient, etc), and when processing forwarded headers. We have no knowledge where the string came from.
More broadly, our URI parsing can be used in applications, in other Spring projects like Spring Security, or any other framework that choose to use it. The URI string may have been passed through the query or the request body, it may be parsed for validation, and then used in a redirect or included in a response to a browser, which leads to security issues.
Given the lack of certainty on how the URI string was prepared, whether it is malicious or not, if you can't be 100% certain and you don't want to reject it, then you need to parse it leniently, and for that WhatWG at least offers something to align with. Other clients with strict parsing will reject invalid URL's, and it least it won't lead to SSRF or other attacks.
IMO the WhatWG URL document is trying to do too much. It should have been split up into separate documents.
That sounds reasonable to me, but at the moment it's the only alternative. I hope there is eventually some movement in this space.
We'll discuss this as a team to decide on the way forward. There is still some time before the 6.2 release, so thanks for the timely feedback. I'll share more on that.
Comment From: rstoyanchev
@joakime we are going to add an RFC-based parser, taking inspiration from Jetty's HttpUri
, which is more recent than java.net.URI
. Thanks for bringing this up.