Version: Spring Boot 2.6.3.RELEASE with Webflux & Metrics
Scenario - Client aborts the connection before server sends a response. - Server request processing takes forever (server configured with read/write timeout). Server closes the connection
In both the above cases, metrics are generated. However they are stored with status = 200 implying success. Eg:
http_server_requests_seconds{app="test-v0",cid="",exception="CancelledServerWebExchangeException",host="xxx",method="GET",outcome="UNKNOWN",springBoot="2",status="200",uri="/slow",ver="0.0.1",quantile="0.5",} 0.872415232
Please see earlier issue https://github.com/spring-projects/spring-boot/issues/23606 for reference source code.
The correct status code for this situation is clearly not well defined. However by making it 200, the result looks like success if we monitor fot HTTP Status code. Something from the 400 series would suit this better.
Comment From: bclozel
The HTTP status doesn't imply success or failure from an application nor a client perspective, it just points to the HTTP response status. If the status was not set when the connection was interrupted, then we can only imply the default: 200 OK.
Note that the outcome tag is set to "UNKNOWN", which is really the case here: we don't know if the server took longer than the configured timeout on the client side, or if the client didn't need the response after all. This can happen a lot with scatter/gather use cases with WebClient.
Here's the problem with setting HTTP 400 here:
- if the client has a really short timeout or cancelled because it didn't need the info after all, should we still count this as an invalid request?
- if server takes too long, maybe we would have hit a server timeout and a 500 status instead?
- the exception and outcome tags already describe quite well the situation: client disconnected and we don't know why
As discussed in https://github.com/spring-projects/spring-boot/issues/18444#issuecomment-541181924, there is no strong agreement there on the metrics side of things, and "UNKNOWN" is the recommendation so far. As for the HTTP status itself, I don't think we should change its meaning - we should either take what was set or leave the default.
Comment From: jtorkkel
In metrics / Grafana we normally use only "status", outcome is not used as "overlapping". outcome was added only in springboot2. In spingboot1 outcome was missing, at least from initial versions. Using both complicates all the queries because you must filter outcome="UNKNOWN", from success queries. To avoid extra filtering any status=4xx error code eliminate need to update tens of dashboards.
Comment From: bclozel
I completely understand the problem with dashboards, unfortunately there is no way to solve that without breaking other apps or making tags up from unknown data. I'm closing this issue for now, feel free to comment it with new information.
Comment From: s-volkov-1
Hello, I described related problem in stackoverflow post: https://stackoverflow.com/questions/69913027/webflux-cancelledserverwebexchangeexception-appears-in-metrics-for-seemingly-no TLDR: up to ~10% of requests to my server end up with CancelledServerWebExchangeException, according to prometheus metrics. But from the client perspective, everything with these requests looks fine.
Comment From: s-volkov-1
@bclozel could you have a look?