In some specific cases, HTTP clients can eagerly close connections with the server right after the response has been received, but before the HTTP exchange is considered complete. This can result in duplicate observations being recorded, one with the "SUCCESSFUL" outcome and another one with the "UNKNOWN" outcome (as the exchange was cancelled).
This can be reproduced with the following:
- a web handler that introduces some delay and schedules the handling on a different thread
- a client that closes connections eagerly (for example, this has been reproduced locally with Apache Bench, at a 1/10K ratio)
- a low latency setup
"Duplicate metrics" can be explained by:
- both "COMPLETE" and "CANCEL" reactive streams signals racing
- the reactive streams spec not preventing this case, so `doOnCancel` can be called after `doOnTerminate`
- the Micrometer `Observation` API not preventing multiple `Observation#stop()` calls; this means observation handlers are called multiple times
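
As a rough illustration of the last point, here is a minimal plain-Java model of a non-idempotent `stop()`. `RecordingObservation` is a hypothetical stand-in, not the Micrometer `Observation` type, and the two calls in `main` simply simulate a COMPLETE signal followed by a late CANCEL signal for the same exchange.

```java
import java.util.ArrayList;
import java.util.List;

// Simulates COMPLETE followed by a late CANCEL on an unguarded observation.
class UnguardedObservationDemo {
    public static void main(String[] args) {
        RecordingObservation observation = new RecordingObservation();

        // COMPLETE signal: the exchange finished normally.
        observation.outcome("SUCCESSFUL");
        observation.stop();

        // Late CANCEL signal: the client closed the connection eagerly.
        observation.outcome("UNKNOWN");
        observation.stop();

        // Both stop() calls were recorded, so two entries are printed.
        System.out.println(observation.recorded());
    }
}

// Hypothetical stand-in for an observation: every stop() call notifies
// the "handlers" (modelled here as a list of recorded outcomes).
class RecordingObservation {
    private final List<String> recorded = new ArrayList<>();
    private String outcome = "UNKNOWN";

    void outcome(String outcome) {
        this.outcome = outcome;
    }

    void stop() {
        // No guard: nothing prevents a second call from recording again.
        recorded.add(outcome);
    }

    List<String> recorded() {
        return recorded;
    }
}
```

Running this model prints two outcomes for a single exchange, which is the "duplicate metrics" symptom in miniature.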
We cannot ignore CANCEL signals in our instrumentation, as this would leak started observations and would not count all valid cases of cancellations. We should instead refine our instrumentation with the `Mono#tap` operator and locally guard against this case.
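
One possible shape for such a local guard, sketched in plain Java with no Reactor or Micrometer dependency: an atomic flag lets only the first terminal signal stop the observation. `StopOnceGuard`, `onComplete` and `onCancel` are hypothetical names chosen for this sketch. It is not the actual Spring Framework change, and in the real instrumentation the equivalent check would presumably live in the listener registered through `Mono#tap`.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Simulates COMPLETE followed by a late CANCEL on a guarded observation.
class GuardedObservationDemo {
    public static void main(String[] args) {
        List<String> recorded = new CopyOnWriteArrayList<>();
        StopOnceGuard guard = new StopOnceGuard(recorded::add);

        guard.onComplete(); // first terminal signal wins
        guard.onCancel();   // late CANCEL is ignored by the guard

        System.out.println(recorded);
    }
}

// Hypothetical guard: whichever of COMPLETE or CANCEL arrives first
// stops the observation; any later signal is a no-op.
class StopOnceGuard {
    private final AtomicBoolean stopped = new AtomicBoolean();
    private final Consumer<String> recorder;

    StopOnceGuard(Consumer<String> recorder) {
        this.recorder = recorder;
    }

    void onComplete() {
        stopOnce("SUCCESSFUL");
    }

    void onCancel() {
        // CANCEL is still handled, so a cancelled exchange that never
        // completed is recorded and the observation does not leak.
        stopOnce("UNKNOWN");
    }

    private void stopOnce(String outcome) {
        // compareAndSet makes the check-and-flip atomic, so two signals
        // racing on different threads cannot both pass the guard.
        if (stopped.compareAndSet(false, true)) {
            recorder.accept(outcome);
        }
    }
}
```

With this approach a CANCEL that arrives on its own is still recorded as "UNKNOWN", while a CANCEL that races with or follows COMPLETE no longer produces a second observation.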
This needs to be applied to the reactive `ServerHttpObservationFilter` and to the `HttpWebHandlerAdapter` instrumentation that replaces it in Spring Framework 6.1.