Hello. I have some questions about web observation issues. We updated from Spring Boot 2.x to Spring Boot 3.1.4 and faced the problems with metrics. After investigating we found related issues in micrometer-metrics and spring-web (for example, https://github.com/spring-projects/spring-framework/issues/31417, https://github.com/spring-projects/spring-framework/issues/31388, https://github.com/spring-projects/spring-framework/issues/27587, https://github.com/micrometer-metrics/micrometer/issues/3874). Updating versions solved our problems.
But we found several places, where problems can arise (maybe we are wrong). Can you please have a look at them?
Thanks!
https://github.com/spring-projects/spring-framework/blob/489d18a1691d00c7487ec6034f7065da28fcb3b8/spring-web/src/main/java/org/springframework/web/server/adapter/HttpWebHandlerAdapter.java#L416-L426 - should we stop observation if response is null? (same in https://github.com/spring-projects/spring-framework/blob/489d18a1691d00c7487ec6034f7065da28fcb3b8/spring-web/src/main/java/org/springframework/web/filter/reactive/ServerHttpObservationFilter.java#L160)
https://github.com/spring-projects/spring-framework/blob/489d18a1691d00c7487ec6034f7065da28fcb3b8/spring-web/src/main/java/org/springframework/http/server/reactive/AbstractServerHttpResponse.java#L64 - volatile HttpStatusCode statusCode?
https://github.com/spring-projects/spring-framework/blob/489d18a1691d00c7487ec6034f7065da28fcb3b8/spring-web/src/main/java/org/springframework/http/server/reactive/AbstractServerHttpResponse.java#L75 - volatile HttpHeaders readOnlyHeaders?
https://github.com/spring-projects/spring-framework/blob/489d18a1691d00c7487ec6034f7065da28fcb3b8/spring-web/src/main/java/org/springframework/http/server/reactive/observation/ServerRequestObservationContext.java#L53-L55 - volatile String pathPattern, volatile boolean connectionAborted?
https://github.com/spring-projects/spring-framework/blob/489d18a1691d00c7487ec6034f7065da28fcb3b8/spring-web/src/main/java/org/springframework/http/server/reactive/observation/DefaultServerRequestObservationConvention.java#L162-L163 - same logic as in https://github.com/spring-projects/spring-framework/blob/489d18a1691d00c7487ec6034f7065da28fcb3b8/spring-web/src/main/java/org/springframework/http/server/reactive/observation/DefaultServerRequestObservationConvention.java#L119-L120?
Comment From: olga-larina
And I have a question about "active" metrics (for example, http.client.requests.active or http.server.requests.active). Should I create another issue for it?
Observation is started at the beginning of the request (server or client). And at this time LongTaskTimer for "xxx.active" metrics with tags is created. But at this time we have only default tag values (for example, http status code) from org.springframework.http.server.reactive.observation.DefaultServerRequestObservationConvention or org.springframework.web.reactive.function.client.DefaultClientRequestObservationConvention. And onStop tags in LongTaskTimer are not updated. Is it a problem? Or that is how it should be?
Related code on micrometer:
https://github.com/micrometer-metrics/micrometer/blob/23b6c43d9ce7a758dd5aa4620c776358f3a86039/micrometer-core/src/main/java/io/micrometer/core/instrument/observation/DefaultMeterObservationHandler.java#L72-L79
Comment From: snicoll
Thanks for getting in touch, but it feels like this is a question that would be better suited to Stack Overflow. As mentioned in the guidelines for contributing, we prefer to use the issue tracker only for bugs and enhancements. Feel free to update this issue with a link to the re-posted question (so that other people can find it) or add some more details if you feel this is a genuine bug.
If you intended to report an issue, then this should be one issue with instructions to reproduce. As it stands, this issue is not actionable.
Comment From: bclozel
Also see https://github.com/micrometer-metrics/micrometer/issues/5147
As Stéphane pointed out already, you can provide a sample application that demonstrates the problem, if there is one. As for volatile
members, it's only useful if those are read/written concurrently, which is not the case here. I don't think the previous generation of instrumentation was using those, did you encounter an issue in the past with it? Feel free to share a link to your question on StackOverflow and I'll have a look.
Comment From: olga-larina
This is the example which shows that xxx_active metrics are created only with default values from DefaultServerRequestObservationConvention and DefaultClientRequestObservationConvention (because they are created on start of the observation). (Use App.kt to start the application, TestCheck.kt to query test endpoint, curl localhost:8080/actuator/prometheus
to view metrics)
Also it shows that fields in AbstractServerHttpResponse and ServerRequestObservationContext can be accessed from different threads. This is reproduced by using publishOn in the custom filter (TestFilter), but without it some fields are also read/written from different threads. You can see it in debug mode. There has already been the concurrency issue in AbstractServerHttpResponse https://github.com/spring-projects/spring-framework/issues/27587, where commitActions were changed to CopyOnWriteArrayList.
If you call TestCheck with numThreads=100 and numRequests=10000 for example, the metrics with outcome="UNKNOWN" are created (numThreads and numRequests can vary). This is the same issue as https://github.com/spring-projects/spring-framework/issues/31388 for status code. But the changes were made only for status. That's why if the connection is aborted, outcome becomes UNKNOWN.
All of these points are reproduced in Spring boot 3.1.10 and 3.2.4.
I don`t know how to reproduce null response in https://github.com/spring-projects/spring-framework/blob/489d18a1691d00c7487ec6034f7065da28fcb3b8/spring-web/src/main/java/org/springframework/web/server/adapter/HttpWebHandlerAdapter.java#L416-L426 and https://github.com/spring-projects/spring-framework/blob/489d18a1691d00c7487ec6034f7065da28fcb3b8/spring-web/src/main/java/org/springframework/web/filter/reactive/ServerHttpObservationFilter.java#L160 But it looks like observation should be stopped here, because it is started in doFirst.
Can you please have a look? Or I should publish it on StackOverflow?
Comment From: bclozel
This is effectively conflating 3 different problems in a single issue, we'll try and work from here.
Comment From: bclozel
This is the example which shows that xxx_active metrics are created only with default values from DefaultServerRequestObservationConvention and DefaultClientRequestObservationConvention (because they are created on start of the observation). (Use App.kt to start the application, TestCheck.kt to query test endpoint, curl localhost:8080/actuator/prometheus to view metrics)
It's not ideal, but this is the expected behavior. Long task timers collect tags as soon as the obversation is started. In the case of HTTP server observations, nothing much is available at that point besides the HTTP request basics. The only way to collect more data would be to start the observation only when the request is mapped to a controller method - this means that we would not measure HTTP routing for regular timers - that would be incorrect in my opinion.
Micrometer made it possible to disable such timers and Spring Boot has a dedicated option for this. "Active" timers make sense in many cases, but I agree that here this has little value.
Comment From: bclozel
Also it shows that fields in AbstractServerHttpResponse and ServerRequestObservationContext can be accessed from different threads. This is reproduced by using publishOn in the custom filter (TestFilter), but without it some fields are also read/written from different threads. You can see it in debug mode. There has already been the concurrency issue in AbstractServerHttpResponse https://github.com/spring-projects/spring-framework/issues/27587, where commitActions were changed to CopyOnWriteArrayList.
It's expected to have objects read/written from different threads as this is the concurrency model chosen by our Reactive stack. State should not be read/written concurrently by different threads though.
27587 did not solve a concurrency problem, see https://github.com/spring-projects/spring-framework/issues/27587#issuecomment-1772644969
31388 did not solve a concurrency issue either, we have just refined the "UNKNOWN" outcome to be only triggered if the response was not fully written to the network. This reduces the amount of "false positives" in metrics. As you can see, the fix did not change concurrency visibility for state nor switch to a concurrent collection implementation.
If you call TestCheck with numThreads=100 and numRequests=10000 for example, the metrics with outcome="UNKNOWN" are created (numThreads and numRequests can vary). This is the same issue as https://github.com/spring-projects/spring-framework/issues/31388 for status code. But the changes were made only for status. That's why if the connection is aborted, outcome becomes UNKNOWN.
Indeed, it seems that using a WebFilter
like the following changes the behavior:
@Component
class TestFilter: WebFilter {
private val executorService = Executors.newFixedThreadPool(2)
private val scheduler = Schedulers.fromExecutor(executorService)
override fun filter(exchange: ServerWebExchange, chain: WebFilterChain): Mono<Void> {
return chain.filter(exchange).publishOn(scheduler)
}
}
I don't believe this is due to concurrent modifications though. In debug mode, I can see that org.springframework.web.server.adapter.HttpWebHandlerAdapter.ObservationSignalListener#doOnCancel
is being called. This means a CANCEL signal is flowing through and we interpret that as the HTTP client closing the connection before the server had a chance to complete the exchange. In that sense, adding volatile
to various places would not fix anything as this is not due to concurrent read/writes of state. The observation convention is not reading outdated state, the observation context is rightfully updated because of a signal - the observation is stopped right after on the same thread and metrics information is collected right away.
Scheduling the handling of the HTTP exchange on an executor (and not on the Netty workers) is quite surprising to me and this could be the cause of the problem here, especially on a bounded executor. Maybe @simonbasle or @rstoyanchev have an opinion about this?
Comment From: bclozel
Now about the last point in this issue
I don't know how to reproduce null response in https://github.com/spring-projects/spring-framework/blob/489d18a1691d00c7487ec6034f7065da28fcb3b8/spring-web/src/main/java/org/springframework/web/server/adapter/HttpWebHandlerAdapter.java#L416-L426 and https://github.com/spring-projects/spring-framework/blob/489d18a1691d00c7487ec6034f7065da28fcb3b8/spring-web/src/main/java/org/springframework/web/filter/reactive/ServerHttpObservationFilter.java#L160
But it looks like observation should be stopped here, because it is started in doFirst.
This is a nullability refinement introduced in https://github.com/spring-projects/spring-framework/commit/c531a8a7058d24a110b312cd9a7a8bcf306845af#diff-564f3e4af414eeaaef0b11ac1d0394443b248683e5000bde0b225576ab991c8c. Effectively, at this point, if the observation has been started, the response cannot be null. So unless you can demonstrate a concrete use case where a problem happens with this, there is nothing to be fixed in my opinion.
Comment From: olga-larina
31388 did not solve a concurrency issue either, we have just refined the "UNKNOWN" outcome to be only triggered if the response was not fully written to the network. This reduces the amount of "false positives" in metrics. As you can see, the fix did not change concurrency visibility for state nor switch to a concurrent collection implementation.
Yes, it did not solve a concurrency issue. This issue fixed UNKNOWN status. But the changes were made only for status (not outcome). That's why if the connection is aborted, outcome becomes UNKNOWN. Is it ok?
Comment From: bclozel
Yes, it did not solve a concurrency issue. This issue fixed UNKNOWN status. But the changes were made only for status (not outcome). That's why if the connection is aborted, outcome becomes UNKNOWN. Is it ok?
No, it's the opposite. If the server receives a CANCEL signal, this means the connection has been closed/aborted by the client. As a result, we mark the observation context as "connection aborted" and this is later reported as "outcome unknown" in the observation. #31388 merely reduced the number of those by being more lenient: we consider that "outcome: UNKNOWN" should happen when the connection is closed by the client and the response has not been flushed already. In your case, your TestFilter
my trigger this case very often because of how it's scheduled. It doesn't mean the information is wrong, though.
Comment From: olga-larina
No, it's the opposite. If the server receives a CANCEL signal, this means the connection has been closed/aborted by the client. As a result, we mark the observation context as "connection aborted" and this is later reported as "outcome unknown" in the observation. #31388 merely reduced the number of those by being more lenient: we consider that "outcome: UNKNOWN" should happen when the connection is closed by the client and the response has not been flushed already. In your case, your
TestFilter
my trigger this case very often because of how it's scheduled. It doesn't mean the information is wrong, though.
This logic was applied only for status, not outcome. Outcome is always unknown for aborted context. https://github.com/spring-projects/spring-framework/blob/e9fcb21d55c586062b7b74798a23e0dd283f22d0/spring-web/src/main/java/org/springframework/http/server/reactive/observation/DefaultServerRequestObservationConvention.java#L110-L117 https://github.com/spring-projects/spring-framework/blob/e9fcb21d55c586062b7b74798a23e0dd283f22d0/spring-web/src/main/java/org/springframework/http/server/reactive/observation/DefaultServerRequestObservationConvention.java#L153-L160
Comment From: bclozel
Sorry, in my previous comments I should have mentioned "status: UNKNOWN" instead of "outcome: UNKNOWN". This is what #31388 was about. If the response is committed, it means that we're confident that the response status would be what it is (so, not UNKNOWN) even if the client does not receive the response.
Now, you're asking about making the "ouctome" not "UNKNOWN" in those cases. If the connection is aborted, this can mean that: 1. the client read the response or part of the response and bailed out before the server had a chance to close the connection as it should 2. there was a network problem and the client could not receive the response entirely
From the server point of view, we cannot know which case it is. What "outcome" value would you use for this case? 1) would be "SUCCESSFUL", but 2) would definitely be a "FAILURE". Please advise.
Comment From: olga-larina
On the one hand it seems that "outcome" should correspond "status", because sometimes we receive the metrics with status=200 and outcome=UNKNOWN (in the plain code, without such filters as in the example). But on the other hand it should be unknown (and it may be useful to have this value).
Thank you very much for investigating my issue and explaining the details.
Comment From: bclozel
If I'm not mistaken, we have discussed and solved all questions here.