Affects: 5.1.15.RELEASE
Summary of issue
We sporadically see issues with "stuck threads" in a WebFlux service, where one thread becomes blocked while trying to read the request body and all further incoming requests that get handled by that particular thread are suspended indefinitely (until the service eventually OOMs or is restarted).
We are not able to reproduce the issue on demand, but it recurs regularly with no observable trigger. The time between recurrences ranges from three days to a month and a half, which makes it difficult to diagnose.
Set up
Versions
- Spring: 5.1.15.RELEASE
- Jetty: 9.4.29.v20200521
- Java: OpenJDK Runtime Environment Corretto-11.0.8.10.1
Application configuration
We have a simple WebFlux reactive service that does the following:
- Given an incoming POST request with specific fields in the JSON request body, it will:
  - Either perform a computation based on the body fields and return a response, or
  - Make an outgoing `WebClient` call to retrieve additional data from another service and then perform the same computation.
- The service uses a `@RestController` style controller.
- The service has a WebFilter that verifies and logs data about the incoming request's headers before it is processed by the controller.
- The WebFilter uses `publishOn(Schedulers.parallel())`. (The particular work doesn't need to be scheduled off the main thread, but it is. This may be important later...) A simplified sketch of such a filter is shown after this list.
- The parallel scheduler was created with three threads due to the CPU configuration.
- Requests are made to this service from a Node.js application with a 300ms timeout.
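To make the setup concrete, here is a minimal, hypothetical sketch of what such a filter might look like (the class name, header handling, and exact placement of `publishOn` are illustrative assumptions, not our actual code):

```java
import org.springframework.web.server.ServerWebExchange;
import org.springframework.web.server.WebFilter;
import org.springframework.web.server.WebFilterChain;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

// Illustrative only: verifies/logs the request headers, hopping onto the
// parallel scheduler first, then continues the filter chain.
class HeaderLoggingFilter implements WebFilter {

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
        return Mono.just(exchange.getRequest().getHeaders())
                .publishOn(Schedulers.parallel())   // work hops onto parallel-1/2/3
                .doOnNext(headers -> {
                    // verify and log header data here
                })
                .then(chain.filter(exchange));
    }
}
```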
Observed symptoms
- The issue first manifests as an increasing number of requests timing out from the client.
- Looking at the server, we see a monotonically increasing number of suspended requests with a corresponding increase in open file descriptors, the volume of which matches the volume of timed-out requests on the client side.
- Metrics show that 1/3rd of requests are getting suspended prior to making it to the controller, while the other 2/3rds of requests continue to be processed as expected.
- In the logs we see that the requests that continue to be processed are all handled on the `parallel-2` and `parallel-3` threads.
- There is a distinct absence of requests being handled on the `parallel-1` thread after the start of the "stalling" issue.
- Running `lsof` on the server while it is in the stalled state, we see a significant number of connections to the server stuck in the `CLOSE_WAIT` state, which matches the volume of timed-out client requests as well as the volume of suspended requests on the server.
Further investigation
- By connecting to the running service in the stalled state via JMX and VisualVM, we see that the `parallel-2` and `parallel-3` threads continue to process incoming requests, while the `parallel-1` thread is stuck in the `WAITING` state.
- According to the thread dump, the thread is blocked waiting for content in `ServletServerHttpRequest`:
java.lang.Thread.State: WAITING
at java.base@11.0.8/java.lang.Object.wait(Native Method)
- waiting on <4840a3f> (a java.util.ArrayDeque)
at java.base@11.0.8/java.lang.Object.wait(Object.java:328)
at app//org.eclipse.jetty.server.HttpInput.blockForContent(HttpInput.java:568)
at app//org.eclipse.jetty.server.HttpInput$1.blockForContent(HttpInput.java:1098)
at app//org.eclipse.jetty.server.HttpInput.read(HttpInput.java:318)
at java.base@11.0.8/java.io.InputStream.read(InputStream.java:205)
at app//org.springframework.http.server.reactive.ServletServerHttpRequest.readFromInputStream(ServletServerHttpRequest.java:211)
at app//org.springframework.http.server.reactive.ServletServerHttpRequest$RequestBodyPublisher.read(ServletServerHttpRequest.java:290)
at app//org.springframework.http.server.reactive.ServletServerHttpRequest$RequestBodyPublisher.read(ServletServerHttpRequest.java:266)
...
(Full trace provided in a comment below; a heap dump is available on request.)
Hypothesis
Our hypothesis is that something has caused one of the parallel threads to get blocked while reading the request body. While this thread is blocked, incoming requests begin to get suspended waiting for this thread to free up and process them. This causes the increase in suspended requests (and client-side timeouts as requests never get processed).
What we have not been able to determine is what condition causes the thread to block on the request body in the first place.
Further details
We have since upgraded the service to the latest releases of Spring and Jetty (5.2.6.RELEASE and 9.4.30.v20200611). We have not seen the issue recur after one day, but given the wide range of time between sporadic recurrences we cannot say for certain whether this resolves the issue or not.
Comment From: scottjohnson
Here is the full thread dump as mentioned above. We can also provide a heap dump taken during the "stalled out" state if desired.
"parallel-1" - Thread t@79
java.lang.Thread.State: WAITING
at java.base@11.0.8/java.lang.Object.wait(Native Method)
- waiting on <4840a3f> (a java.util.ArrayDeque)
at java.base@11.0.8/java.lang.Object.wait(Object.java:328)
at app//org.eclipse.jetty.server.HttpInput.blockForContent(HttpInput.java:568)
at app//org.eclipse.jetty.server.HttpInput$1.blockForContent(HttpInput.java:1098)
at app//org.eclipse.jetty.server.HttpInput.read(HttpInput.java:318)
at java.base@11.0.8/java.io.InputStream.read(InputStream.java:205)
at app//org.springframework.http.server.reactive.ServletServerHttpRequest.readFromInputStream(ServletServerHttpRequest.java:211)
at app//org.springframework.http.server.reactive.ServletServerHttpRequest$RequestBodyPublisher.read(ServletServerHttpRequest.java:290)
at app//org.springframework.http.server.reactive.ServletServerHttpRequest$RequestBodyPublisher.read(ServletServerHttpRequest.java:266)
at app//org.springframework.http.server.reactive.AbstractListenerReadPublisher.readAndPublish(AbstractListenerReadPublisher.java:186)
at app//org.springframework.http.server.reactive.AbstractListenerReadPublisher.access$1100(AbstractListenerReadPublisher.java:48)
at app//org.springframework.http.server.reactive.AbstractListenerReadPublisher$State$4.onDataAvailable(AbstractListenerReadPublisher.java:374)
at app//org.springframework.http.server.reactive.AbstractListenerReadPublisher.onDataAvailable(AbstractListenerReadPublisher.java:118)
at app//org.springframework.http.server.reactive.ServletServerHttpRequest$RequestBodyPublisher.checkOnDataAvailable(ServletServerHttpRequest.java:282)
at app//org.springframework.http.server.reactive.AbstractListenerReadPublisher.changeToDemandState(AbstractListenerReadPublisher.java:222)
at app//org.springframework.http.server.reactive.AbstractListenerReadPublisher.access$1000(AbstractListenerReadPublisher.java:48)
at app//org.springframework.http.server.reactive.AbstractListenerReadPublisher$State$2.request(AbstractListenerReadPublisher.java:333)
at app//org.springframework.http.server.reactive.AbstractListenerReadPublisher$ReadSubscription.request(AbstractListenerReadPublisher.java:260)
at app//reactor.core.publisher.FluxMapSignal$FluxMapSignalSubscriber.request(FluxMapSignal.java:225)
at app//reactor.core.publisher.FluxFlatMap$FlatMapMain.onSubscribe(FluxFlatMap.java:366)
at app//reactor.core.publisher.FluxMapSignal$FluxMapSignalSubscriber.onSubscribe(FluxMapSignal.java:115)
at app//com.opentable.conservedheaders.reactive.MdcContextSynchronizer.onSubscribe(MdcContextSynchronizer.java:36)
at app//org.springframework.http.server.reactive.AbstractListenerReadPublisher$State$1.subscribe(AbstractListenerReadPublisher.java:301)
at app//org.springframework.http.server.reactive.AbstractListenerReadPublisher.subscribe(AbstractListenerReadPublisher.java:105)
at app//reactor.core.publisher.FluxSource.subscribe(FluxSource.java:52)
at app//reactor.core.publisher.FluxLift.subscribe(FluxLift.java:54)
at app//reactor.core.publisher.FluxMapSignal.subscribe(FluxMapSignal.java:69)
at app//reactor.core.publisher.FluxFlatMap.subscribe(FluxFlatMap.java:97)
at app//reactor.core.publisher.FluxLift.subscribe(FluxLift.java:54)
at app//reactor.core.publisher.FluxFlatMap.subscribe(FluxFlatMap.java:97)
at app//reactor.core.publisher.FluxLift.subscribe(FluxLift.java:54)
at app//reactor.core.publisher.MonoSingle.subscribe(MonoSingle.java:58)
at app//reactor.core.publisher.MonoLiftFuseable.subscribe(MonoLiftFuseable.java:63)
at app//reactor.core.publisher.MonoOnErrorResume.subscribe(MonoOnErrorResume.java:44)
at app//reactor.core.publisher.MonoLift.subscribe(MonoLift.java:46)
at app//reactor.core.publisher.MonoSwitchIfEmpty.subscribe(MonoSwitchIfEmpty.java:44)
at app//reactor.core.publisher.MonoLift.subscribe(MonoLift.java:46)
at app//reactor.core.publisher.MonoDefaultIfEmpty.subscribe(MonoDefaultIfEmpty.java:37)
at app//reactor.core.publisher.MonoLift.subscribe(MonoLift.java:46)
at app//reactor.core.publisher.MonoPeek.subscribe(MonoPeek.java:71)
at app//reactor.core.publisher.MonoLift.subscribe(MonoLift.java:46)
at app//reactor.core.publisher.Mono.subscribe(Mono.java:3879)
at app//reactor.core.publisher.MonoZip.subscribe(MonoZip.java:128)
at app//reactor.core.publisher.MonoLift.subscribe(MonoLift.java:46)
at app//reactor.core.publisher.MonoFlatMap.subscribe(MonoFlatMap.java:60)
at app//reactor.core.publisher.MonoLiftFuseable.subscribe(MonoLiftFuseable.java:63)
at app//reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:52)
at app//reactor.core.publisher.MonoLift.subscribe(MonoLift.java:46)
at app//reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.drain(MonoIgnoreThen.java:153)
at app//reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:56)
at app//reactor.core.publisher.MonoLiftFuseable.subscribe(MonoLiftFuseable.java:63)
at app//reactor.core.publisher.MonoPeekFuseable.subscribe(MonoPeekFuseable.java:74)
at app//reactor.core.publisher.MonoLiftFuseable.subscribe(MonoLiftFuseable.java:63)
at app//reactor.core.publisher.MonoPeekFuseable.subscribe(MonoPeekFuseable.java:74)
at app//reactor.core.publisher.MonoLiftFuseable.subscribe(MonoLiftFuseable.java:63)
at app//reactor.core.publisher.MonoOnErrorResume.subscribe(MonoOnErrorResume.java:44)
at app//reactor.core.publisher.MonoLift.subscribe(MonoLift.java:46)
at app//reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:150)
at app//com.opentable.conservedheaders.reactive.MdcContextSynchronizer.onNext(MdcContextSynchronizer.java:42)
at app//reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber.onNext(FluxSwitchIfEmpty.java:67)
at app//com.opentable.conservedheaders.reactive.MdcContextSynchronizer.onNext(MdcContextSynchronizer.java:42)
at app//reactor.core.publisher.MonoNext$NextSubscriber.onNext(MonoNext.java:76)
at app//com.opentable.conservedheaders.reactive.MdcContextSynchronizer.onNext(MdcContextSynchronizer.java:42)
at app//reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.innerNext(FluxConcatMap.java:275)
at app//reactor.core.publisher.FluxConcatMap$ConcatMapInner.onNext(FluxConcatMap.java:849)
at app//com.opentable.conservedheaders.reactive.MdcContextSynchronizer.onNext(MdcContextSynchronizer.java:42)
at app//reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:121)
at app//com.opentable.conservedheaders.reactive.MdcContextSynchronizer.onNext(MdcContextSynchronizer.java:42)
at app//reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:127)
at app//reactor.core.publisher.Operators$ScalarSubscription.request(Operators.java:2185)
at app//reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.request(FluxHide.java:142)
at app//reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.request(FluxMapFuseable.java:162)
at app//reactor.core.publisher.Operators$MultiSubscriptionSubscriber.set(Operators.java:1993)
at app//reactor.core.publisher.Operators$MultiSubscriptionSubscriber.onSubscribe(Operators.java:1867)
at app//com.opentable.conservedheaders.reactive.MdcContextSynchronizer.onSubscribe(MdcContextSynchronizer.java:36)
at app//reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onSubscribe(FluxMapFuseable.java:90)
at app//com.opentable.conservedheaders.reactive.MdcContextSynchronizer.onSubscribe(MdcContextSynchronizer.java:36)
at app//reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onSubscribe(FluxHide.java:113)
at app//reactor.core.publisher.MonoJust.subscribe(MonoJust.java:54)
at app//reactor.core.publisher.MonoLiftFuseable.subscribe(MonoLiftFuseable.java:63)
at app//reactor.core.publisher.MonoMapFuseable.subscribe(MonoMapFuseable.java:59)
at app//reactor.core.publisher.MonoLiftFuseable.subscribe(MonoLiftFuseable.java:63)
at app//reactor.core.publisher.Mono.subscribe(Mono.java:3879)
at app//reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.drain(FluxConcatMap.java:442)
at app//reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.onSubscribe(FluxConcatMap.java:212)
at app//com.opentable.conservedheaders.reactive.MdcContextSynchronizer.onSubscribe(MdcContextSynchronizer.java:36)
at app//reactor.core.publisher.FluxIterable.subscribe(FluxIterable.java:161)
at app//reactor.core.publisher.FluxIterable.subscribe(FluxIterable.java:86)
at app//reactor.core.publisher.FluxLiftFuseable.subscribe(FluxLiftFuseable.java:78)
at app//reactor.core.publisher.FluxConcatMap.subscribe(FluxConcatMap.java:121)
at app//reactor.core.publisher.FluxLift.subscribe(FluxLift.java:54)
at app//reactor.core.publisher.MonoNext.subscribe(MonoNext.java:40)
at app//reactor.core.publisher.MonoLift.subscribe(MonoLift.java:46)
at app//reactor.core.publisher.MonoSwitchIfEmpty.subscribe(MonoSwitchIfEmpty.java:44)
at app//reactor.core.publisher.MonoLift.subscribe(MonoLift.java:46)
at app//reactor.core.publisher.MonoFlatMap.subscribe(MonoFlatMap.java:60)
at app//reactor.core.publisher.MonoLiftFuseable.subscribe(MonoLiftFuseable.java:63)
at app//reactor.core.publisher.MonoFlatMap.subscribe(MonoFlatMap.java:60)
at app//reactor.core.publisher.MonoLiftFuseable.subscribe(MonoLiftFuseable.java:63)
at app//reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:52)
at app//reactor.core.publisher.MonoLift.subscribe(MonoLift.java:46)
at app//reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.drain(MonoIgnoreThen.java:153)
at app//reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.ignoreDone(MonoIgnoreThen.java:190)
at app//reactor.core.publisher.MonoIgnoreThen$ThenIgnoreInner.onComplete(MonoIgnoreThen.java:240)
at app//com.opentable.conservedheaders.reactive.MdcContextSynchronizer.onComplete(MdcContextSynchronizer.java:53)
at app//reactor.core.publisher.FluxPeekFuseable$PeekFuseableSubscriber.onComplete(FluxPeekFuseable.java:271)
at app//com.opentable.conservedheaders.reactive.MdcContextSynchronizer.onComplete(MdcContextSynchronizer.java:53)
at app//reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onComplete(FluxHide.java:137)
at app//reactor.core.publisher.FluxPeekFuseable$PeekFuseableSubscriber.onComplete(FluxPeekFuseable.java:271)
at app//com.opentable.conservedheaders.reactive.MdcContextSynchronizer.onComplete(MdcContextSynchronizer.java:53)
at app//reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onComplete(FluxHide.java:137)
at app//reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1624)
at app//reactor.core.publisher.MonoFlatMap$FlatMapInner.onNext(MonoFlatMap.java:241)
at app//com.opentable.conservedheaders.reactive.MdcContextSynchronizer.onNext(MdcContextSynchronizer.java:42)
at app//reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1623)
at app//reactor.core.publisher.MonoCompletionStage.lambda$subscribe$0(MonoCompletionStage.java:82)
at app//reactor.core.publisher.MonoCompletionStage$$Lambda$970/0x0000000100785c40.accept(Unknown Source)
at java.base@11.0.8/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
at java.base@11.0.8/java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:883)
at java.base@11.0.8/java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2251)
at java.base@11.0.8/java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:143)
at app//reactor.core.publisher.MonoCompletionStage.subscribe(MonoCompletionStage.java:60)
at app//reactor.core.publisher.MonoLiftFuseable.subscribe(MonoLiftFuseable.java:63)
at app//reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:150)
at app//com.opentable.conservedheaders.reactive.MdcContextSynchronizer.onNext(MdcContextSynchronizer.java:42)
at app//reactor.core.publisher.MonoPublishOn$PublishOnSubscriber.run(MonoPublishOn.java:178)
at app//reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:50)
at app//reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:27)
at java.base@11.0.8/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base@11.0.8/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base@11.0.8/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base@11.0.8/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base@11.0.8/java.lang.Thread.run(Thread.java:834)
Locked ownable synchronizers:
- locked <2c37e796> (a java.util.concurrent.ThreadPoolExecutor$Worker)
Comment From: rstoyanchev
The issue is probably compounded by the use of `Schedulers.parallel()`, depending on how the `ExecutorService` queues tasks, but even so there is a small number of threads and they'll run out quickly if more get blocked.
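For illustration only, a toy Reactor snippet (not from the service) showing how work queues up behind a blocked parallel worker, assuming the scheduler assigns subscriptions to its single-threaded workers round-robin: roughly a third of the tasks land on the blocked worker and never run, mirroring the observed symptom.

```java
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Scheduler;
import reactor.core.scheduler.Schedulers;

public class BlockedWorkerDemo {
    public static void main(String[] args) throws InterruptedException {
        Scheduler parallel = Schedulers.newParallel("parallel", 3);

        // One task blocks its single-threaded worker indefinitely.
        Mono.fromRunnable(() -> {
            try { Thread.sleep(Long.MAX_VALUE); } catch (InterruptedException ignored) { }
        }).subscribeOn(parallel).subscribe();

        // Subsequent tasks assigned to that same worker stay queued forever,
        // while the other two workers keep processing as normal.
        for (int i = 0; i < 6; i++) {
            int id = i;
            Mono.fromRunnable(() ->
                    System.out.println("task " + id + " on " + Thread.currentThread().getName()))
                .subscribeOn(parallel)
                .subscribe();
        }

        Thread.sleep(1_000);
        parallel.dispose();
    }
}
```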
The blocking happens inside Jetty. I wonder what the Jetty team would say given the stack trace: why would this block when using the non-blocking Servlet I/O API? Is it something about the way it's used, or an internal problem? /cc @sbordet, @joakime
Note that we did fix a couple of issues in 5.1.6 (#25102 and #25004), but no idea if that's related.
Comment From: joakime
Is that stacktrace from the combination of Spring: 5.1.15.RELEASE and Jetty 9.4.29.v20200521?
Comment From: scottjohnson
> The issue is probably compounded by the use of `Schedulers.parallel()`, depending on how the `ExecutorService` queues tasks, but even so there is a small number of threads and they'll run out quickly if more get blocked.
Yes, agreed. The use of `Schedulers.parallel()` was unnecessary and only served to exacerbate the effect of the underlying blocking issue.
> Is that stacktrace from the combination of Spring: 5.1.15.RELEASE and Jetty 9.4.29.v20200521?
@joakime yes, that trace is from 5.1.15 and 9.4.29.
Comment From: scottjohnson
> Note that we did fix a couple of issues in 5.1.6 (#25102 and #25004), but no idea if that's related.
We deployed the updated code with 5.2.6, so fingers crossed that this was related and that it doesn't reoccur.
Short of any insights from the Jetty crew, we'll just sit on that and report back if it either happens again, or enough time elapses that it seems resolved. Hard to prove a positive from the absence of evidence, but sometimes that's life for you.
Comment From: sbordet
@rstoyanchev wrote:
> The blocking happens inside Jetty. I wonder what the Jetty team would say given the stack trace: why would this block when using the non-blocking Servlet I/O API?
The stack trace definitely shows that the reads are not using the Servlet 3.1 non-blocking I/O APIs, but the old blocking ones, so blocking is expected to happen if the content has not arrived completely.
If JSON content is read directly by the JSON parser, make sure that you read from the `ServletInputStream` until you read `-1`.
It is common not to read the `-1`, because the JSON parser reports that all the JSON has arrived, but there may be additional bytes to read that have e.g. been split by TCP (or sent later by the client).
In case of HTTP/1.1 chunked content sent by the client, it may happen that the JSON content arrives, but the terminating chunk has not arrived yet - and that will block the read (which will return -1 after reading the terminal chunk). Same happens with HTTP/2, where the JSON content may have arrived in a DATA frame, but the terminating DATA frame has not arrived yet.
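As a concrete illustration of the above, here is a hedged sketch in plain Servlet-style code (not the WebFlux path, which reads through Spring's reactive bridge instead): after the JSON parser reports completion, one would still keep reading until `-1`.

```java
import java.io.IOException;
import javax.servlet.ServletInputStream;

// Hypothetical helper: drain any remaining bytes (e.g. the terminating chunk)
// so that read() eventually returns -1 and the request body is fully consumed.
final class RequestDrainer {

    static void drainFully(ServletInputStream in) throws IOException {
        byte[] buffer = new byte[1024];
        int read;
        do {
            read = in.read(buffer); // blocks until more content or end of stream
        } while (read != -1);       // -1 only after the terminal chunk has arrived
    }
}
```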
To diagnose the problem you may inject a Jetty `Server` dump when the timeout happens.
I think you can use the `Flux`/`Mono` APIs so that upon timeout you call `Server.dumpStdErr()` on the `Server` instance, which will dump the state of the Jetty server to stderr. From that, we will be able to tell what state the `HttpInput` is in and why it is blocking.
Most Jetty components are dumpable, so you may dump only some of them - dumping the `Server` will dump them all.
Another hypothesis is that the client is at fault, and sometimes it does not send the whole JSON content, so the server blocks waiting for the rest of the content that is not arriving.
Comment From: rstoyanchev
Thanks for the analysis @sbordet !
> The stack trace definitely shows that the reads are not using the Servlet 3.1 non-blocking I/O APIs, but the old blocking ones, so blocking is expected to happen if the content has not arrived completely.
This is interesting. WebFlux only uses Servlet 3.1 non-blocking I/O. Here you can see the Servlet which starts an async request immediately, creates request and response adapters to Reactive Streams, invokes the handler which returns `Mono<Void>`, and subscribes to the result. Inside `ServletServerHttpRequest`, the constructor immediately registers a `ReadListener`, and likewise the response registers a `WriteListener`, and that should result in non-blocking I/O. I'm not sure why there is blocking I/O?
> If JSON content is read directly by the JSON parser
WebFlux has a Servlet to Reactive Streams bridge where, through a `ReadListener`, we produce a `Flux<DataBuffer>` for the request body, which in turn is read by `Decoder` implementations, one of which is for Jackson. We use Jackson's async parser, but at that point we're feeding it buffers until the `Flux` completes. In other words, the details of reading from the Servlet API are entirely abstracted, and there are other non-Servlet APIs adapted to the same request and response API.
> I think you can use the `Flux`/`Mono` APIs so that upon timeout you call `Server.dumpStdErr()` on the `Server` instance, which will dump the state of the Jetty server to stderr.
Indeed, a timeout operator can be inserted, probably in a `WebFilter`; a rough sketch is below.
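A rough, untested sketch of such a filter, assuming a reference to the Jetty `Server` has been captured at startup (how it is obtained is left out here) and using an arbitrary 10-second timeout:

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;

import org.eclipse.jetty.server.Server;
import org.springframework.web.server.ServerWebExchange;
import org.springframework.web.server.WebFilter;
import org.springframework.web.server.WebFilterChain;
import reactor.core.publisher.Mono;

// Diagnostic filter: if handling a request takes suspiciously long, dump the
// Jetty server state to stderr so the HttpInput state can be inspected.
class JettyDumpOnTimeoutFilter implements WebFilter {

    private final Server server; // assumed to be captured when Jetty starts

    JettyDumpOnTimeoutFilter(Server server) {
        this.server = server;
    }

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
        return chain.filter(exchange)
                .timeout(Duration.ofSeconds(10))                 // arbitrary threshold
                .doOnError(TimeoutException.class, ex -> server.dumpStdErr());
    }
}
```

Note that the `timeout` operator will also error out the in-flight request; for pure observation, a separate watchdog that only calls `dumpStdErr()` might be preferable.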
Comment From: sbordet
> I'm not sure why there is blocking I/O?
I'm not sure either. Perhaps a race condition where the request has already been read and reset (going back to blocking mode) when this thread tries to read further from it?
Comment From: rstoyanchev
@scottjohnson, as I mentioned in https://github.com/spring-projects/spring-framework/issues/25620#issuecomment-678179277, we did fix concurrency issues in our Servlet / Reactive Streams bridge which could have caused hanging like this. Last time you mentioned that after an upgrade you hadn't seen the issue. Do you have an update on whether the issue went away?
Comment From: spring-projects-issues
If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.
Comment From: spring-projects-issues
Closing due to lack of requested feedback. If you would like us to look at this issue, please provide the requested information and we will re-open the issue.