Affects: v5.3.3

The symptom of this issue is similar to the earlier issue https://github.com/spring-projects/spring-framework/issues/23096 (response handling never completes), though the underlying cause is different.

I've reproduced this issue with a small demo Spring Boot project that uses WebFlux on Tomcat, available here: https://github.com/danielra/webflux_race_investigation The project can be built with ./gradlew clean build and then run via ./start.sh. To reproduce the problem, I've had success using wrk (https://github.com/wg/wrk) to repeatedly throw short bursts of load at the service. For example:

for i in {1..100}; do   ./wrk -t 2 -c 1000 -d 10s --latency --timeout 9s --script ./myPost.lua http://localhost:8080/post; done

Where ./myPost.lua looks like:

wrk.method = "POST"
wrk.body   = "[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]"
wrk.headers["Content-Type"] = "application/json"

(I don't think the request being a POST ended up being relevant, but I started there because I wanted the isolated reproduction to resemble the real system that was experiencing the problem.)

This produces a lot of trace-level logging under the logs directory in the demo project, which can be searched for occurrences of the timeout like so:

find . -name \*.gz -print0 | xargs -0 zgrep "Response timeout"

Another search for the logPrefix found in the resulting log entries can then be performed to gain more context on the processing of the relevant request.
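
For example, reusing the zgrep command above with the [33d6780] logPrefix that appears in the entries shown below (adjust the prefix to whatever the first search turns up):

find . -name \*.gz -print0 | xargs -0 zgrep "33d6780"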

This is an example set of log entries for a repro case (with irrelevant line prefixes omitted):

12:31:47,007 INFO c.e.d.f.DemoWebFilter [http-nio-8080-exec-368] [33d6780] Incoming request observed.
12:31:47,007 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-368] [33d6780] UNSUBSCRIBED -> SUBSCRIBING
12:31:47,007 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-368] [33d6780] 9223372036854775807 requested
12:31:47,007 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-368] [33d6780] SUBSCRIBING -> DEMAND
12:31:47,007 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-368] [33d6780] onDataAvailable
12:31:47,007 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-368] [33d6780] DEMAND -> READING
12:31:47,007 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-368] [33d6780] Read 8092 bytes
12:31:47,007 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-368] [33d6780] Publishing data read
12:31:47,007 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-368] [33d6780] Read 8192 bytes
12:31:47,007 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-368] [33d6780] Publishing data read
12:31:47,007 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-368] [33d6780] Read 4067 bytes
12:31:47,007 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-368] [33d6780] Publishing data read
12:31:47,007 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-368] [33d6780] No more data to read
12:31:47,007 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-368] [33d6780] READING -> DEMAND
12:31:47,007 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-368] [33d6780] onAllDataRead
12:31:47,007 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-368] [33d6780] DEMAND -> COMPLETED
12:31:47,780 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-2] [33d6780] UNSUBSCRIBED -> REQUESTED
12:31:47,780 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-2] [33d6780] Received onNext publisher
12:31:47,780 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-2] [33d6780] REQUESTED -> RECEIVED
12:31:47,780 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-2] [33d6780] UNSUBSCRIBED -> REQUESTED
12:31:47,780 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-2] [33d6780] Item to write
12:31:47,780 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-2] [33d6780] REQUESTED -> RECEIVED
12:31:47,780 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-2] [33d6780] onWritePossible
12:31:47,780 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-2] [33d6780] RECEIVED -> WRITING
12:31:47,780 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-2] [33d6780] Wrote 12931 of 12931 bytes
12:31:47,780 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-2] [33d6780] WRITING -> REQUESTED
12:31:47,780 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-2] [33d6780] No more items to write
12:31:47,780 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-2] [33d6780] REQUESTED -> RECEIVED
12:31:47,780 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-2] [33d6780] isWritePossible: false
12:31:47,780 TRACE _.s.h.s.r.WriteResultPublisher [parallel-2] [33d6780] UNSUBSCRIBED subscribe: org.springframework.http.server.reactive.AbstractListenerWriteFlushProcessor$State$WriteResultSubscriber@7a996bbf
12:31:47,780 TRACE _.s.h.s.r.WriteResultPublisher [parallel-2] [33d6780] SUBSCRIBING request: 9223372036854775807
12:31:47,780 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-2] [33d6780] Received onComplete
12:31:47,780 TRACE _.s.h.s.r.WriteResultPublisher [parallel-2] [33d6780] UNSUBSCRIBED subscribe: reactor.core.publisher.MonoNext$NextSubscriber@53a4d00d
12:31:47,780 TRACE _.s.h.s.r.WriteResultPublisher [parallel-2] [33d6780] SUBSCRIBING request: 9223372036854775807
12:31:47,781 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [http-nio-8080-exec-348] [33d6780] onWritePossible
12:31:47,781 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [http-nio-8080-exec-348] [33d6780] RECEIVED -> COMPLETED
12:31:47,781 TRACE _.s.h.s.r.WriteResultPublisher [http-nio-8080-exec-348] [33d6780] SUBSCRIBED publishComplete
12:31:47,781 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-348] [33d6780] RECEIVED writeComplete
12:31:47,781 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-348] [33d6780] Flush attempt
12:31:47,781 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-348] [33d6780] RECEIVED flushingFailed
12:31:59,588 TRACE _.s.h.s.r.AbstractListenerReadPublisher [parallel-4] [33d6780] Cancellation
12:31:59,588 TRACE _.s.h.s.r.WriteResultPublisher [parallel-4] [33d6780] SUBSCRIBED cancel
12:31:59,588 ERROR c.e.d.f.DemoWebFilter [parallel-4] [33d6780] Response timeout after 12581 milliseconds for null request with uri 'http://localhost:8080/post'. Response status code was already committed: '200 OK'.
12:31:59,588 TRACE o.s.h.s.r.ServletHttpHandlerAdapter [parallel-4] [33d6780] Handling completed

Here the line "12:31:47,781 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-348] [33d6780] RECEIVED flushingFailed" comes from a modification I made locally to spring-framework to add an extra trace log line here: https://github.com/spring-projects/spring-framework/blob/v5.3.3/spring-web/src/main/java/org/springframework/http/server/reactive/AbstractListenerWriteFlushProcessor.java#L194-L195 like so:

if (rsWriteFlushLogger.isTraceEnabled()) {
    rsWriteFlushLogger.trace(this.logPrefix + this.state + " flushingFailed", t);
}

For reference, the Throwable logged at this line in these reproductions with the demo project looks like:

org.apache.catalina.connector.ClientAbortException: java.io.IOException: Broken pipe
        at org.apache.catalina.connector.OutputBuffer.doFlush(OutputBuffer.java:309) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        at org.apache.catalina.connector.OutputBuffer.flush(OutputBuffer.java:272) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        at org.apache.catalina.connector.CoyoteOutputStream.flush(CoyoteOutputStream.java:118) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        at org.springframework.http.server.reactive.ServletServerHttpResponse.flush(ServletServerHttpResponse.java:198) ~[spring-web-5.3.3.jar!/:5.3.3]
        at org.springframework.http.server.reactive.ServletServerHttpResponse.access$500(ServletServerHttpResponse.java:50) ~[spring-web-5.3.3.jar!/:5.3.3]
        at org.springframework.http.server.reactive.ServletServerHttpResponse$ResponseBodyFlushProcessor.flush(ServletServerHttpResponse.java:316) ~[spring-web-5.3.3.jar!/:5.3.3]
        at org.springframework.http.server.reactive.AbstractListenerWriteFlushProcessor$State$3.writeComplete(AbstractListenerWriteFlushProcessor.java:291) ~[spring-web-5.3.3.jar!/:5.3.3]
        at org.springframework.http.server.reactive.AbstractListenerWriteFlushProcessor$State$WriteResultSubscriber.onComplete(AbstractListenerWriteFlushProcessor.java:437) ~[spring-web-5.3.3.jar!/:5.3.3]
        at org.springframework.http.server.reactive.WriteResultPublisher$State.publishComplete(WriteResultPublisher.java:256) ~[spring-web-5.3.3.jar!/:5.3.3]
        at org.springframework.http.server.reactive.WriteResultPublisher.publishComplete(WriteResultPublisher.java:84) ~[spring-web-5.3.3.jar!/:5.3.3]
        at org.springframework.http.server.reactive.AbstractListenerWriteProcessor.changeStateToComplete(AbstractListenerWriteProcessor.java:280) ~[spring-web-5.3.3.jar!/:5.3.3]
        at org.springframework.http.server.reactive.AbstractListenerWriteProcessor.access$300(AbstractListenerWriteProcessor.java:46) ~[spring-web-5.3.3.jar!/:5.3.3]
        at org.springframework.http.server.reactive.AbstractListenerWriteProcessor$State$3.onWritePossible(AbstractListenerWriteProcessor.java:368) ~[spring-web-5.3.3.jar!/:5.3.3]
        at org.springframework.http.server.reactive.AbstractListenerWriteProcessor.onWritePossible(AbstractListenerWriteProcessor.java:153) ~[spring-web-5.3.3.jar!/:5.3.3]
        at org.springframework.http.server.reactive.ServletServerHttpResponse$ResponseBodyWriteListener.onWritePossible(ServletServerHttpResponse.java:270) ~[spring-web-5.3.3.jar!/:5.3.3]
        at org.apache.coyote.Response.onWritePossible(Response.java:762) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        at org.apache.catalina.connector.CoyoteAdapter.asyncDispatch(CoyoteAdapter.java:188) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        at org.apache.coyote.AbstractProcessor.dispatch(AbstractProcessor.java:241) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:59) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:888) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1597) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_282]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_282]
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
Caused by: java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[?:1.8.0_282]
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[?:1.8.0_282]
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[?:1.8.0_282]
        at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[?:1.8.0_282]
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470) ~[?:1.8.0_282]
        at org.apache.tomcat.util.net.NioChannel.write(NioChannel.java:138) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doWrite(NioEndpoint.java:1269) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        at org.apache.tomcat.util.net.SocketWrapperBase.doWrite(SocketWrapperBase.java:764) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        at org.apache.tomcat.util.net.SocketWrapperBase.flushNonBlocking(SocketWrapperBase.java:735) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        at org.apache.tomcat.util.net.SocketWrapperBase.flush(SocketWrapperBase.java:709) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        at org.apache.coyote.http11.Http11OutputBuffer$SocketOutputBuffer.flush(Http11OutputBuffer.java:572) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        at org.apache.coyote.http11.filters.IdentityOutputFilter.flush(IdentityOutputFilter.java:117) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        at org.apache.coyote.http11.Http11OutputBuffer.flush(Http11OutputBuffer.java:220) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        at org.apache.coyote.http11.Http11Processor.flush(Http11Processor.java:1195) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        at org.apache.coyote.AbstractProcessor.action(AbstractProcessor.java:402) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        at org.apache.coyote.Response.action(Response.java:209) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        at org.apache.catalina.connector.OutputBuffer.doFlush(OutputBuffer.java:305) ~[tomcat-embed-core-9.0.41.jar!/:9.0.41]
        ... 25 more

This appears to me to be a result of the client hanging up the connection before response handling is complete on the server.

After this flushingFailed log line, there is a relatively large (~12 second) gap before the next log line, which was triggered by the response timeout set up in the DemoWebFilter of the demo project here: https://github.com/danielra/webflux_race_investigation/blob/main/src/main/java/com/example/demo/filter/DemoWebFilter.java#L38 Only after this timeout does ServletHttpHandlerAdapter log a line indicating that handling is complete.
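
For context, the shape of that timeout filter is roughly as follows. This is only a minimal sketch; the class name, the 10-second duration, and the logging are illustrative and not the actual DemoWebFilter code from the demo project:

import java.time.Duration;
import java.util.concurrent.TimeoutException;

import org.springframework.web.server.ServerWebExchange;
import org.springframework.web.server.WebFilter;
import org.springframework.web.server.WebFilterChain;

import reactor.core.publisher.Mono;

public class ResponseTimeoutWebFilter implements WebFilter {

    private static final Duration TIMEOUT = Duration.ofSeconds(10); // illustrative value

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
        long start = System.currentTimeMillis();
        return chain.filter(exchange)
                // emits a TimeoutException if response handling never completes
                .timeout(TIMEOUT)
                .doOnError(TimeoutException.class, ex ->
                        System.err.println("Response timeout after "
                                + (System.currentTimeMillis() - start) + " milliseconds for uri '"
                                + exchange.getRequest().getURI() + "'"));
    }
}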

I note that the flushingFailed method where I added the extra log line above has a comment which reads:

Invoked when an error happens while flushing. Sub-classes may choose to ignore this if they know the underlying API will provide an error notification in a container thread. Defaults to no-op.

Based on this description and the observed behavior, it seems that a notification of the error was expected elsewhere but never actually arrived, so handling of the response never completes (absent a timeout). I don't know enough yet to say whether that is due to a bug in the code that should have emitted the error notification, a bug on the listening side, or whether flushingFailed should have a real non-no-op implementation in the relevant subclass for the Tomcat case. In any case, my expectation is that regardless of when a client hangs up the connection, the server should proceed with completing response handling.

Please let me know if any additional information would be helpful. Thank you for your time!


Comment From: rstoyanchev

Thanks for the detailed report.

We're currently not handling AbstractListenerWriteFlushProcessor#flushingFailed for Servlet containers because the expectation is that the Servlet container will call our ServletServerHttpResponse$ResponseAsyncListener#onError, which in turn cancels the write publisher and sends an onError for the result of handling. However, that is not happening in this case, probably because we're already inside a container thread (via onWritePossible) and Tomcat expects us to propagate the error, which would then be handled in CoyoteAdapter.asyncDispatch by calling the AsyncListener.

I've made a change in flushingFailed to cancel the write publisher and send an onError downstream. The sample no longer produces those "Response timeout" messages, but it would be great if you could verify against your own code to confirm.

Comment From: danielra

Thanks for the explanation! I ran the demo scenario again, and I no longer see response timeouts associated with the flushingFailed case - but I have noticed a new timeout case. The trace logs look like this:

14:25:08,314 INFO c.e.d.f.DemoWebFilter [http-nio-8080-exec-14] [47665100] Incoming request observed.
14:25:08,314 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-14] [47665100] UNSUBSCRIBED -> SUBSCRIBING
14:25:08,314 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-14] [47665100] 9223372036854775807 requested
14:25:08,314 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-14] [47665100] SUBSCRIBING -> DEMAND
14:25:08,314 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-14] [47665100] onDataAvailable
14:25:08,314 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-14] [47665100] DEMAND -> READING
14:25:08,314 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-14] [47665100] Read 8092 bytes
14:25:08,314 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-14] [47665100] Publishing data read
14:25:08,314 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-14] [47665100] Read 8192 bytes
14:25:08,314 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-14] [47665100] Publishing data read
14:25:08,314 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-14] [47665100] Read 4067 bytes
14:25:08,314 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-14] [47665100] Publishing data read
14:25:08,314 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-14] [47665100] No more data to read
14:25:08,314 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-14] [47665100] READING -> DEMAND
14:25:08,314 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-14] [47665100] onAllDataRead
14:25:08,314 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-14] [47665100] DEMAND -> COMPLETED
14:25:11,256 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-1] [47665100] UNSUBSCRIBED -> REQUESTED
14:25:11,256 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-1] [47665100] Received onNext publisher
14:25:11,256 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-1] [47665100] REQUESTED -> RECEIVED
14:25:11,256 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-1] [47665100] UNSUBSCRIBED -> REQUESTED
14:25:11,256 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-1] [47665100] Item to write
14:25:11,256 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-1] [47665100] REQUESTED -> RECEIVED
14:25:11,256 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-1] [47665100] onWritePossible
14:25:11,256 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-1] [47665100] RECEIVED -> WRITING
14:25:11,262 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-1] [47665100] Wrote 468993 of 468993 bytes
14:25:11,262 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-1] [47665100] WRITING -> REQUESTED
14:25:11,262 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-1] [47665100] No more items to write
14:25:11,262 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-1] [47665100] REQUESTED -> RECEIVED
14:25:11,262 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-1] [47665100] isWritePossible: false
14:25:11,262 TRACE _.s.h.s.r.WriteResultPublisher [parallel-1] [47665100] UNSUBSCRIBED subscribe: org.springframework.http.server.reactive.AbstractListenerWriteFlushProcessor$State$WriteResultSubscriber@768a373c
14:25:11,262 TRACE _.s.h.s.r.WriteResultPublisher [parallel-1] [47665100] SUBSCRIBING request: 9223372036854775807
14:25:11,262 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-1] [47665100] Received onComplete
14:25:11,262 TRACE _.s.h.s.r.WriteResultPublisher [parallel-1] [47665100] UNSUBSCRIBED subscribe: reactor.core.publisher.MonoNext$NextSubscriber@2455b45e
14:25:11,262 TRACE _.s.h.s.r.WriteResultPublisher [parallel-1] [47665100] SUBSCRIBING request: 9223372036854775807
14:25:19,015 TRACE _.s.h.s.r.AbstractListenerReadPublisher [parallel-4] [47665100] Cancellation
14:25:19,015 TRACE _.s.h.s.r.WriteResultPublisher [parallel-4] [47665100] SUBSCRIBED cancel
14:25:19,015 ERROR c.e.d.f.DemoWebFilter [parallel-4] [47665100] Response timeout after 10701 milliseconds for POST request with uri 'http://localhost:8080/post'. Response status code was already committed: '200 OK'.
14:25:19,015 TRACE o.s.h.s.r.ServletHttpHandlerAdapter [parallel-4] [47665100] Handling completed
14:25:19,040 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-95] [47665100] onAllDataRead
14:25:19,040 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-95] [47665100] Received request to cancel
14:25:19,040 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-95] [47665100] Received onComplete
14:25:19,040 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [http-nio-8080-exec-95] [47665100] Cancellation
14:25:19,040 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [http-nio-8080-exec-95] [47665100] No more items to write

In this case, what I note is that this line occurs:

14:25:11,262 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-1] [47665100] Received onComplete

Apparently this happened while in the RECEIVED state, indicating that processor.subscriberCompleted would be set to true (here: https://github.com/spring-projects/spring-framework/blob/v5.3.3/spring-web/src/main/java/org/springframework/http/server/reactive/AbstractListenerWriteFlushProcessor.java#L306 ), but then there doesn't seem to be an expected writeComplete call to trigger the call to handleSubscriberCompleted that would occur here: https://github.com/spring-projects/spring-framework/blob/v5.3.3/spring-web/src/main/java/org/springframework/http/server/reactive/AbstractListenerWriteFlushProcessor.java#L295-L296 . A simplified sketch of this hand-off follows below. I haven't had time to dig in further yet. Please let me know if you think opening a new issue is merited, since this is no longer the original flush-failure case.
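
For clarity, here is a greatly simplified sketch of the hand-off described above. It is not the actual AbstractListenerWriteFlushProcessor code; the method names are kept only to show the shape of the interaction:

class FlushProcessorSketch {

    private volatile boolean subscriberCompleted;

    // Runs on a Reactor thread when the upstream publisher completes while the
    // processor is still in the RECEIVED state.
    void onComplete() {
        this.subscriberCompleted = true; // remember that upstream is done
    }

    // Expected to be called later from a container thread once the current write finishes.
    void writeComplete() {
        if (this.subscriberCompleted) {
            handleSubscriberCompleted(); // only then is the response completed
        }
    }

    void handleSubscriberCompleted() {
        // flush and complete the response; if writeComplete() never arrives,
        // this is never reached and the request hangs until an external timeout
    }
}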

Comment From: rstoyanchev

Thanks for testing. I'll have another look.

Comment From: rstoyanchev

but then there doesn't seem to be an expected writeComplete call to trigger the call to handleSubscriberCompleted

Indeed, and that call should be made when the "current" AbstractListenerWriteProcessor notifies completion via publishComplete() to its WriteResultPublisher.

From the output of the current write processor it looks like it writes its single item and requests another, then receives onComplete and switches back to RECEIVED but gets "false" from ServletOutputStream#isReady():

14:25:11,262 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-1] [47665100] No more items to write
14:25:11,262 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-1] [47665100] REQUESTED -> RECEIVED
14:25:11,262 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-1] [47665100] isWritePossible: false

So at this point it's waiting on a WriteListener#onWritePossible callback from the container, which would make it advance to COMPLETED, but that doesn't happen before the timeout.

Any tips on how to reproduce this one?

Comment From: danielra

I seem to be able to reliably reproduce this on my machine when running the steps described in my original report on this issue, with the one difference being that I have the following includeBuild clause in the settings.gradle of the demo project:

includeBuild '/home/daniel/webflux_race_investigation/spring-framework/'

That directory is a local clone of spring-framework checked out at tag v5.3.3 with the following diff applied, representing your prior fix (plus an additional log line):

git diff HEAD
diff --git a/spring-web/src/main/java/org/springframework/http/server/reactive/AbstractListenerWriteFlushProcessor.java b/spring-web/src/main/java/org/springframework/http/server/reactive/AbstractListenerWriteFlushProcessor.java
index 81c8098d32..d124192c7a 100644
--- a/spring-web/src/main/java/org/springframework/http/server/reactive/AbstractListenerWriteFlushProcessor.java
+++ b/spring-web/src/main/java/org/springframework/http/server/reactive/AbstractListenerWriteFlushProcessor.java
@@ -192,6 +192,11 @@ public abstract class AbstractListenerWriteFlushProcessor<T> implements Processo
         * <p>Defaults to no-op.
         */
        protected void flushingFailed(Throwable t) {
+               if (rsWriteFlushLogger.isTraceEnabled()) {
+                       rsWriteFlushLogger.trace(this.logPrefix + this.state + " flushingFailed", t);
+               }
+               cancel();
+               onError(t);
        }

Assuming that it is not reproducing as easily for you, is there anywhere in particular that you think it would be interesting for me to add additional trace logging to gather more information? I should have a bit of time tomorrow to look at this again.

Comment From: rstoyanchev

After the last fix I wasn't seeing timeouts any more, and in your latest log I saw a 3-second delay for the handling, so I thought there might be different instructions involved.

Today I could get several timeouts waiting on a container callback; sometimes it's the "current processor" waiting for onWritePossible, and other times it makes it to the final flush and hangs there:

2021-02-11 11:42:41,960 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-187] [6970d84f] RECEIVED writeComplete
2021-02-11 11:42:41,960 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-187] [6970d84f] Flush attempt
2021-02-11 11:42:50,232 TRACE _.s.h.s.r.AbstractListenerReadPublisher [parallel-3] [6970d84f] Cancellation

If the flush had completed, we should see RECEIVED -> REQUESTED immediately after, or a cancellation/error if the flush had failed. Since we see neither, I think it is stuck trying to flush.

It occurred to me it might be related to the load, so I increased the timeout to 30 seconds but still got the timeouts. The other thing is that ThreadLocalRandom can be CPU intensive and block, which may exhaust the container's thread pool, which runs with a small number of threads for non-blocking I/O. So I tried replacing those calls with fixed numbers:

// final int delayMillis = ThreadLocalRandom.current().nextInt(10) + latencyBase;
// final int intListSize = ThreadLocalRandom.current().nextInt(MAX_INT_LIST_SIZE);
final int delayMillis = 10 + latencyBase;
final int intListSize = MAX_INT_LIST_SIZE;

Now I can no longer reproduce it.

Comment From: danielra

Yeah, I too was starting to wonder whether unlucky scheduling under load was simply causing occasionally very latent responses. However, I've done a bunch more experimenting and I still think there is a real issue in play here. I similarly hard-coded the delay and response size (though the intent of the -Djava.security.egd=file:/dev/urandom option in the demo start script was to avoid the random generation potentially blocking), increased the response timeout to 60 seconds, and additionally added a 10-second sleep between the wrk runs to let CPU drop back to near zero for a period (giving plenty of opportunity to catch up) - and I am still able to reproduce the response timeouts. The repros are a bit less frequent without the random timing/sizing, though. For example:

12:25:25,726 INFO c.e.d.f.DemoWebFilter [http-nio-8080-exec-111] [5ac5ec14] Incoming request observed.
12:25:25,726 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-111] [5ac5ec14] UNSUBSCRIBED -> SUBSCRIBING                                                                                        
12:25:25,726 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-111] [5ac5ec14] 9223372036854775807 requested                                                                                      
12:25:25,726 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-111] [5ac5ec14] SUBSCRIBING -> DEMAND                                                                                              
12:25:25,726 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-111] [5ac5ec14] onDataAvailable                                                                                                    
12:25:25,726 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-111] [5ac5ec14] DEMAND -> READING                                                                                                  
12:25:25,726 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-111] [5ac5ec14] Read 8092 bytes                                                                                                    
12:25:25,726 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-111] [5ac5ec14] Publishing data read                                                                                               
12:25:25,726 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-111] [5ac5ec14] Read 8192 bytes                                                                                                    
12:25:25,726 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-111] [5ac5ec14] Publishing data read                                                                                               
12:25:25,726 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-111] [5ac5ec14] Read 4067 bytes                                                                                                    
12:25:25,726 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-111] [5ac5ec14] Publishing data read                                                                                               
12:25:25,726 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-111] [5ac5ec14] No more data to read                                                                                               
12:25:25,726 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-111] [5ac5ec14] READING -> DEMAND                                                                                                  
12:25:25,726 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-111] [5ac5ec14] onAllDataRead                                                                                                      
12:25:25,726 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-111] [5ac5ec14] DEMAND -> COMPLETED                                                                                                
12:25:25,726 INFO c.e.d.c.DemoController [http-nio-8080-exec-111] [5ac5ec14] Controller called. Delay: 6. Size: 30519                                                                                             
12:25:25,795 INFO c.e.d.c.DemoController [parallel-2] [5ac5ec14] Mono returned from Controller doOnSuccess called.                                                                                                
12:25:25,795 INFO c.e.d.c.DemoController [parallel-2] [5ac5ec14] Mono returned from Controller doOnTerminate called.                                                                                              
12:25:25,796 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-2] [5ac5ec14] UNSUBSCRIBED -> REQUESTED                                                                                                
12:25:25,796 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-2] [5ac5ec14] Received onNext publisher                                                                                                
12:25:25,796 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-2] [5ac5ec14] REQUESTED -> RECEIVED                                                                                                    
12:25:25,796 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-2] [5ac5ec14] UNSUBSCRIBED -> REQUESTED                                                                                                     
12:25:25,796 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-2] [5ac5ec14] Item to write
12:25:25,796 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-2] [5ac5ec14] REQUESTED -> RECEIVED                                                                                                         
12:25:25,796 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-2] [5ac5ec14] isWritePossible: false                                                                                                        
12:25:25,796 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-2] [5ac5ec14] No more items to write                                                                                                        
12:25:25,796 TRACE _.s.h.s.r.WriteResultPublisher [parallel-2] [5ac5ec14] UNSUBSCRIBED subscribe: org.springframework.http.server.reactive.AbstractListenerWriteFlushProcessor$State$WriteResultSubscriber@7c2c4f9b
12:25:25,796 TRACE _.s.h.s.r.WriteResultPublisher [parallel-2] [5ac5ec14] SUBSCRIBING request: 9223372036854775807                                                                                                
12:25:25,796 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-2] [5ac5ec14] Received onComplete                                                                                                      
12:25:25,796 TRACE _.s.h.s.r.WriteResultPublisher [parallel-2] [5ac5ec14] UNSUBSCRIBED subscribe: reactor.core.publisher.MonoNext$NextSubscriber@79b742d0                                                         
12:25:25,796 TRACE _.s.h.s.r.WriteResultPublisher [parallel-2] [5ac5ec14] SUBSCRIBING request: 9223372036854775807                                                                                                
12:26:25,730 TRACE _.s.h.s.r.AbstractListenerReadPublisher [parallel-1] [5ac5ec14] Cancellation
12:26:25,730 TRACE _.s.h.s.r.WriteResultPublisher [parallel-1] [5ac5ec14] SUBSCRIBED cancel
12:26:25,731 ERROR c.e.d.f.DemoWebFilter [parallel-1] [5ac5ec14] Response timeout after 60004 milliseconds for POST request with uri 'http://localhost:8080/post'. Response status code was already committed: '200 OK'.
12:26:25,731 TRACE o.s.h.s.r.ServletHttpHandlerAdapter [parallel-1] [5ac5ec14] Handling completed
12:26:25,731 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-140] [5ac5ec14] onAllDataRead                                                                                                      
12:26:25,731 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-140] [5ac5ec14] Received request to cancel                                                                                   
12:26:25,731 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-140] [5ac5ec14] Received onComplete                                                                                          
12:26:25,731 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [http-nio-8080-exec-140] [5ac5ec14] Cancellation                                                                                                      
12:26:25,731 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [http-nio-8080-exec-140] [5ac5ec14] No more items to write

This trace was obtained running the demo from this feature branch (but also specifying the local spring-framework project with your original fix applied in settings.gradle, as mentioned previously): https://github.com/danielra/webflux_race_investigation/tree/static-delay-and-response-size

With the wrk load and sleeps run as follows:

for i in {1..100}; do   ./wrk -t 2 -c 1000 -d 10s --latency --timeout 9s --script ./myPost.lua http://localhost:8080/post; sleep 10; done

I did see some errors like the following in the logs as well, but they are apparently not time correlated with the response timeouts - so I don't think they are related:

12:25:15,592 ERROR o.a.c.h.Http11NioProtocol [http-nio-8080-exec-142] Error reading request, ignored
java.lang.IllegalStateException: Calling [asyncPostProcess()] is not valid for a request with Async state [ERROR]
12:24:35,284 ERROR o.a.c.h.Http11NioProtocol [http-nio-8080-exec-115] Error reading request, ignored
java.lang.IllegalStateException: Calling [asyncPostProcess()] is not valid for a request with Async state [ERROR]

Comment From: rstoyanchev

I've changed to the exact same settings from your latest commit. I can see timeouts but all are waiting on a write callback from Tomcat or a flush that's stuck.

wrk is still called with --timeout 9s, so some connections are probably getting closed from the client side. Furthermore, running wrk in 10-second intervals probably means more connections get closed whenever wrk exits.

If I remove the --timeout parameter from the wrk command and use -d 5m instead of looping in 10-second iterations, I no longer see timeouts. If I remove only the --timeout and increase the server timeout to 120 seconds, I see timeouts, and virtually all of them are stuck trying to flush.

So perhaps connections getting closed from the client side lead to this under load. From what I can see though, this is in Tomcat, and I'm not sure what we can do other than perhaps set a timeout.

@markt-asf, I'm wondering if you have any comments on this one? Basically WebFlux on Tomcat as in #26407, under load with some connections getting closed from the client side due to a timeout. We're seeing that either onWritePossible never comes or a flush never completes.

Comment From: markt-asf

Possibly an instance of https://bz.apache.org/bugzilla/show_bug.cgi?id=65001 ? Updating to Spring Boot 2.4.3 should pick up Tomcat 9.0.43 and the fix for that issue. Better would be to pick up the latest Tomcat 9.0.x from the ASF snapshot repo and test that as there is a potential fix for #26407 in the current 9.0.x dev code.

Comment From: rstoyanchev

I tried with 9.0-SNAPSHOT and it didn't help. It's probably something else.

The case here is also related to clients disconnecting. When the wrk client is set to run in a loop of 10-second iterations, it's relatively easy to demonstrate. If it's set to run continuously, e.g. for 5 minutes (and without any --timeout), then it doesn't happen. In all cases I've seen, the input is read fully, and after that either outputStream.isReady() returns false after the first (and final) write and before the final flush, or it hangs in the final flush, without any of the further log messages that should otherwise appear immediately after we come out of the flush.

Comment From: markt-asf

Do you have a WAR file I can use to reproduce this on Tomcat?

Comment From: rstoyanchev

This is the sample. You can add id 'war' around here and then ./gradlew bootWar will create a war.

For the load:

for i in {1..100}; do   ${pathToWrk}/wrk -t 2 -c 1000 -d 10s --latency --timeout 9s --script ./myPost.lua http://localhost:8080/post; done

and the myPost.lua script:

wrk.method = "POST"
wrk.body   = "[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]"
wrk.headers["Content-Type"] = "application/json"

The use of ThreadLocalRandom in the controller seems to make matters worse. Arguably code that can block shouldn't be used in non-blocking handling. Here is a branch where this is replaced with non-random, fixed values.

Comment From: markt-asf

All I get if I do that is 404 responses. Looking in JMX it does not appear that any application Servlets or Filters are configured.

Comment From: rstoyanchev

Sorry, my instructions were incomplete. I don't often run Boot apps as a war and I'm having trouble with it. Let me ask, would it work for you to run it in embedded mode, via ./gradlew bootRun or within the IDE just starting DemoApplication as a regular Java class? Or do you need it to be a war archive?

Comment From: markt-asf

I need to be able to run the test case using my local Tomcat build and to be able to connect my IDE to perform remote debugging. Running as a Java class in the IDE should allow me to do both of those (with a few tweaks to the project configuration) if generating a WAR is proving tricky.

Comment From: bclozel

@rstoyanchev Sorry for misleading you earlier today - Spring Boot doesn't support WAR packaging for WebFlux applications. So this is not a typo or misconfiguration on your side, this is just something we don't support.

Comment From: markt-asf

I have this running locally now. Not much luck reproducing any errors though. Some idea of how often I should expect to see errors for a given load command would be helpful.

Comment From: danielra

FWIW, on my machine I was typically able to reproduce the issue multiple times within a minute or so on the main branch of the demo project, which has random delays / response sizes. With the static values in the static-delay-and-response-size branch I only saw one repro over my entire run of:

for i in {1..100}; do   ./wrk -t 2 -c 1000 -d 10s --latency --timeout 9s --script ./myPost.lua http://localhost:8080/post; sleep 10; done

for which I posted the trace above.

Comment From: markt-asf

Thanks. It is taking a little longer for me but I am seeing the errors now. Also, there is a direct correlation between the two errors. For every

java.lang.IllegalStateException: Calling [asyncPostProcess()] is not valid for a request with Async state [ERROR]

I later see one

[parallel-1] ERROR com.example.demo.filter.DemoWebFilter - [60407469] Response timeout after 10002 milliseconds for null request with uri 'http://localhost:8080/post'. Response status code was already committed: '200 OK'.

I don't think I am going to get much further with this today. I'll come back to it tomorrow.

Comment From: rstoyanchev

No worries, thanks for taking a look!

The error "Response status code was already committed" comes from HttpWebHandlerAdapter and is based on a check of an internal variable in AbstractServerHttpResponse. This variable is set to committed when the application has called the writeWith method on the response or setComplete (without a body) so it is mostly an internal check to prevent setting the response twice.

The purpose of the DemoWebFilter, on the other hand, is to time out requests that haven't completed. When the timeout kicks in, the application controller has already returned and we've started writing, and therefore the response is committed from the point of view of AbstractServerHttpResponse. So as the timeout error signal propagates through the end of the WebFilter chain, the error message indicates that it's too late to set a different error status. This is expected.
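
To illustrate what "committed" means for application code handling such an error, here is a sketch only, not the actual HttpWebHandlerAdapter or DemoWebFilter logic; the class and method names are made up:

import org.springframework.http.HttpStatus;
import org.springframework.http.server.reactive.ServerHttpResponse;

import reactor.core.publisher.Mono;

class CommittedResponseCheckSketch {

    Mono<Void> handleTimeout(ServerHttpResponse response, Throwable ex) {
        if (response.isCommitted()) {
            // Writing has already started, so the status line is on the wire;
            // all that can be done is log and propagate the error.
            return Mono.error(ex);
        }
        // Otherwise an error status can still be sent to the client.
        response.setStatusCode(HttpStatus.SERVICE_UNAVAILABLE);
        return response.setComplete();
    }
}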

Comment From: rstoyanchev

I'm not sure what "Calling [asyncPostProcess()] is not valid for a request with Async state" is related to. One thing I can point out is that in case response handling completes with an error signal, if the underlying Tomcat response isn't committed yet, we do a dispatch in order to raise the error on a container thread and finish the response with an actual error status.

Comment From: danielra

@markt-asf FWIW, in my test run posted above I saw two of the asyncPostProcess() IllegalStateException log entries:

12:25:15,592 ERROR o.a.c.h.Http11NioProtocol [http-nio-8080-exec-142] Error reading request, ignored
java.lang.IllegalStateException: Calling [asyncPostProcess()] is not valid for a request with Async state [ERROR]
12:24:35,284 ERROR o.a.c.h.Http11NioProtocol [http-nio-8080-exec-115] Error reading request, ignored
java.lang.IllegalStateException: Calling [asyncPostProcess()] is not valid for a request with Async state [ERROR]

But only one log entry indicating a response timeout (a repro of the originally reported issue):

12:26:25,731 ERROR c.e.d.f.DemoWebFilter [parallel-1] [5ac5ec14] Response timeout after 60004 milliseconds for POST request with uri 'http://localhost:8080/post'. Response status code was already committed: '200 OK'.

In this case there didn't appear to be an obvious timing correlation between the two types of entries - the closest, timing-wise, was logged at 12:25:15,592, which was ~10 seconds before the first log line in the captured trace from the demo project:

12:25:25,726 INFO c.e.d.f.DemoWebFilter [http-nio-8080-exec-111] [5ac5ec14] Incoming request observed.

Comment From: markt-asf

Thanks again for the test case. With a few tweaks I was able to narrow down the root cause and build a test case for Tomcat that triggered the error every time. There were multiple things going on:

  • The default error handling attempts to flush the response body if the response has been committed. The idea is to get what we can to the client. However, after an I/O error on write this flush is pointless as it triggers its own I/O error that then overwrites the original error. I've modified the default error handling to skip the flush when handling an I/O error on write. That made debugging a little easier.
  • If an I/O error occurred during the initial non-blocking write in WriteListener.onWritePossible() (rather than on any subsequent background write used to write data that wasn't written in the initial write), AsyncListener.onError was not triggered. This has been fixed. Incidentally, this is why the error was observed intermittently: an I/O error was triggered by the client disconnect. If that I/O error happened during a background write, everything worked as expected. If the I/O error occurred in the initial non-blocking write, then you would see the problem described here.
  • The Tomcat code assumed that IOExceptions during WriteListener.onWritePossible() and ReadListener.onDataAvailable() were not caught and would be visible to the container. WebFlux swallows them. The expected behaviour here is not explicitly called out in the specification. You can construct an argument for the behaviour Tomcat expects from the existence of the throws IOException and the associated Javadoc, but it rests on inference rather than a clearly documented specification requirement. Since Tomcat knows if an IOException is thrown in those methods, I have modified the calling code to handle any such exception the same way irrespective of whether the application-provided listeners swallow the exception or re-throw it. I'll raise an issue with the Jakarta Servlet project to get the behaviour here clarified.

These issues have all been fixed in:

  • 10.0.x for 10.0.3 onwards
  • 9.0.x for 9.0.44 onwards
  • 8.5.x for 8.5.64 onwards

With these fixes in place, I ran the provided test case for 20 minutes (10s wrk, 10s sleep, repeat) without observing any issues.

A snapshot build with these fixes should be available in the ASF snapshot repository in the next couple of hours. (There is a bit of a CI backlog at the moment.)

Comment From: danielra

Thanks for the update! I believe I've pulled in the correct 9.0-SNAPSHOT version of Tomcat, and I also upgraded to spring-boot 2.4.3 to pull in the prior fix on the spring-framework side. Unfortunately, I am still seeing repros of the response timeout with my demo project - though the traces do look a bit different now, and seem to actually be a bit more frequent. I've pushed a new feature branch of the demo project which I'm using to repro here: https://github.com/danielra/webflux_race_investigation/tree/tomcat-snapshot-version (Note that this branch is also using the randomized delay / response size since historically I've had quicker repros with that setup.)

One detail I notice is that the wrk output actually reports some socket "read" errors now, which I hadn't observed before. For example:

Running 10s test @ http://localhost:8080/post
  2 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.73s   595.64ms   3.07s    59.54%
    Req/Sec   264.52    125.69     1.09k    77.60%
  Latency Distribution
     50%    1.68s
     75%    2.16s
     90%    2.65s
     99%    2.95s
  5106 requests in 10.07s, 1.38GB read
  Socket errors: connect 0, read 23, write 0, timeout 0
Requests/sec:    507.16
Transfer/sec:    140.61MB

The traces I have seen so far when the response timeout reproduces look like this:

16:39:01,777 INFO c.e.d.f.DemoWebFilter [http-nio-8080-exec-32] [169254d] Incoming request observed.
16:39:01,777 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-32] [169254d] UNSUBSCRIBED -> SUBSCRIBING                                                                                          
16:39:01,777 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-32] [169254d] 9223372036854775807 requested                                                                                        
16:39:01,777 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-32] [169254d] SUBSCRIBING -> DEMAND                                                                                                
16:39:01,777 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-32] [169254d] onDataAvailable                                                                                                      
16:39:01,777 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-32] [169254d] DEMAND -> READING                                                                                                    
16:39:01,777 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-32] [169254d] Read 8092 bytes                                                                                                      
16:39:01,777 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-32] [169254d] Publishing data read                                                                                                 
16:39:01,777 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-32] [169254d] Read 8192 bytes                                                                                                      
16:39:01,777 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-32] [169254d] Publishing data read                                                                                                 
16:39:01,777 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-32] [169254d] Read 4067 bytes                                                                                                      
16:39:01,777 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-32] [169254d] Publishing data read                                                                                                 
16:39:01,777 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-32] [169254d] No more data to read                                                                                                 
16:39:01,777 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-32] [169254d] READING -> DEMAND                                                                                                    
16:39:01,777 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-32] [169254d] onAllDataRead                                                                                                        
16:39:01,777 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-32] [169254d] DEMAND -> COMPLETED                                                                                                  
16:39:01,777 INFO c.e.d.c.DemoController [http-nio-8080-exec-32] [169254d] Controller called. Delay: 6. Size: 82829                                                                                               
16:39:01,777 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-32] [169254d] Connection error: java.io.IOException: Broken pipe                                                                   
16:39:01,777 DEBUG o.s.h.s.r.ServletHttpHandlerAdapter [http-nio-8080-exec-32] [169254d] Error notification: java.io.IOException: Broken pipe                                                                     
16:39:01,777 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-32] [169254d] onAllDataRead                                                                                                        
16:39:01,954 INFO c.e.d.c.DemoController [parallel-4] [169254d] Mono returned from Controller doOnSuccess called.                                                                                                 
16:39:01,954 INFO c.e.d.c.DemoController [parallel-4] [169254d] Mono returned from Controller doOnTerminate called.                                                                                               
16:39:01,971 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-4] [169254d] UNSUBSCRIBED -> REQUESTED                                                                                                 
16:39:01,971 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-4] [169254d] Received onNext publisher                                                                                                 
16:39:01,971 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-4] [169254d] REQUESTED -> RECEIVED                                                                                                     
16:39:01,971 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-4] [169254d] UNSUBSCRIBED -> REQUESTED                                                                                                      
16:39:01,971 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-4] [169254d] Item to write
16:39:01,971 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-4] [169254d] REQUESTED -> RECEIVED
16:39:01,971 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-4] [169254d] isWritePossible: false                                                                                                         
16:39:01,971 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-4] [169254d] No more items to write                                                                                                         
16:39:01,971 TRACE _.s.h.s.r.WriteResultPublisher [parallel-4] [169254d] UNSUBSCRIBED subscribe: org.springframework.http.server.reactive.AbstractListenerWriteFlushProcessor$State$WriteResultSubscriber@5450655d
16:39:01,971 TRACE _.s.h.s.r.WriteResultPublisher [parallel-4] [169254d] SUBSCRIBING request: 9223372036854775807                                                                                                 
16:39:01,971 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-4] [169254d] Received onComplete                                                                                                       
16:39:01,971 TRACE _.s.h.s.r.WriteResultPublisher [parallel-4] [169254d] UNSUBSCRIBED subscribe: reactor.core.publisher.MonoNext$NextSubscriber@356c084d                                                          
16:39:01,971 TRACE _.s.h.s.r.WriteResultPublisher [parallel-4] [169254d] SUBSCRIBING request: 9223372036854775807                                                                                                 
16:39:21,778 TRACE _.s.h.s.r.AbstractListenerReadPublisher [parallel-3] [169254d] Cancellation
16:39:21,778 TRACE _.s.h.s.r.WriteResultPublisher [parallel-3] [169254d] SUBSCRIBED cancel
16:39:21,779 ERROR c.e.d.f.DemoWebFilter [parallel-3] [169254d] Response timeout after 20002 milliseconds for null request with uri 'http://localhost:8080/post'. Response status code was already committed: '200 OK'.
16:39:21,779 TRACE o.s.h.s.r.ServletHttpHandlerAdapter [parallel-3] [169254d] Handling completed

The log lines in the above trace indicating the "Broken pipe" IOException, which apparently occur near the end of reading a request aren't something I had observed previously. It's interesting to me that the application appears to proceed with attempting to write a response after having observed this error - it makes me wonder whether there is something on the WebFlux side that isn't appropriately handling the error and cancelling the request, but I haven't had time to dig into the details yet.

Please let me know if there is any issue with the way I've pulled in the Tomcat snapshot version here - this is not something I had done before.

Thanks!

Comment From: rstoyanchev

Thanks Mark! I confirm I can no longer reproduce the issue with 9.0-SNAPSHOT. @danielra, could you please also check from your end that this addresses the issue?

Comment From: rstoyanchev

Never mind, I see now you did respond.

The log lines in the above trace indicating the "Broken pipe" IOException, which apparently occur near the end of reading a request aren't something I had observed previously.

I have not reproduced this but, based on this output, it does appear like there is an issue in WebFlux. Given that reading has completed, we can no longer send an error signal on the request body side. Yet writing hasn't started, so there is no write Publisher to cancel either. So we close the AsyncContext and yet the controller remains unaware. This may actually help to explain #26407. I'll see if I can reproduce it more reliably.

Comment From: markt-asf

For now I am going to assume the Tomcat aspects of this have been resolved but feel free to ping me if it starts to look like that might not be the case.

Comment From: rstoyanchev

I was able to reproduce the hanging issue with a "java.io.IOException: Broken pipe" that happens after reading and before writing. I used the tomcat-snapshot-version branch and changed the wrk command to shorten iterations to 5 secs (from 10) and decrease the timeout to 1 sec, and it reproduced easily:

for i in {1..50}; do ../wrk/wrk -t 2 -c 1000 -d 5s --latency --timeout 1s --script ./myPost.lua http://localhost:8080/post; done

I've committed a fix, and with that fix the error is gone. However, I get a new issue with that same setup. This time reading completes, then writing begins 1-3 seconds later, then we get false from isReady to write, and then there are no further notifications from Tomcat until the timeout from DemoWebFilter. Here is example output (with a few comments inserted):

2021-02-26 13:40:33,457 INFO c.e.d.f.DemoWebFilter [http-nio-8080-exec-146] [228ecfd9] Incoming request observed.
2021-02-26 13:40:33,558 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-146] [228ecfd9] UNSUBSCRIBED -> SUBSCRIBING
2021-02-26 13:40:33,558 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-146] [228ecfd9] 9223372036854775807 requested
2021-02-26 13:40:33,558 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-146] [228ecfd9] SUBSCRIBING -> DEMAND
2021-02-26 13:40:33,558 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-146] [228ecfd9] onDataAvailable
2021-02-26 13:40:33,558 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-146] [228ecfd9] DEMAND -> READING
2021-02-26 13:40:33,558 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-146] [228ecfd9] Read 54 bytes
2021-02-26 13:40:33,558 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-146] [228ecfd9] Publishing data read
2021-02-26 13:40:33,558 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-146] [228ecfd9] No more data to read
2021-02-26 13:40:33,558 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-146] [228ecfd9] READING -> DEMAND
2021-02-26 13:40:33,558 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-146] [228ecfd9] onAllDataRead
2021-02-26 13:40:33,558 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-146] [228ecfd9] DEMAND -> COMPLETED

# above reading is done, below writing begins

2021-02-26 13:40:34,650 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-2] [228ecfd9] UNSUBSCRIBED -> REQUESTED
2021-02-26 13:40:34,650 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-2] [228ecfd9] Received onNext publisher
2021-02-26 13:40:34,650 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-2] [228ecfd9] REQUESTED -> RECEIVED
2021-02-26 13:40:34,650 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-2] [228ecfd9] UNSUBSCRIBED -> REQUESTED
2021-02-26 13:40:34,650 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-2] [228ecfd9] Item to write
2021-02-26 13:40:34,650 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-2] [228ecfd9] REQUESTED -> RECEIVED
2021-02-26 13:40:34,650 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-2] [228ecfd9] isWritePossible: false

# above we have a buffer to write but isReady returned false

2021-02-26 13:40:34,650 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-2] [228ecfd9] No more items to write
2021-02-26 13:40:34,650 TRACE _.s.h.s.r.WriteResultPublisher [parallel-2] [228ecfd9] UNSUBSCRIBED subscribe: org.springframework.http.server.reactive.AbstractListenerWriteFlushProcessor$State$WriteResultSubscriber@4850f39b
2021-02-26 13:40:34,650 TRACE _.s.h.s.r.WriteResultPublisher [parallel-2] [228ecfd9] SUBSCRIBING request: 9223372036854775807
2021-02-26 13:40:34,650 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-2] [228ecfd9] Received onComplete
2021-02-26 13:40:34,650 TRACE _.s.h.s.r.WriteResultPublisher [parallel-2] [228ecfd9] UNSUBSCRIBED subscribe: reactor.core.publisher.MonoNext$NextSubscriber@627ec8a0
2021-02-26 13:40:34,650 TRACE _.s.h.s.r.WriteResultPublisher [parallel-2] [228ecfd9] SUBSCRIBING request: 9223372036854775807

# below DemoWebFilter triggers timeout after 10 seconds of inactivity

2021-02-26 13:40:54,823 TRACE _.s.h.s.r.AbstractListenerReadPublisher [parallel-3] [228ecfd9] Cancellation
2021-02-26 13:40:54,823 TRACE _.s.h.s.r.WriteResultPublisher [parallel-3] [228ecfd9] SUBSCRIBED cancel
2021-02-26 13:40:54,823 ERROR c.e.d.f.DemoWebFilter [parallel-3] [228ecfd9] Response timeout after 21366 milliseconds for POST request with uri 'http://localhost:8080/post'. Response status code was already committed: '200 OK'.
2021-02-26 13:40:54,824 TRACE o.s.h.s.r.ServletHttpHandlerAdapter [parallel-3] [228ecfd9] Handling completed
2021-02-26 13:40:54,824 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-21] [228ecfd9] onAllDataRead
2021-02-26 13:40:54,824 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-21] [228ecfd9] Received request to cancel
2021-02-26 13:40:54,824 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-21] [228ecfd9] Received onComplete
2021-02-26 13:40:54,824 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [http-nio-8080-exec-21] [228ecfd9] Cancellation
2021-02-26 13:40:54,824 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [http-nio-8080-exec-21] [228ecfd9] No more items to write

@markt-asf I suspect something remains on the Tomcat side. In order to reproduce the above, please first make sure you're using a Spring Framework 5.3.5 snapshot, so that you don't run into the already-fixed issue. Update the sections below in build.gradle:

repositories {
    mavenCentral()
    maven { url "https://repository.apache.org/content/groups/snapshots" }
    maven { url "https://repo.spring.io/snapshot" }
}

ext['tomcat.version'] = '9.0-SNAPSHOT'
ext['spring-framework.version'] = '5.3.5-SNAPSHOT'

Comment From: markt-asf

Having updated to the new branch and the latest JARs, I am seeing a lot of these in the logs that I didn't before:

[parallel-1] ERROR reactor.core.publisher.Operators - Operator called default onErrorDropped
java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException: 
Error has been observed at the following site(s):
    |_ checkpoint ⇢ Handler com.example.demo.controller.DemoController#post(ServerWebExchange, List) [DispatcherHandler]
Stack trace:
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470)
        at org.apache.tomcat.util.net.NioChannel.write(NioChannel.java:135)
        at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doWrite(NioEndpoint.java:1383)
        at org.apache.tomcat.util.net.SocketWrapperBase.doWrite(SocketWrapperBase.java:766)
        at org.apache.tomcat.util.net.SocketWrapperBase.writeNonBlockingInternal(SocketWrapperBase.java:682)
        at org.apache.tomcat.util.net.SocketWrapperBase.writeNonBlocking(SocketWrapperBase.java:660)
        at org.apache.tomcat.util.net.SocketWrapperBase.write(SocketWrapperBase.java:532)
        at org.apache.coyote.http11.Http11OutputBuffer$SocketOutputBuffer.doWrite(Http11OutputBuffer.java:546)
        at org.apache.coyote.http11.filters.IdentityOutputFilter.doWrite(IdentityOutputFilter.java:73)
        at org.apache.coyote.http11.Http11OutputBuffer.doWrite(Http11OutputBuffer.java:193)
        at org.apache.coyote.Response.doWrite(Response.java:602)
        at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:339)
        at org.apache.catalina.connector.OutputBuffer.appendByteBuffer(OutputBuffer.java:766)
        at org.apache.catalina.connector.OutputBuffer.append(OutputBuffer.java:730)
        at org.apache.catalina.connector.OutputBuffer.writeBytes(OutputBuffer.java:405)
        at org.apache.catalina.connector.OutputBuffer.write(OutputBuffer.java:376)
        at org.apache.catalina.connector.CoyoteOutputStream.write(CoyoteOutputStream.java:105)
        at org.springframework.http.server.reactive.TomcatHttpHandlerAdapter$TomcatServerHttpResponse.writeToOutputStream(TomcatHttpHandlerAdapter.java:241)
        at org.springframework.http.server.reactive.ServletServerHttpResponse$ResponseBodyProcessor.write(ServletServerHttpResponse.java:361)
        at org.springframework.http.server.reactive.ServletServerHttpResponse$ResponseBodyProcessor.write(ServletServerHttpResponse.java:331)
        at org.springframework.http.server.reactive.AbstractListenerWriteProcessor$State$3.onWritePossible(AbstractListenerWriteProcessor.java:374)
        at org.springframework.http.server.reactive.AbstractListenerWriteProcessor.onWritePossible(AbstractListenerWriteProcessor.java:153)
        at org.springframework.http.server.reactive.AbstractListenerWriteProcessor.writeIfPossible(AbstractListenerWriteProcessor.java:293)
        at org.springframework.http.server.reactive.AbstractListenerWriteProcessor.changeStateToReceived(AbstractListenerWriteProcessor.java:272)
        at org.springframework.http.server.reactive.AbstractListenerWriteProcessor.access$400(AbstractListenerWriteProcessor.java:46)
        at org.springframework.http.server.reactive.AbstractListenerWriteProcessor$State$2.onNext(AbstractListenerWriteProcessor.java:353)
        at org.springframework.http.server.reactive.AbstractListenerWriteProcessor.onNext(AbstractListenerWriteProcessor.java:117)
        at reactor.core.publisher.StrictSubscriber.onNext(StrictSubscriber.java:89)
        at reactor.core.publisher.FluxContextWrite$ContextWriteSubscriber.onNext(FluxContextWrite.java:107)
        at reactor.core.publisher.FluxPeekFuseable$PeekFuseableConditionalSubscriber.onNext(FluxPeekFuseable.java:503)
        at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1815)
        at reactor.core.publisher.MonoCallable.subscribe(MonoCallable.java:61)
        at reactor.core.publisher.Mono.subscribe(Mono.java:4046)
        at org.springframework.http.server.reactive.AbstractListenerWriteFlushProcessor$State$2.onNext(AbstractListenerWriteFlushProcessor.java:270)
        at org.springframework.http.server.reactive.AbstractListenerWriteFlushProcessor.onNext(AbstractListenerWriteFlushProcessor.java:103)
        at org.springframework.http.server.reactive.AbstractListenerWriteFlushProcessor.onNext(AbstractListenerWriteFlushProcessor.java:43)
        at reactor.core.publisher.StrictSubscriber.onNext(StrictSubscriber.java:89)
        at reactor.core.publisher.Operators$ScalarSubscription.request(Operators.java:2397)
        at reactor.core.publisher.StrictSubscriber.onSubscribe(StrictSubscriber.java:77)
        at reactor.core.publisher.MonoJust.subscribe(MonoJust.java:54)
        at reactor.core.publisher.Mono.subscribe(Mono.java:4046)
        at org.springframework.http.server.reactive.AbstractListenerServerHttpResponse.lambda$writeAndFlushWithInternal$0(AbstractListenerServerHttpResponse.java:65)
        at reactor.core.publisher.MonoFromPublisher.subscribe(MonoFromPublisher.java:63)
        at reactor.core.publisher.Mono.subscribe(Mono.java:4046)
        at reactor.core.publisher.FluxConcatArray$ConcatArraySubscriber.onComplete(FluxConcatArray.java:208)
        at reactor.core.publisher.FluxConcatArray.subscribe(FluxConcatArray.java:80)
        at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64)
        at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:157)
        at reactor.core.publisher.FluxContextWrite$ContextWriteSubscriber.onNext(FluxContextWrite.java:107)
        at reactor.core.publisher.Operators$ScalarSubscription.request(Operators.java:2397)
        at reactor.core.publisher.FluxContextWrite$ContextWriteSubscriber.request(FluxContextWrite.java:136)
        at reactor.core.publisher.MonoFlatMap$FlatMapMain.onSubscribe(MonoFlatMap.java:110)
        at reactor.core.publisher.FluxContextWrite$ContextWriteSubscriber.onSubscribe(FluxContextWrite.java:101)
        at reactor.core.publisher.MonoJust.subscribe(MonoJust.java:54)
        at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64)
        at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:157)
        at reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber.onNext(FluxSwitchIfEmpty.java:73)
        at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1815)
        at reactor.core.publisher.MonoSingle$SingleSubscriber.onComplete(MonoSingle.java:185)
        at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onComplete(FluxMapFuseable.java:150)
        at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.onComplete(MonoPeekTerminal.java:299)
        at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.onComplete(MonoPeekTerminal.java:299)
        at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
        at reactor.core.publisher.MonoDelayElement$DelayElementSubscriber.lambda$onNext$0(MonoDelayElement.java:125)
        at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
        at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Is this expected? If so, what is the simplest way to suppress these without dropping any other logging?

Comment From: markt-asf

I'm also not seeing the Response timeout after ... log message. I'll try a longer run.

Comment From: rstoyanchev

This looks like Reactor indicating an error that is being dropped because the pipeline was already cancelled. You can override the handling of these with a global static hook, for example in the main method or as early as possible:

Hooks.onErrorDropped(throwable -> {
    // ignore
});
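
For instance, a minimal sketch of registering the hook in a Spring Boot main method (the application class name here is an assumption, not necessarily the reproducer's actual class):

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import reactor.core.publisher.Hooks;

@SpringBootApplication
public class DemoApplication {

    public static void main(String[] args) {
        // Register the hook before the application context starts so it applies to all pipelines.
        Hooks.onErrorDropped(throwable -> {
            // ignore errors dropped after the pipeline was already cancelled
        });
        SpringApplication.run(DemoApplication.class, args);
    }
}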

Comment From: danielra

Thanks for the follow-up. FWIW, I pulled down the new spring-framework snapshot as specified, and was able to reproduce some response timeouts with similar traces to what Rossen posted above within a minute or two of running the wrk command he specified.

One additional note: I do still see IllegalStateException messages like the following in the logs:

2021-02-26 16:51:18,457 ERROR o.a.c.h.Http11NioProtocol [http-nio-8080-exec-12] Error reading request, ignored
java.lang.IllegalStateException: Calling [asyncPostProcess()] is not valid for a request with Async state [ERROR]

I'm curious, Mark, in your investigations so far did you identify the source of those by chance? If they are indeed unrelated to the response timeouts reported here, I wonder if it would make sense to track a separate issue for those. Do you think those are an issue internal to Tomcat? Or more likely an issue with how Tomcat is being interacted with?

Comment From: markt-asf

The IllegalStateException calling asyncPostProcess() is a symptom of the same problem. The sequence is typically:

  • IOException such as client disconnect
  • IOException not handled correctly (this is the root cause and I suspect there are several variations of this)
  • not handling the IOException correctly typically (depending on timing) triggers an IllegalStateException
  • the IllegalStateException means the request does not end cleanly, which in turn typically (again depending on timing) triggers a timeout.

Given that we have, essentially, a single test to reproduce the issues and no way to differentiate between them, I see no benefit in creating a separate issue.

Comment From: markt-asf

I believe I have found the remaining Tomcat issue. I am currently working on a test case and fix.

Comment From: markt-asf

Please re-test with the latest Apache Tomcat 9.0.x-SNAPSHOT

Comment From: danielra

I re-built the demo project against the spring-framework and Tomcat snapshot versions and did one run of:

for i in {1..50}; do ../wrk/wrk -t 2 -c 1000 -d 5s --latency --timeout 1s --script ./myPost.lua http://localhost:8080/post; done

During this run the response timeout issue occurred 11 times, each time with traces similar to this one:

2021-03-03 10:35:03,346 INFO c.e.d.f.DemoWebFilter [http-nio-8080-exec-197] [64d3be7d] Incoming request observed.
2021-03-03 10:35:03,346 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-197] [64d3be7d] UNSUBSCRIBED -> SUBSCRIBING
2021-03-03 10:35:03,346 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-197] [64d3be7d] 9223372036854775807 requested
2021-03-03 10:35:03,346 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-197] [64d3be7d] SUBSCRIBING -> DEMAND
2021-03-03 10:35:03,346 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-197] [64d3be7d] onDataAvailable
2021-03-03 10:35:03,346 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-197] [64d3be7d] DEMAND -> READING
2021-03-03 10:35:03,346 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-197] [64d3be7d] Read 8092 bytes
2021-03-03 10:35:03,346 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-197] [64d3be7d] Publishing data read
2021-03-03 10:35:03,346 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-197] [64d3be7d] Read 8192 bytes
2021-03-03 10:35:03,346 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-197] [64d3be7d] Publishing data read
2021-03-03 10:35:03,346 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-197] [64d3be7d] Read 4067 bytes
2021-03-03 10:35:03,346 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-197] [64d3be7d] Publishing data read
2021-03-03 10:35:03,346 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-197] [64d3be7d] No more data to read
2021-03-03 10:35:03,346 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-197] [64d3be7d] READING -> DEMAND
2021-03-03 10:35:03,346 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-197] [64d3be7d] onAllDataRead
2021-03-03 10:35:03,346 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-197] [64d3be7d] DEMAND -> COMPLETED
2021-03-03 10:35:03,346 INFO c.e.d.c.DemoController [http-nio-8080-exec-197] [64d3be7d] Controller called. Delay: 1. Size: 70253
2021-03-03 10:35:03,855 INFO c.e.d.c.DemoController [parallel-3] [64d3be7d] Mono returned from Controller doOnSuccess called.
2021-03-03 10:35:03,855 INFO c.e.d.c.DemoController [parallel-3] [64d3be7d] Mono returned from Controller doOnTerminate called.
2021-03-03 10:35:03,858 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-3] [64d3be7d] UNSUBSCRIBED -> REQUESTED
2021-03-03 10:35:03,858 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-3] [64d3be7d] Received onNext publisher
2021-03-03 10:35:03,858 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-3] [64d3be7d] REQUESTED -> RECEIVED
2021-03-03 10:35:03,858 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [64d3be7d] UNSUBSCRIBED -> REQUESTED
2021-03-03 10:35:03,858 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [64d3be7d] Item to write
2021-03-03 10:35:03,858 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [64d3be7d] REQUESTED -> RECEIVED
2021-03-03 10:35:03,858 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [64d3be7d] isWritePossible: false
2021-03-03 10:35:03,858 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [64d3be7d] No more items to write
2021-03-03 10:35:03,858 TRACE _.s.h.s.r.WriteResultPublisher [parallel-3] [64d3be7d] UNSUBSCRIBED subscribe: org.springframework.http.server.reactive.AbstractListenerWriteFlushProcessor$State$WriteResultSubscriber@a021407
2021-03-03 10:35:03,858 TRACE _.s.h.s.r.WriteResultPublisher [parallel-3] [64d3be7d] SUBSCRIBING request: 9223372036854775807
2021-03-03 10:35:03,858 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-3] [64d3be7d] Received onComplete
2021-03-03 10:35:03,858 TRACE _.s.h.s.r.WriteResultPublisher [parallel-3] [64d3be7d] UNSUBSCRIBED subscribe: reactor.core.publisher.MonoNext$NextSubscriber@26c528bb
2021-03-03 10:35:03,858 TRACE _.s.h.s.r.WriteResultPublisher [parallel-3] [64d3be7d] SUBSCRIBING request: 9223372036854775807
2021-03-03 10:35:25,432 TRACE _.s.h.s.r.AbstractListenerReadPublisher [parallel-1] [64d3be7d] Cancellation
2021-03-03 10:35:25,432 TRACE _.s.h.s.r.WriteResultPublisher [parallel-1] [64d3be7d] SUBSCRIBED cancel
2021-03-03 10:35:25,432 ERROR c.e.d.f.DemoWebFilter [parallel-1] [64d3be7d] Response timeout after 22086 milliseconds for POST request with uri 'http://localhost:8080/post'. Response status code was already committed: '200 OK'.
2021-03-03 10:35:25,432 TRACE o.s.h.s.r.ServletHttpHandlerAdapter [parallel-1] [64d3be7d] Handling completed
2021-03-03 10:35:25,432 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-83] [64d3be7d] onAllDataRead
2021-03-03 10:35:25,432 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-83] [64d3be7d] Received request to cancel
2021-03-03 10:35:25,432 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-83] [64d3be7d] Received onComplete
2021-03-03 10:35:25,432 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [http-nio-8080-exec-83] [64d3be7d] Cancellation
2021-03-03 10:35:25,432 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [http-nio-8080-exec-83] [64d3be7d] No more items to write

I did not, however, see any occurrences of the IllegalStateException regarding asyncPostProcess.

Comment From: markt-asf

I don't know WebFlux well enough to figure out what is going on there. @rstoyanchev ?

Comment From: danielra

FWIW, to me, the trace seems to be similar to the one Rossen posted some comments / context on above here: https://github.com/spring-projects/spring-framework/issues/26434#issuecomment-786715388 It might be that this is the same underlying issue as we were seeing there.

Comment From: rstoyanchev

Indeed it looks similar. The last I/O we see is an attempt to write while it's not possible:

2021-03-03 10:35:03,858 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [64d3be7d] Item to write
2021-03-03 10:35:03,858 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [64d3be7d] REQUESTED -> RECEIVED
2021-03-03 10:35:03,858 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [64d3be7d] isWritePossible: false
2021-03-03 10:35:03,858 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [64d3be7d] No more items to write

After that, there are no further notifications from the container until the DemoFilter timeout about 22 seconds later.

Comment From: markt-asf

I'm not having much luck reproducing this at the moment. I'll keep trying.

Not having a good mental model of how WebFlux works isn't helping. In the trace am I correct in assuming that the read completes normally with no call to onError? Are there any successful writes before isReady() returns false?

There should have been a callback on onWritePossible() very shortly after the WriteListener was set. When WebFlux comes to write does it remember that it has been signalled it is OK to write or does it always call isReady() to check first?

I am seeing lots of each of the following errors. What do these mean?

2021-03-04 07:47:17.247 ERROR 45619 --- [io-8080-exec-95] o.s.w.s.a.HttpWebHandlerAdapter          : [2f8ce157] Error [java.io.IOException: No buffer space available] for HTTP POST "/post", but ServerHttpResponse already committed (200 OK)

and

2021-03-04 07:47:18.092 ERROR 45619 --- [io-8080-exec-47] o.s.w.s.a.HttpWebHandlerAdapter          : [6c298ad2] Error [java.io.IOException: Protocol wrong type for socket] for HTTP POST "/post", but ServerHttpResponse already committed (200 OK)

Comment From: rstoyanchev

Thanks Mark, let me provide some pointers.

In the trace am I correct in assuming that the read completes normally with no call to onError?

Yes. Here we read until there was no more to read (isReady false), then switched to DEMAND (we have demand from downstream to read more), then onAllDataRead (the container tells us the input is done), and moved to the final COMPLETED state:

2021-03-03 10:35:03,346 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-197] [64d3be7d] No more data to read
2021-03-03 10:35:03,346 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-197] [64d3be7d] READING -> DEMAND
2021-03-03 10:35:03,346 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-197] [64d3be7d] onAllDataRead
2021-03-03 10:35:03,346 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-197] [64d3be7d] DEMAND -> COMPLETED

Are there any successful writes before isReady() returns false?

No. Here there was only one item (buffer) to write but isWritePossible was false:

2021-03-03 10:35:03,858 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [64d3be7d] Item to write
2021-03-03 10:35:03,858 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [64d3be7d] REQUESTED -> RECEIVED
2021-03-03 10:35:03,858 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [64d3be7d] isWritePossible: false

There should have been a callback on onWritePossible() very shortly after the WriteListener was set. When WebFlux comes to write does it remember that it has been signalled it is OK to write or does it always call isReady() to check first?

No, we don't do anything to remember any onWritePossible calls before writing begins. When writing starts, isReady is checked first. Is this an issue?
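
For reference, the Servlet 3.1 non-blocking write contract being discussed looks roughly like this (a generic, simplified sketch with assumed class and field names; it is not the actual WebFlux AbstractListenerWriteProcessor code):

import java.io.IOException;
import java.util.Queue;
import javax.servlet.AsyncContext;
import javax.servlet.ServletOutputStream;
import javax.servlet.WriteListener;

public class SketchWriteListener implements WriteListener {

    private final ServletOutputStream output;
    private final Queue<byte[]> buffers;      // items queued for writing
    private final AsyncContext asyncContext;

    public SketchWriteListener(ServletOutputStream output, Queue<byte[]> buffers, AsyncContext asyncContext) {
        this.output = output;
        this.buffers = buffers;
        this.asyncContext = asyncContext;
    }

    @Override
    public void onWritePossible() throws IOException {
        // isReady() is checked before every write; keep writing while it stays true.
        while (output.isReady() && !buffers.isEmpty()) {
            output.write(buffers.poll());
        }
        if (buffers.isEmpty()) {
            asyncContext.complete();
        }
        // If isReady() returned false, the container is expected to call
        // onWritePossible() again once the socket can accept more data - the
        // notification that never arrives in the traces above.
    }

    @Override
    public void onError(Throwable t) {
        asyncContext.complete();
    }
}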

HttpWebHandlerAdapter : [2f8ce157] Error [java.io.IOException: No buffer space available] HttpWebHandlerAdapter : [6c298ad2] Error [java.io.IOException: Protocol wrong type for socket]

Not familiar with those. JDK network errors perhaps? They're logged in HttpWebHandlerAdapter at the end of request handling, so they could have originated from either the read or the write.

This makes me wonder if you're having trouble seeing the same behavior due to OS or JDK version differences? I'm running the sample with OpenJDK 15 on Ubuntu.

Comment From: markt-asf

Thanks Rossen. This helps a lot.

Are there any successful writes before isReady() returns false?

No. Here there was only one item (buffer) to write but isWritePossible was false:

There are only two reasons I can see why isReady() will return false. The first is if there is data in the output buffer that has yet to be written. That should not be the case if nothing has been written so far. The second is if there is no WriteListener registered. Before AsyncListener.onError() is fired, both the ReadListener and the WriteListener are cleared.

It would be really helpful to understand why isReady() is returning false at this point.

Has anything at all been written to the response at this point? Has it been committed for example? Any possibility of flush() being called? I am currently looking to see if I can find any way that Tomcat could fail to notify the application of a dropped client connection if there has been an incomplete write to the client. I haven't found anything yet.

Is it possible that AsyncListener.onError() has been fired but not processed?

There should have been a callback to onWritePossible() very shortly after the WriteListener was set. When WebFlux comes to write does it remember that it has been signalled it is OK to write or does it always call isReady() to check first?

No it doesn't do anything to remember any onWritePossible calls before writing begins. When it starts writing, it calls isReady to check first. Is this an issue?

I don't think so. Just trying to understand the sequence of interactions with the Servlet API so I know where to focus my attention.

HttpWebHandlerAdapter : [2f8ce157] Error [java.io.IOException: No buffer space available] HttpWebHandlerAdapter : [6c298ad2] Error [java.io.IOException: Protocol wrong type for socket]

Not familiar with those errors. A JDK network error perhaps? It is logged by HttpWebHandlerAdapter, i.e. at the end of request handling. So it could have originated from the read or from the write.

This makes me wonder if you're having trouble seeing the same behavior due to OS or JDK version differences? I'm running the sample with OpenJDK 15 on Ubuntu.

Further research suggests that those errors might be due to hitting various OS limits. Fixing that might make it easier to reproduce the issue.

I have been unable to reproduce this with Java 8 on Ubuntu. I had more success with Java 8 on OSX. There is certainly a timing component to this so hardware, OS and JRE are all likely to be a factor.

Comment From: rstoyanchev

It would be really helpful to understand why isReady() [while reading] is returning false at this point.

My bad. The read is protected with a check to isReady, but in the readFromInputStream() method we will also return null if 0 bytes were read. I think reading in this case simply went through from start to end without pausing.
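
For illustration, the read-side pattern being described looks roughly like this (again a simplified sketch with assumed names, not the actual Spring Framework source):

import java.io.IOException;
import javax.servlet.ReadListener;
import javax.servlet.ServletInputStream;

public class SketchReadListener implements ReadListener {

    private final ServletInputStream input;
    private final byte[] buffer = new byte[8192];

    public SketchReadListener(ServletInputStream input) {
        this.input = input;
    }

    @Override
    public void onDataAvailable() throws IOException {
        // Reads are guarded by isReady(); a 0-byte (or -1) read means there is
        // nothing to publish right now.
        while (input.isReady() && !input.isFinished()) {
            int read = input.read(buffer);
            if (read <= 0) {
                return;
            }
            publish(buffer, read);
        }
    }

    @Override
    public void onAllDataRead() {
        // the container signals the end of the request body (the onAllDataRead lines in the traces)
    }

    @Override
    public void onError(Throwable t) {
        // propagate the error downstream
    }

    private void publish(byte[] data, int length) {
        // placeholder for handing the bytes to the reactive pipeline
    }
}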

Has anything at all been written to the response at this point? Has it been committed for example? Any possibility of flush() being called?

I presume you mean at the point of reading? If so, then I don't think so. In order for DemoController to be invoked, the input has to be read and deserialized from JSON. Only then can the controller return and writing can begin. Furthermore, the controller has a built-in delay, which in this instance was only 1 ms, but even with 0 delay there is a strict before-and-after relationship between reading and writing in a controller written like this.

Is it possible that AsyncListener.onError() has been fired but not processed?

In the current code, we have 3 AsyncListeners: one for the request body, one for the response body, and one for the overall result of request handling. They are registered in that order. Only the last one logs a message, so as long as all 3 are invoked, there should be a log message. Furthermore, for the first two, the immediate code they invoke should log as well. So all in all, I don't think it's possible to miss it.

That said, I have been preparing a couple of commits. One consolidates the AsyncListener invocations so that only one is registered and the rest are delegated to. This should guarantee logging as the very first thing. I've also been further improving logging in this part of the code base, and when I commit those two, I will get some more logging and output with the cases that I come across.

Comment From: markt-asf

Sorry, I meant ServletOutputStream.isReady(). It is the writes I am focusing on. I think the reads are completing normally. With a switch to Java 15 and an increase in load, I am now seeing a small number of failures but still not on every run. Being able to reproduce it means I can make progress, but the small number of failures means that progress is relatively slow.

Comment From: danielra

FWIW, the setup on which I've been fairly reliably reproducing the issue consists of the Amazon Corretto 8 JDK (version 8.282.08.1-linux-x64), on Debian 10, on a ~10 year old Thinkpad X220 "ultra portable" laptop with an Intel i7 processor.

Comment From: markt-asf

After some extensive testing today, I have concluded that there is no further Tomcat issue here. By adding additional logging to Tomcat I can see the following sequence of events (there are minor differences depending on how many non-blocking writes there are before one of them can't be completed):

  • App performs a non-blocking write
  • Not all of the data can be written so some data is placed in Tomcat's internal write buffer and the socket is registered for write with the Poller
  • The client closes the socket
  • The Poller does not detect the socket closure
  • The async timeout fires and cleans up

I suspect that the Poller would detect the closure given enough time.

Tomcat is at the mercy of the JVM and the OS in this case.

Whether there is a JVM bug here (the Poller not reporting a socket closure it should report) I can't say.

Comment From: rstoyanchev

Thanks for all the help looking into this.

On my side I have done some cleanup of the log messages, the handling of notifications via Read/WriteListener and AsyncListener, as well as the logic for cancellation from a filter such as the DemoFilter. None of that solves the issue, but it makes the exception logic and related logging a little cleaner and hopefully less confusing.

Regarding the Poller not firing, I tried the following experiment to confirm. In DemoFilter, I increased the timeout to 120 seconds and added a log message for requests that take more than 20 seconds to complete. In other words, I wanted to see some that take much longer but do eventually complete.
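
A hypothetical sketch of that filter change (class and logger names are assumptions; the real DemoWebFilter in the reproducer project may differ):

import java.time.Duration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.web.server.ServerWebExchange;
import org.springframework.web.server.WebFilter;
import org.springframework.web.server.WebFilterChain;
import reactor.core.publisher.Mono;

public class SlowRequestLoggingFilter implements WebFilter {

    private static final Logger logger = LoggerFactory.getLogger(SlowRequestLoggingFilter.class);

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
        long start = System.currentTimeMillis();
        return chain.filter(exchange)
                .timeout(Duration.ofSeconds(120))   // raised from the original 20-second cancellation
                .doFinally(signal -> {
                    long elapsed = System.currentTimeMillis() - start;
                    if (elapsed > 20_000) {
                        logger.warn("{}Request took {} ms ({})", exchange.getLogPrefix(), elapsed, signal);
                    }
                });
    }
}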

What I got was a number of requests that did manage to complete on their own after a significant amount of time. Here is one that got a SocketTimeoutException after 62 seconds:

2021-03-04 21:17:54,848 INFO c.e.d.f.DemoWebFilter [http-nio-8080-exec-185] [11ea62a0] Incoming request observed.
2021-03-04 21:17:54,849 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-185] [11ea62a0] UNSUBSCRIBED -> SUBSCRIBING
2021-03-04 21:17:54,952 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-185] [11ea62a0] request Long.MAX_VALUE
2021-03-04 21:17:54,970 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-185] [11ea62a0] SUBSCRIBING -> DEMAND
2021-03-04 21:17:54,970 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-185] [11ea62a0] onDataAvailable
2021-03-04 21:17:54,970 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-185] [11ea62a0] DEMAND -> READING
2021-03-04 21:17:54,970 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-185] [11ea62a0] Read 54 bytes
2021-03-04 21:17:54,970 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-185] [11ea62a0] Publishing DefaultDataBuffer
2021-03-04 21:17:54,982 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-185] [11ea62a0] No more to read
2021-03-04 21:17:54,982 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-185] [11ea62a0] READING -> DEMAND
2021-03-04 21:17:54,982 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-185] [11ea62a0] onAllDataRead [DEMAND]
2021-03-04 21:17:54,982 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-185] [11ea62a0] DEMAND -> COMPLETED
2021-03-04 21:17:56,181 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-3] [11ea62a0] UNSUBSCRIBED -> REQUESTED
2021-03-04 21:17:56,181 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-3] [11ea62a0] onNext: "write" Publisher
2021-03-04 21:17:56,181 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-3] [11ea62a0] REQUESTED -> RECEIVED
2021-03-04 21:17:56,181 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [11ea62a0] UNSUBSCRIBED -> REQUESTED
2021-03-04 21:17:56,181 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [11ea62a0] onNext: DefaultDataBuffer
2021-03-04 21:17:56,181 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [11ea62a0] REQUESTED -> RECEIVED
2021-03-04 21:17:56,181 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [11ea62a0] onWritePossible
2021-03-04 21:17:56,181 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [11ea62a0] RECEIVED -> WRITING
2021-03-04 21:17:56,182 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [11ea62a0] Wrote 507669 of 507669 bytes
2021-03-04 21:17:56,182 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [11ea62a0] WRITING -> REQUESTED
2021-03-04 21:17:56,182 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [11ea62a0] onComplete [REQUESTED]
2021-03-04 21:17:56,182 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [11ea62a0] REQUESTED -> RECEIVED
2021-03-04 21:17:56,182 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [11ea62a0] isWritePossible false
2021-03-04 21:17:56,182 TRACE _.s.h.s.r.WriteResultPublisher [parallel-3] [11ea62a0] [WP] got subscriber ResponseBodyFlushProcessor-WriteResultSubscriber
2021-03-04 21:17:56,183 TRACE _.s.h.s.r.WriteResultPublisher [parallel-3] [11ea62a0] [WP] request Long.MAX_VALUE
2021-03-04 21:17:56,183 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-3] [11ea62a0] onComplete [RECEIVED]
2021-03-04 21:17:56,184 TRACE _.s.h.s.r.WriteResultPublisher [parallel-3] [11ea62a0] [WFP] got subscriber reactor.core.publisher.MonoNext$NextSubscriber@3c3160ff
2021-03-04 21:17:56,184 TRACE _.s.h.s.r.WriteResultPublisher [parallel-3] [11ea62a0] [WFP] request Long.MAX_VALUE

# ReadListener fires below after 1 minute but reading is done.

2021-03-04 21:18:56,996 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-49] [11ea62a0] onError: java.net.SocketTimeoutException [COMPLETED]

# WriteListener#onError fires right after, cancels upstream and sends the error downstream

2021-03-04 21:18:56,996 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-49] [11ea62a0] cancel [RECEIVED]
2021-03-04 21:18:56,996 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [http-nio-8080-exec-49] [11ea62a0] cancel [RECEIVED]
2021-03-04 21:18:56,996 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [http-nio-8080-exec-49] [11ea62a0] onError: java.net.SocketTimeoutException [RECEIVED]
2021-03-04 21:18:56,996 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [http-nio-8080-exec-49] [11ea62a0] RECEIVED -> COMPLETED
2021-03-04 21:18:56,996 TRACE _.s.h.s.r.WriteResultPublisher [http-nio-8080-exec-49] [11ea62a0] [WP] failed: java.net.SocketTimeoutException [SUBSCRIBED]
2021-03-04 21:18:56,996 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-49] [11ea62a0] current "write" Publisher failed: java.net.SocketTimeoutException
2021-03-04 21:18:56,996 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-49] [11ea62a0] cancel [RECEIVED]
2021-03-04 21:18:56,996 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-49] [11ea62a0] onError: java.net.SocketTimeoutException [RECEIVED]
2021-03-04 21:18:56,996 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-49] [11ea62a0] RECEIVED -> COMPLETED
2021-03-04 21:18:56,996 TRACE _.s.h.s.r.WriteResultPublisher [http-nio-8080-exec-49] [11ea62a0] [WFP] failed: java.net.SocketTimeoutException [SUBSCRIBED]

# Error signal reaches the end of the chain where we dispatch (the `null` I believe is the exception message)

2021-03-04 21:18:56,997 TRACE o.s.h.s.r.ServletHttpHandlerAdapter [http-nio-8080-exec-49] [11ea62a0] onError: null
2021-03-04 21:18:56,997 TRACE o.s.h.s.r.ServletHttpHandlerAdapter [http-nio-8080-exec-49] [11ea62a0] Dispatch to container, to raise the error on servlet thread
2021-03-04 21:18:57,002 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-49] [11ea62a0] onError: java.net.SocketTimeoutException [COMPLETED]

# AsyncListener fires for the error but all is completed already so nothing to be done.

2021-03-04 21:18:57,002 DEBUG o.s.h.s.r.ServletHttpHandlerAdapter [http-nio-8080-exec-49] [11ea62a0] AsyncEvent onError: java.net.SocketTimeoutException
2021-03-04 21:18:57,002 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-49] [11ea62a0] onError: java.net.SocketTimeoutException [COMPLETED]
2021-03-04 21:18:57,002 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-49] [11ea62a0] cancel [COMPLETED]
2021-03-04 21:18:57,002 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-49] [11ea62a0] onError: java.net.SocketTimeoutException [COMPLETED]

Here is another one that came back after 105 seconds with AsyncListener#onTimeout:

...
2021-03-04 21:17:40,773 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [7de6eb95] isWritePossible false
2021-03-04 21:17:40,773 TRACE _.s.h.s.r.AbstractListenerWriteProcessor [parallel-3] [7de6eb95] onComplete [RECEIVED]
2021-03-04 21:17:40,773 TRACE _.s.h.s.r.WriteResultPublisher [parallel-3] [7de6eb95] [WP] got subscriber ResponseBodyFlushProcessor-WriteResultSubscriber
2021-03-04 21:17:40,773 TRACE _.s.h.s.r.WriteResultPublisher [parallel-3] [7de6eb95] [WP] request Long.MAX_VALUE
2021-03-04 21:17:40,773 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [parallel-3] [7de6eb95] onComplete [RECEIVED]
2021-03-04 21:17:40,773 TRACE _.s.h.s.r.WriteResultPublisher [parallel-3] [7de6eb95] [WFP] got subscriber reactor.core.publisher.MonoNext$NextSubscriber@49addc9f
2021-03-04 21:17:40,773 TRACE _.s.h.s.r.WriteResultPublisher [parallel-3] [7de6eb95] [WFP] request Long.MAX_VALUE

# AsyncListener#onTimeout fires below

2021-03-04 21:19:24,809 DEBUG o.s.h.s.r.ServletHttpHandlerAdapter [http-nio-8080-exec-159] [7de6eb95] AsyncEvent onTimeout
2021-03-04 21:19:24,809 TRACE _.s.h.s.r.AbstractListenerReadPublisher [http-nio-8080-exec-159] [7de6eb95] onError: java.lang.IllegalStateException: Async operation timeout. [COMPLETED]
...

Not sure why these two came in differently, one via Read/WriteListener#onError and the other via AsyncListener#onTimeout; is that expected? Most of the ones I got came in as the latter.

Note there were still others that took longer than 122 seconds and were timed out, but this seems to confirm your theory, Mark, that if you wait long enough they do complete. Perhaps there is a way to control Poller behavior, or it might also be expected under load. Something to investigate.

Also, FWIW, I was able to reproduce the response timeouts with Jetty as well. I'm not sure how much to make of that, but at least it confirms it's not something isolated to Tomcat.

In conclusion, I think that's about all we can do here. What do you think, @danielra?

Comment From: rstoyanchev

To clarify further, what I meant is that after the bugs that we did find and fix, there should not be any remaining concurrency-related issues that can be attributed to WebFlux. Taking a step back, the test scenario is very good at exposing issues, given the large number of connections being closed, and it helped to shake those out. However, in practice that's rarely going to happen at such a rate, and even if it does spike, it should come back down, and those requests do eventually complete. That said, if you do have further feedback or experience around that to share, I would be interested to hear it.

In terms of the NIO behavior, if we learn more, we could provide guidance in our documentation. @markt-asf since you are closer to the details of the underlying issue, would that be something you might be able to formulate as a question, perhaps on net-dev, to see whether it is, as you said, a JVM bug or expected behavior, along with possibly some ways to customize it?

Comment From: danielra

Thanks for all of the help digging in here on the WebFlux and Tomcat layers. It seems like in the examples so far where the request handling did eventually complete on its own (without the DemoFilter timeout cancelling), it was due to timeouts at other layers (a socket timeout, or an async operation timeout in Tomcat if I've interpreted correctly). However it seems pretty clear that in these cases the relevant connection was actually closed by the client long before the timeouts occurred, and I'm a bit surprised there wouldn't be notification from the Poller in a more timely manner in that case - but I'm not familiar with the lower level details there and what guarantees the Poller we are working with actually provides. I wonder if there is any low level configuration that could be changed to improve that situation - I may do some digging there. And I'll be interested to hear if we learn whether this is expected behavior or possibly a bug at deeper layer.

I'm looking forward to pulling in the fixes that have been made here when new versions for spring-framework and Tomcat are released to see what impact they have on the rate at which these occurrences are seen in my real application. Right now I typically see a number of occurrences in the range of hundreds per day (across an application cluster that processes a few billion requests per day). Given that these turned out to be in cases where the client had already closed the connection (typically due to client-side timeouts in my experience in practice), I think the impact is probably relatively minimal with the main concern being a potential resource leak if no timeout does come through in a sufficiently timely manner to cancel and clean things up.

I haven't identified any different-looking traces yet to indicate other potential concurrency / error handling issues at the Tomcat or WebFlux layers - so I'd be fine closing this out. I can open something new in the future if something else comes up with a new and interesting trace.

Thanks again for all the help digging in to understand the issue here and for the improvements made at both layers!

Comment From: markt-asf

I've done some simple local testing with the Poller and it behaves as you would expect. A client disconnect causes the channel to be selected near enough instantly.

I went back to Tomcat and added some more logging. It appears that when things go wrong, the same client connection is accepted twice, creating two SocketChannel instances referring to the same combination of client and server port and IP. That should not be possible, but I have logs that show that is what is happening. If it really is what is happening, it might well explain the issues with the Poller, since the same connection would be registered twice under two separate SocketChannels. I am going to see if I can recreate this in a simpler test case tomorrow.

Comment From: markt-asf

https://bugs.openjdk.java.net/browse/JDK-8263243

Comment From: rstoyanchev

I'm closing this here and will keep an eye on the outcome of the JDK issue.

Comment From: danielra

Just writing to note that I have seen 0 occurrences of these response timeouts in my real application so far after picking up the updated spring-framework + Tomcat versions with the fixes discussed here (running for about 1 week now).

Also, I have seen one new WARN message from Tomcat at a low volume:

Invalid null socket channel while processing poller event

Logged by org.apache.tomcat.util.net.NioEndpoint on a thread named like http-nio-{connector_port}-ClientPoller.

Additionally, I noticed that the JDK ticket linked above was resolved as "not an issue" as it was demonstrated that what was occurring appeared to be a problem at the OS layer independent of the JDK. I am curious if the issue has already been reported to Linux maintainers somewhere that I could follow. (I don't have an account to ask directly on the JDK ticket.)

Comment From: markt-asf

I'm not aware it has been reported to the Linux maintainers yet. I was hoping the JDK team would do that.

I have just reproduced this on Ubuntu using the server.c reproducer provided by the JDK team. I plan to open an Ubuntu bug shortly. I'll link to it here when I have done that.

Comment From: markt-asf

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1924298

Comment From: thomaskiesl

A check was added in NioEndpoint, and the following IOException is thrown when this issue is detected:

2021-12-25 08:38:31,250 ERROR | https-jsse-nio-18443-Acceptor | org.apache.tomcat.util.net.Acceptor | Socket accept failed
java.io.IOException: Duplicate accept detected. This is a known OS bug. Please consider reporting that you are affected: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1924298
   at org.apache.tomcat.util.net.NioEndpoint.serverSocketAccept(NioEndpoint.java:545)
   at org.apache.tomcat.util.net.NioEndpoint.serverSocketAccept(NioEndpoint.java:78)
   at org.apache.tomcat.util.net.Acceptor.run(Acceptor.java:129)
   at java.base/java.lang.Thread.run(Thread.java:833)

My issue, or rather my worries, are the following:

I receive this message ~20 times a day in my production environment (CentOS 7, Java 17). I have been seeing it since the update one week ago. Currently there is only light load on the system, but I am worried that next week (after Christmas) this bug will become a big issue for me.

I do not understand the problem exactly. I have a Vaadin web application (version 21) and the following questions:

  • When exactly does this error occur?
  • Is an error shown in the frontend (e.g. something like a lost connection, a manual refresh of the browser being necessary, ...) or is it only "under the hood" of my Spring application?
  • I assume the bug already occurred before your fix in November. What happened in this situation before you added the check?

Thanks in advance for the information!

Comment From: markt-asf

The error occurs when the operating system reports the same client connection being accepted twice. This typically happens under load but, since the root cause is not known, we have to assume it could happen with any accept.

With the new error handling in place, the front-end should not see any side effects of this. The first accept is handled as normal and the second is simply ignored. This avoids the sort of problems described previously in this issue.
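
For illustration only, here is a simplified sketch of the kind of duplicate-accept guard described above (assumed names; not Tomcat's actual NioEndpoint code, whose detection is more involved):

import java.io.IOException;
import java.net.SocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class DuplicateAcceptGuard {

    private SocketAddress previousRemote;

    public SocketChannel accept(ServerSocketChannel server) throws IOException {
        SocketChannel channel = server.accept();
        if (channel == null) {
            return null; // non-blocking accept with nothing pending
        }
        SocketAddress remote = channel.getRemoteAddress();
        if (remote != null && remote.equals(this.previousRemote)) {
            // Same client IP/port returned by consecutive accept() calls: the OS bug.
            // Real-world detection also has to avoid false positives from legitimate
            // reuse of a recently freed client port.
            channel.close();
            return null;
        }
        this.previousRemote = remote;
        return channel;
    }
}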

Evidence collected so far indicates that the bug dates back at least 3 years and has probably been around longer than that.

If you see this issue on Ubuntu then, as the log message indicates, please update the Ubuntu bug. The more people that report they are affected, the more likely the bug is to get some attention.

If you see this issue on a different OS then:

  • Build and run the pure C test case attached to the Java bug report
  • If you see the error, raise a bug for your operating system
  • It would be helpful if you provided a link to any OS bug created

Comment From: thomaskiesl

Thanks for the explanation!

Bug report for CentOS: https://bugs.centos.org/view.php?id=18383

Comment From: ettavolt

This is probably not the best place to put my ideas, but I'm hesitant to register on the JDK bug system and Launchpad, sorry. ☺

As I see it, the wrk test is trying to hold -c 1000 HTTP over TCP connections from each of the -t 2 threads. Some people run it twice or thrice in parallel, approaching 4000 or 6000 open TCP connections on localhost.

Now we have net.ipv4.ip_local_port_range which, on my systemd@5.16.2-arch1-1, is 32768 60999, i.e. 28232 ports available for client connections.

I don't see it as surprising that when the kernel has 6000 of the 28232 ports consumed, it sometimes decides to reuse a recently freed port number. If it chooses a free port number uniformly at random from the remaining 28232 - 6000 = 22232 ports, the phenomenon should occur about once per 22232 requests made, which isn't too rare.

P.S. @markt-asf, I came to this thread after my colleague found that when configured to listen on a UDS, Tomcat rejects all but the first curl request via it, because UDSA#equals compares the path field, which, obviously, never changes.

Shrinking net.ipv4.ip_local_port_range (probably in a container to avoid browser/services noise) might be a good way to simplify testing Tomcat (or the JDK) under conditions where a port number is reused for a subsequent request.

Comment From: markt-asf

That might be a way to trigger a false positive but it misses the point. When the real issue occurs you get:

  • the two sockets returned from serversocket.accept() have the same client IP/port
  • the two sockets are returned from consecutive calls to accept()
  • the two sockets have different (consecutive) file descriptors
  • you can write to both sockets successfully

This isn't a clash with a recently freed port.

Comment From: illidan80

Getting this error constantly on Alpine; it started right after we migrated from Spring 2.3.7 to 2.6.2.

Comment From: MattWeiler

We found that this issue occurred for our projects when we upgraded from Spring Boot version 1.5.9-RELEASE to 2.6.2.

Does anyone know what the most recent version of Spring Boot is that is not affected by this?

Comment From: nesrinyesilkaya

The latest Spring Boot version (2.6.3) is still using Tomcat 9.0.56. If you upgrade the Tomcat version externally, the problem seems to be fixed. In your build.gradle file, add the following line:

ext['tomcat.version'] = '9.0.58'

Update: the Tomcat upgrade is not working; after a while, the same error occurred again.

Comment From: MattWeiler

I just tried Spring Boot 2.3.7.RELEASE and the same issue arose during load testing. It's becoming a hard sell to the business to let us try a new build with the latest version of Tomcat.

Does anyone actually know 100% what the issue is? As it stands, it seems that each team (apache/tomcat, Spring, Linux/Ubuntu) is passing the buck.

But if the above comment by nesrinyesilkaya is correct and upgrading to the latest version of Tomcat fixes the issue, it doesn't sound like an OS issue (unless the latest Tomcat code has made changes to not use certain OS APIs).

Comment From: markt-asf

To summarise the current status...

The original issue had multiple root causes. Bugs were fixed in Spring and Tomcat. There was also an OS (Linux) bug identified that remains unfixed.

If you see this issue reported in the Tomcat logs then the best thing you can do is:

  • Use the test case described in the Ubuntu bug report to confirm that you are seeing the Linux bug
  • Report it via your support channel for the OS and push for a fix

As of Tomcat 10.1.0-M8, 10.0.14, 9.0.56 and 8.5.74 Tomcat will both log a warning if the issue is detected and take steps to ensure that the bug does not impact the running of the application.

The Tomcat team is aware of some scenarios that will falsely trigger the detection of this bug. The current Tomcat release round (10.1.0-M11, 10.0.17, 9.0.59 and 8.5.76) will include improved detection that avoids those false positives. Hence it is important to confirm you are seeing this issue with the test case attached to the Ubuntu bug report before reporting it to your OS vendor.

It is possible that other issues will have similar symptoms. As usual, test with the latest stable releases of Spring and Tomcat and, if you still see the problem, open an issue and provide a test case that demonstrates it and it will get investigated.

Comment From: rui-zhou-phx

With Tomcat 9.0.58 I still saw this error.

Comment From: btpnlsl

I believe the OS bug is fixed starting in kernel 5.10 (specifically, as of 5.10-rc6). The bug reproduces in kernel 5.10-rc4 but not in 5.10-rc6.

I'm fairly sure that the bug fix comes from this commit, which describes a race condition in the Linux TCP stack where two duplicate sockets will be created in the established connections hashtable.

Comment From: Parashuraam

May I know what the solution is?

Comment From: brendsanchez

With Spring Boot 2.5.8 and Tomcat 9.0.59 it works for me.

Comment From: llody55

I also encountered this problem on Ubuntu 22.04.

java.io.IOException: Duplicate accept detected. This is a known OS bug. Please consider reporting that you are affected: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1924298
    at org.apache.tomcat.util.net.NioEndpoint.serverSocketAccept(NioEndpoint.java:545) ~[tomcat-embed-core-9.0.56.jar!/:na]
    at org.apache.tomcat.util.net.NioEndpoint.serverSocketAccept(NioEndpoint.java:78) ~[tomcat-embed-core-9.0.56.jar!/:na]
    at org.apache.tomcat.util.net.Acceptor.run(Acceptor.java:129) ~[tomcat-embed-core-9.0.56.jar!/:na]
    at java.base/java.lang.Thread.run(Thread.java:834) ~[na:na]
root@k8s-master1:~# uname -a
Linux k8s-master1 5.15.0-125-generic #135-Ubuntu SMP Fri Sep 27 13:56:10 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

root@k8s-master1:~# cat /etc/os-release 
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy