There are a couple cases where using WebSockets with WebFlux on Tomcat can leave connections in a CLOSE_WAIT state after closing the websocket session. These connections stick around, and will eventually cause tomcat to reach its connection limit (if set). This prevents tomcat from accepting new connections, and thus leads to the server becoming unresponsive (except for previously established connections)
When running the same test cases with WebFlux on Netty or Undertow, the connections are closed properly.
I have provided an example project (ws-close-waiting.zip) that shows the cases where the connection gets stuck in CLOSE_WAIT on tomcat after the websocket session is closed.
The project has three websocket endpoints, each showing a different case (only 2 cases fail). In each case, the server will close the websocket session (but in different ways) after receiving a message from the client.
1. /closeZip
- Calls session.close(...)
while processing the input stream. The input/output stream are merged with the zip
operator. This case leaves the connection in CLOSE_WAIT on tomcat.
2. /closeZipDelayError
- Calls session.close(...)
while processing the input stream. The input/output stream are merged with the zipDelayError
operator. This case properly closes the connection. I included this case for comparison with the first case. I'm not sure what the downsides of using zipDelayError
would be though. Advice appreciated.
3. /exceptionZipDelayError
- Propagates an exception on the input stream, but handles that exception with onErrorResume
by calling session.close(...)
. The input/output streams are merged with the zipDelayError
operator. This case leaves the connection in CLOSE_WAIT on tomcat. I included this case to show that the zipWithError
operator will "fix" some cases (2), but not every case.
I have enabled the following logging:
logging.level.org.springframework.http.server.reactive=debug
logging.level._org.springframework.http.server.reactive.AbstractListenerReadPublisher=trace
logging.level._org.springframework.http.server.reactive.AbstractListenerWriteProcessor=trace
logging.level._org.springframework.http.server.reactive.AbstractListenerWriteFlushProcessor=trace
logging.level._org.springframework.http.server.reactive=trace
logging.level.reactor.netty=debug
logging.level.org.apache.tomcat.websocket=debug
In the failing cases (1 and 3), the read publisher logs a cancel message, and I see the following log lines:
2023-04-28T13:48:29.358+02:00 TRACE 227341 --- [nio-8080-exec-4] _.s.h.s.r.AbstractListenerReadPublisher : [37936546] cancel [READING]
2023-04-28T13:48:29.358+02:00 TRACE 227341 --- [nio-8080-exec-4] _.s.h.s.r.AbstractListenerReadPublisher : [37936546] READING -> COMPLETED
In the successful case (2), the read publisher does not log a cancel message. I think the cancelling is the underlying problem. It prevents the server from noticing that the client has closed the connection.
To test each use case, I used netstat to observe connections, and websocat as the websocket client. Specifically...
I started netstat in a loop to observe connections every second...
while true ; do clear; date; sudo netstat -pn | grep 8080; sleep 1; done
Then I used websocat in another terminal as follows:
- connect to one of the three websocket endpoints...
e.g.
websocat -v -v ws://localhost:8080/closeZip
(orcloseZipNoDelay
orexceptionZipNoDelay
) netstat will show something like...Fri Apr 28 01:59:55 PM CEST 2023 tcp 0 0 127.0.0.1:57316 127.0.0.1:8080 ESTABLISHED 232014/./websocat tcp6 0 0 127.0.0.1:8080 127.0.0.1:57316 ESTABLISHED 231835/java
- type something on the websocat console and press enter. websocat will send what you typed as a text websocket message, and leave the connection open. netstat output remains unchanged
- press CTRL-D on the websocat console to end the input stream. websocat will exit.
For the successful cases, the connections will disappear from netstat.
For the failure cases, netstat will show something like...
Fri Apr 28 02:01:36 PM CEST 2023 tcp 0 0 127.0.0.1:57316 127.0.0.1:8080 FIN_WAIT2 - tcp6 7 0 127.0.0.1:8080 127.0.0.1:57316 CLOSE_WAIT 231835/java
Eventually the old client side connection (the one in FIN_WAIT2) will go away. But the server connection (the one in CLOSE_WAIT) will remain until the server is shutdown.
Again, when running WebFlux on Netty or Undertow, the connections always go away in all three cases.
Comment From: sdeleuze
That seems to be a pretty low level issue, have you identified anything that let you think this issue is on Spring side and not on Tomcat side?
Comment From: philsttr
Just that Spring's AbstractListenerReadPublisher
is cancelling the stream in the failure cases. But it does not cancel in the success case.
Specifically this
In the failing cases (1 and 3), the read publisher logs a cancel message, and I see the following log lines:
2023-04-28T13:48:29.358+02:00 TRACE 227341 --- [nio-8080-exec-4] _.s.h.s.r.AbstractListenerReadPublisher : [37936546] cancel [READING] 2023-04-28T13:48:29.358+02:00 TRACE 227341 --- [nio-8080-exec-4] _.s.h.s.r.AbstractListenerReadPublisher : [37936546] READING -> COMPLETED
In the successful case (2), the read publisher does not log a cancel message. I think the cancelling is the underlying problem. It prevents the server from noticing that the client has closed the connection.
TBH, I'm not an expert at the interaction between Spring and Tomcat. I'd love if someone more familiar with it could take a look at the example project and the reproduction steps I provided.
Comment From: rstoyanchev
This is related to the resuming and suspending of the Tomcat WebSocket session, and not reading the final close frame (I think) if the session is closed while in the suspended state.
The session is closed while handling an incoming message. Tomcat calls onClose
and AbstractListenerWebSocketSession#handleClose
completes the reading and the writing sides. The reading side actually defers transitioning to COMPLETED
temporarily because it's still in the READING
state (and on the same thread). However, as the zip
operator is used to combine the reading and writing, when the writing processor completes, the zip
operator then cancels reading too, and currently the handling of cancel transitions the read publisher to COMPLETED
immediately. As a result when the callstack unwinds back to the reading loop, we just exit without transitioning to the DEMAND state before completing, and so the session remains suspended, and Tomcat never reads any more frames. By comparison, in the success scenario I see the following additional lines at the end:
[nio-8080-exec-2] o.a.t.websocket.server.WsFrameServer : wsFrameServer.onDataAvailable
[nio-8080-exec-2] o.a.t.websocket.server.WsFrameServer : Read [6] bytes into input buffer ready for processing
[nio-8080-exec-2] o.a.t.websocket.server.WsFrameServer : WebSocket frame received. fin [true], rsv [0], OpCode [8], payload length [0]
I can make changes so that cancel handling is similar to completion handling when in the READING
state where completion is temporarily delayed until we exit the reading loop. This was was originally done in #26834 but did not cover a cancel signal.
That said I am also wondering if Tomcat should be handling this differently, i.e. when the session is closed should the session be resumed to allow the closing to be completed. @markt-asf any thoughts on that?
Comment From: philsttr
Thanks as always @rstoyanchev !
Comment From: markt-asf
Late to the party but I'll be looking at this today on the Tomcat side.
I was on the fence as to whether whatever called suspend()
should be expected to call resume()
if it is going to close the connection. It could be argued either way.
Regardless, it looks like Tomcat needs to (better) handle the case where the WebSocket session is closed by the server but the client confirmation of the close is never received. That is what I am planning on looking at.