Hi,
I've run into an issue after upgrading to Spring Boot 3.2.x.
When running a @SpringBootTest with a regular non-mock servlet-based web context, shutting down the application context after the tests complete can take ~10s, while on 3.1.x or on Ubuntu 22.04 it takes only about 2s.
Steps to reproduce:
- Clone this repository: https://github.com/lared/slow-web-server-shutdown-sonoma (basically a raw Spring Initializr project with the web starter and some logging sprinkled all over)
- Run ./gradlew --rerun-tasks --console=plain test
- Observe that shutdown of the web server takes an awful lot of time
Extract from logs:
DemoApplicationTests > contextLoads() STANDARD_OUT
2024-02-23T00:06:55.525+01:00 DEBUG 76050 --- [ Test worker] org.springframework.test.context.cache : Spring test ApplicationContext cache statistics: [DefaultContextCache@73bb573d size = 1, maxSize = 32, parentContextCount = 0, hitCount = 4, missCount = 1, failureCount = 0]
2024-02-23T00:06:55.527+01:00 DEBUG 76050 --- [ Test worker] org.springframework.test.context.cache : Spring test ApplicationContext cache statistics: [DefaultContextCache@73bb573d size = 1, maxSize = 32, parentContextCount = 0, hitCount = 5, missCount = 1, failureCount = 0]
2024-02-23T00:06:55.528+01:00 DEBUG 76050 --- [ Test worker] org.springframework.test.context.cache : Spring test ApplicationContext cache statistics: [DefaultContextCache@73bb573d size = 1, maxSize = 32, parentContextCount = 0, hitCount = 6, missCount = 1, failureCount = 0]
2024-02-23T00:06:55.528+01:00 DEBUG 76050 --- [ Test worker] org.springframework.test.context.cache : Spring test ApplicationContext cache statistics: [DefaultContextCache@73bb573d size = 1, maxSize = 32, parentContextCount = 0, hitCount = 7, missCount = 1, failureCount = 0]
2024-02-23T00:06:55.528+01:00 DEBUG 76050 --- [ Test worker] org.springframework.test.context.cache : Spring test ApplicationContext cache statistics: [DefaultContextCache@73bb573d size = 1, maxSize = 32, parentContextCount = 0, hitCount = 8, missCount = 1, failureCount = 0]
2024-02-23T00:06:55.529+01:00 DEBUG 76050 --- [ Test worker] org.springframework.test.context.cache : Spring test ApplicationContext cache statistics: [DefaultContextCache@73bb573d size = 1, maxSize = 32, parentContextCount = 0, hitCount = 9, missCount = 1, failureCount = 0]
2024-02-23T00:06:55.529+01:00 DEBUG 76050 --- [ Test worker] org.springframework.test.context.cache : Spring test ApplicationContext cache statistics: [DefaultContextCache@73bb573d size = 1, maxSize = 32, parentContextCount = 0, hitCount = 10, missCount = 1, failureCount = 0]
2024-02-23T00:06:55.529+01:00 DEBUG 76050 --- [ Test worker] .c.s.DirtiesContextTestExecutionListener : After test method: class [DemoApplicationTests], method [contextLoads], class annotated with @DirtiesContext [false] with mode [null], method annotated with @DirtiesContext [false] with mode [null]
DemoApplicationTests STANDARD_OUT
2024-02-23T00:06:55.533+01:00 DEBUG 76050 --- [ Test worker] org.springframework.test.context.cache : Spring test ApplicationContext cache statistics: [DefaultContextCache@73bb573d size = 1, maxSize = 32, parentContextCount = 0, hitCount = 11, missCount = 1, failureCount = 0]
2024-02-23T00:06:55.533+01:00 DEBUG 76050 --- [ Test worker] .c.s.DirtiesContextTestExecutionListener : After test class: class [DemoApplicationTests], class annotated with @DirtiesContext [false] with mode [null]
2024-02-23T00:07:07.554+01:00 DEBUG 76050 --- [ionShutdownHook] o.s.c.support.DefaultLifecycleProcessor : Bean 'webServerStartStop' completed its stop procedure
2024-02-23T00:07:07.554+01:00 DEBUG 76050 --- [ionShutdownHook] o.s.c.support.DefaultLifecycleProcessor : Stopping beans in phase -2147483647
2024-02-23T00:07:07.555+01:00 DEBUG 76050 --- [ionShutdownHook] o.s.c.support.DefaultLifecycleProcessor : Bean 'springBootLoggingLifecycle' completed its stop procedure
2024-02-23T00:07:07.555+01:00 DEBUG 76050 --- [ionShutdownHook] o.s.s.concurrent.ThreadPoolTaskExecutor : Shutting down ExecutorService 'applicationTaskExecutor'
BUILD SUCCESSFUL in 15s
4 actionable tasks: 4 executed
Please note the roughly 12s gap between test completion and webServerStartStop completing its stop procedure. If the context is refreshed multiple times due to different test context configurations, this delay compounds; on one of the projects it can stall like that for a full minute.
The issue is not reproducible on 3.1.x; I've tested on macOS Sonoma 14.3.1 and Ubuntu 22.04.
Comment From: wilkinsona
I can't reproduce this on macOS 13.6.4.
@lared, if the web server is taking a long time to stop, it may be a Tomcat problem. You could try enabling debug logging for Tomcat (org.apache.tomcat, org.apache.catalina, and org.apache.coyote) to see if you can identify the point where the delay is occurring. If Tomcat's logging doesn't reveal anything, you could also try taking some thread dumps of the test worker JVM during the 12-second delay to see where the process is stuck.
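Something along these lines (a sketch only; the class name is made up) would raise those loggers to DEBUG for a single test class via @SpringBootTest's properties attribute; the same logging.level keys can also go in a test properties file:

import org.junit.jupiter.api.Test;
import org.springframework.boot.test.context.SpringBootTest;

// Sketch: raise Tomcat's loggers to DEBUG so the shutdown sequence
// (connector pause, acceptor unlock, endpoint stop) shows up in the test output.
@SpringBootTest(
        webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT,
        properties = {
                "logging.level.org.apache.tomcat=DEBUG",
                "logging.level.org.apache.catalina=DEBUG",
                "logging.level.org.apache.coyote=DEBUG"
        })
class TomcatShutdownLoggingTests {

    @Test
    void contextLoads() {
        // The interesting logging appears when the cached context is closed
        // at the end of the test run, not inside the test method itself.
    }
}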
Comment From: lared
You're right.
When the server shuts down, it stops all of its endpoints. In tomcat-embed-core 10.1.18, in org.apache.tomcat.util.net.NioEndpoint's stopInternal method, there's a lovely hardcoded wait of up to 10s for the acceptor to shut down:
@Override
public void stopInternal() {
    if (!paused) {
        pause();
    }
    if (running) {
        running = false;
        acceptor.stop(10);
        [...]
    }
}
Inside org.apache.tomcat.util.net.Acceptor, stop sets a flag and waits for the acceptor loop to shut down, but the loop reliably fails to do so until the wait period elapses:
public void stop(int waitSeconds) {
    stopCalled = true;
    if (waitSeconds > 0) {
        try {
            if (!stopLatch.await(waitSeconds, TimeUnit.SECONDS)) {
                log.warn(sm.getString("acceptor.stop.fail", getThreadName()));
            }
        } catch (InterruptedException e) {
            log.warn(sm.getString("acceptor.stop.interrupted", getThreadName()), e);
        }
    }
}
This explains perfectly why it's almost always a 12s wait where it used to be roughly 2s. I'll take it to the Tomcat team and see what they have to say about it. @wilkinsona thanks for pointing me in the right direction!
Comment From: wilkinsona
If the discussion with the Tomcat team happens somewhere that's linkable, could you please add a link to it here so that others can follow along?
Comment From: lared
@wilkinsona it's on the mailing lists so I'll spare you the pain for now, but I can explain what was going on.
This is absolutely unrelated to Spring in any way, shape, or form; it's purely a core Tomcat thing and how it (IMO incorrectly) treats machines with multiple NICs.
What went wrong
When an endpoint gets shut down, all of the threads that are part of the connector have to be shut down, in particular the Acceptor thread.
The Acceptor thread is close to guaranteed to be blocked in the accept() method. The only reliable and non-destructive way of waking it up is to make a connection to it.
This is done in Tomcat here: https://github.com/apache/tomcat/blob/10.1.x/java/org/apache/tomcat/util/net/AbstractEndpoint.java#L1394
This is relatively simple: you take the address you are listening on and connect to that socket, sending it a throwaway OPTIONS request. If this doesn't work for whatever reason, you get stuck on the latch I quoted above until the 10s wait elapses.
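To illustrate the general trick with a self-contained sketch (this is not Tomcat's actual code; names and timeouts are made up): a thread blocked in accept() can be released by opening a short-lived client connection to the address the server socket is really bound to.

import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Sketch of the "unlock by connecting" idea: the acceptor thread blocks in
// accept(); a throwaway client connection to the bound address wakes it up so
// it can observe the stop flag and exit.
public class UnlockAcceptSketch {

    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
        Thread acceptor = new Thread(() -> {
            try (Socket accepted = server.accept()) {
                // Tomcat would hand the accepted socket to a poller; here we just return.
            } catch (Exception e) {
                // Socket closed during shutdown.
            }
        }, "acceptor");
        acceptor.start();

        // The "unlock": connect to the address we are actually bound to (loopback here).
        try (Socket unlock = new Socket()) {
            unlock.connect(new InetSocketAddress(InetAddress.getLoopbackAddress(),
                    server.getLocalPort()), 250); // short connect timeout
        }

        acceptor.join(1000); // returns promptly because accept() was woken up
        server.close();
    }
}

If the unlock connection targets an address that is not actually reachable, accept() never returns and the stop latch times out instead, which is exactly the 10s delay above.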
What I noticed is that in my case the unlock request was timing out (it was kind of difficult to spot because you need a very detailed logging level to see this):
Feb 25, 2024 7:13:10 PM org.apache.tomcat.util.net.AbstractEndpoint unlockAccept
FINE: Caught exception trying to unlock accept on port [0]
java.net.SocketTimeoutException: Connect timed out
at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:546)
at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
at java.base/java.net.Socket.connect(Socket.java:633)
at org.apache.tomcat.util.net.AbstractEndpoint.unlockAccept(AbstractEndpoint.java:1124)
at org.apache.tomcat.util.net.NioEndpoint.unlockAccept(NioEndpoint.java:390)
at org.apache.tomcat.util.net.AbstractEndpoint.pause(AbstractEndpoint.java:1394)
at org.apache.coyote.AbstractProtocol.pause(AbstractProtocol.java:678)
at org.apache.catalina.connector.Connector.pause(Connector.java:963)
at org.apache.catalina.core.StandardService.stopInternal(StandardService.java:484)
at org.apache.catalina.util.LifecycleBase.stop(LifecycleBase.java:242)
at org.apache.catalina.core.StandardServer.stopInternal(StandardServer.java:974)
at org.apache.catalina.util.LifecycleBase.stop(LifecycleBase.java:242)
at org.apache.catalina.startup.Tomcat.stop(Tomcat.java:447)
at SlowStop.main(SlowStop.java:14)
Why did this happen?
For reasons unknown to me, instead of connecting to the actual address (loopback in the case of any tests), Tomcat looks through all the network interfaces and picks the first address that meets some criteria:
https://github.com/apache/tomcat/blob/10.1.x/java/org/apache/tomcat/util/net/AbstractEndpoint.java#L1164
Feb 25, 2024 7:13:08 PM org.apache.tomcat.util.net.AbstractEndpoint unlockAccept
FINER: About to unlock socket for:/[fc00:f853:ccd:e793:0:0:0:1%br-e26d1e697a66]:37591
In my scenario, due to iteration order, the address picked belongs to a bridge network I had defined in Docker. It has no routes defined to it, so connection attempts time out.
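To make the selection behavior concrete, here's a simplified stand-alone sketch (not Tomcat's exact algorithm) that mimics a "first interface that looks usable" pick; on a machine with a Docker bridge network it can easily print an address that is unreachable from the host:

import java.net.InetAddress;
import java.net.NetworkInterface;
import java.util.Collections;

// Simplified illustration: iterate the NICs and take the first non-loopback,
// non-link-local address. The iteration order is platform-dependent, which is
// why a Docker bridge network can end up being chosen.
public class FirstNicAddress {

    public static void main(String[] args) throws Exception {
        for (NetworkInterface nic : Collections.list(NetworkInterface.getNetworkInterfaces())) {
            if (!nic.isUp() || nic.isLoopback()) {
                continue;
            }
            for (InetAddress address : Collections.list(nic.getInetAddresses())) {
                if (!address.isLinkLocalAddress()) {
                    System.out.println("Would unlock via " + nic.getName() + " -> " + address);
                    return; // first match wins
                }
            }
        }
        System.out.println("No candidate address found");
    }
}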
How to know if you are affected?
I created a minimal, reproducible (if you are affected) example here: https://github.com/lared/tomcat-acceptor-not-stopping-cleanly
All you need to do is clone it and run ./no_gradle.sh. If you see that the unlock socket address is not localhost, you are affected.
What is the most likely issue?
In my scenario, it was an unused Docker network with no routing defined. One docker network rm made the problem go away. If you actually need such networks, well, good luck.
Comment From: wilkinsona
Thanks very much, @lared.
I'm guessing your mailing list thread is this one. @markt-asf can this be improved in Tomcat or is there anything we can do in Boot to eliminate the unwanted delay?
Comment From: markt-asf
Most of this comes down to user configuration. Ideally, don't change the default shutdown address. If there is a valid reason for changing it, use a specific IP address rather than 0.0.0.0 or ::. If the user does use one of the "any local address" options then Tomcat has to try and guess a valid IP address it can connect to and while we can probably improve the way that works, I suspect there are always going to be some network configurations that cause confusion.
Ignore the above. I'm mixing up shutdown and the Acceptor. Insufficient caffeine.
If the Connector is bound to 0.0.0.0 or :: then Tomcat has to try and guess an IP address to connect to in order to unlock the Acceptor.
We'll have to have a look and see what we can do to improve the selection algorithm when the connector is bound to 0.0.0.0.
One configuration option that should improve things is to use a specific address setting in the Connector.
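For an embedded setup like the SlowStop reproducer in the stack trace above, that looks roughly like the following sketch (ports and names are illustrative only):

import org.apache.catalina.connector.Connector;
import org.apache.catalina.startup.Tomcat;

// Rough sketch: bind the Connector to a specific address so the unlock
// connection made during stop() targets an address that is definitely
// reachable, instead of Tomcat guessing one for a wildcard-bound endpoint.
public class LoopbackConnectorExample {

    public static void main(String[] args) throws Exception {
        Tomcat tomcat = new Tomcat();
        tomcat.setBaseDir(System.getProperty("java.io.tmpdir"));

        Connector connector = new Connector();
        connector.setPort(0); // any free port
        connector.setProperty("address", "127.0.0.1"); // avoid 0.0.0.0 / ::
        tomcat.getService().addConnector(connector);

        tomcat.getHost(); // create the default engine/host
        tomcat.start();
        tomcat.stop();    // unlockAccept now connects to 127.0.0.1 rather than guessing
        tomcat.destroy();
    }
}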
Comment From: quaff
don't change the default shutdown address.
FYI, I encountered this with vanilla apache-tomcat-9.0.86.tar.gz.
Comment From: quaff
@lared You can work around it by adding -Djava.net.preferIPv4Stack=true to JVM args.
Comment From: wilkinsona
Thanks, Mark. We're just leaning on Tomcat's defaults here, so an improvement to the selection algorithm would certainly be welcome.
For those following this issue and looking to work around Tomcat's default behavior, you can set the server.address property. This will map down to setting the address on the connector's endpoint and stop it from using the wildcard address.
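In a test, that could look something like the sketch below (the property can equally live in src/test/resources/application.properties or be set via an activated profile):

import org.junit.jupiter.api.Test;
import org.springframework.boot.test.context.SpringBootTest;

// Sketch of the workaround: bind the embedded server to loopback for tests so
// the connector is not bound to the wildcard address and Tomcat never has to
// guess which interface to use when unlocking the Acceptor during shutdown.
@SpringBootTest(
        webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT,
        properties = "server.address=127.0.0.1")
class LoopbackBoundServerTests {

    @Test
    void contextLoads() {
        // The embedded Tomcat connector is bound to 127.0.0.1 only.
    }
}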
Comment From: lared
Thank you all for the help! Overriding the IP address helped immediately.
On Spring Boot's side, would it not make sense to default server.address, management.server.address, and similar properties to 127.0.0.1 in the context of tests with a non-mock web environment? That would be a breaking change, but I think for most people it would be the expected behavior, regardless of the shutdown address selection algorithm or the web server being used. I don't think there are that many people running external services against those test instances, but you know the product best.
Comment From: wilkinsona
We try to align the behavior of @SpringBootTest as closely as possible with running an application's main method. With this in mind, I don't think we should use a different server.address or management.server.address for tests. Furthermore, this problem isn't test-specific: the same 10s delay may occur when shutting down a service that has been run via its main method. That too may be unwanted, so I think it's better to fix this at the source.
Comment From: markt-asf
FYI, the next round of Tomcat releases will include some changes that should improve things here.
I haven't made any changes to the algorithm that identifies the address to connect to, as I have been unable to replicate the described issue. I don't want to make changes to this without being able to test them.
What I have been able to do is significantly reduce the timeout waiting for the Acceptor to unlock. Rather than 10s, it is now 100ms + twice the unlock timeout (which defaults to 250ms), i.e. 600ms with the defaults. That should eliminate the delay if the unlock fails.
Comment From: wilkinsona
Thanks very much, @markt-asf.