Spring Boot Liveness/Readiness probes fail while using graceful shutdown

Hi! While investigating Spring Boot's graceful shutdown feature together with the liveness/readiness endpoints (to be called later by Kubernetes probes), I discovered that these endpoints become unreachable as soon as shutdown is initiated by a SIGTERM. This would cause the Kubernetes liveness probe to fail and may lead to an unclean shutdown. As the purpose of graceful shutdown is the opposite, I filed this report.

I created a sample Spring Boot project on GitHub to simplify issue testing. The service exposes a controller that, when invoked, will sleep for the desired amount of time.

Please see instructions to reproduce the issue below. Many thanks for your support, best regards

Paolo

Environment:

OS: Windows 10 Pro
JDK: openjdk 11.0.7 2020-04-14 LTS
Spring Boot starter parent: 2.4.1
Servlet Engine: Apache Tomcat/9.0.41

How to reproduce:

Please check out the sample project https://github.com/paoven/graceful-shutdown
Run the project and call the endpoint (e.g. curl -H "Content-Type:application/json" -H "Accept:application/json" -XPOST -d "1" http://localhost:8080/wait?waitMs=20000)
Take note of the process PID and initiate a graceful shutdown by issuing a SIGTERM signal within 20 seconds (e.g. kill -SIGTERM [PID])
The Spring Boot service logs the graceful shutdown, and the request is fulfilled because it takes less than the configured graceful shutdown timeout (30s). The problem is that, as soon as you issue the SIGTERM, the Spring Boot actuator health endpoints become unreachable (both the liveness and readiness groups). External systems relying on those endpoints for availability/health checks (such as Kubernetes) would conclude too soon that the service is no longer available.
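To watch the endpoints go away while the /wait request is still in flight, a small poller can be run alongside the steps above. This is only a sketch: it assumes the application listens on http://localhost:8080 as in the curl example, and that the probes are exposed at Actuator's usual paths /actuator/health/liveness and /actuator/health/readiness (adjust them if the sample project is configured differently). The class name ProbePoller and the -1 return convention are made up for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper, not part of the sample project.
public class ProbePoller {

    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(1))
            .build();

    /** Returns the HTTP status code, or -1 if the endpoint could not be reached. */
    static int probe(String url) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();
        try {
            return CLIENT.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
        } catch (IOException e) {
            return -1; // connection refused, closed, or timed out
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed base URL; pass a different one as the first argument if needed.
        String base = args.length > 0 ? args[0] : "http://localhost:8080";
        // Poll once per second for 30 seconds; send the SIGTERM while this runs.
        for (int i = 0; i < 30; i++) {
            System.out.printf("liveness=%d readiness=%d%n",
                    probe(base + "/actuator/health/liveness"),
                    probe(base + "/actuator/health/readiness"));
            Thread.sleep(1000);
        }
    }
}
```

If the behaviour described in this report holds, the printed values should switch to -1 right after the SIGTERM, even though the pending /wait request still completes.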

Comment From: bclozel

@paoven Have you seen the Kubernetes deployment section in the reference documentation?

We assume that the graceful shutdown sequence should start once the platform has stopped routing traffic to the application instance. The shutdown delay really depends on the platform (in your case the readiness check period, which is configurable).

We're considering adding an optional delay in #20995 - but a possible solution here is to configure a preStop hook as explained in our documentation.

Comment From: paoven

@bclozel Thanks for the prompt answer. I was able to obtain a clean shutdown (both on the client and the server side) by using a preStop hook and a sleeping thread that introduces the mentioned delay. About the relation between the shutdown delay and the readiness check period: as far as I can see, Kubernetes no longer relies on liveness/readiness checks once the Pod enters the Terminating state, but of course the delay is necessary to guarantee that the platform has removed the Pod reference from Services/RS and has effectively stopped sending traffic to it. Thanks again, kind regards

Comment From: jcook793

The problem with preStop delays is that they are fixed amounts of time. So if I'm willing to let uploads take 3 minutes if necessary, that means every time I do a pod deployment it is going to take 3 minutes, even if there is no traffic at all.

Comment From: paoven

@jcook793 as far as I understood, the preStop delay is only needed to let the platform/load balancer stop routing new traffic to the service that is shutting down (in Kubernetes it should be 5 to 10 seconds according to this useful article).

Existing connections are not forcibly dropped at this stage. In the upload example you mentioned, when Spring Boot receives the SIGTERM signal, it will wait for pending requests to complete, up to the maximum shutdown timeout (e.g. spring.lifecycle.timeout-per-shutdown-phase configured to 3 minutes), but if there are no pending requests it will shut down quickly without waiting the full 3 minutes.
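The "wait up to the timeout, but return early when idle" semantics can be illustrated with a plain JDK thread pool. This is an analogy only, not Spring Boot's or Tomcat's actual implementation: the class GracefulWaitSketch and its drain method are hypothetical, and the 180000 ms value simply mirrors the 3-minute example above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration using java.util.concurrent, not Spring Boot internals.
public class GracefulWaitSketch {

    /**
     * Stops accepting new work, then waits for in-flight tasks for at most timeoutMs.
     * Returns the time actually spent waiting, in milliseconds.
     */
    static long drain(ExecutorService pool, long timeoutMs) throws InterruptedException {
        long start = System.nanoTime();
        pool.shutdown(); // reject new tasks, let running ones finish
        boolean drained = pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        if (!drained) {
            pool.shutdownNow(); // timeout elapsed: interrupt whatever is left
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws InterruptedException {
        // No pending work: the wait ends almost immediately despite the 3-minute cap.
        ExecutorService idle = Executors.newFixedThreadPool(2);
        System.out.println("idle pool drained in ~" + drain(idle, 180_000) + " ms");

        // One short "request" in flight: the wait lasts only as long as that task.
        ExecutorService busy = Executors.newFixedThreadPool(2);
        busy.execute(() -> {
            try {
                Thread.sleep(500);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("busy pool drained in ~" + drain(busy, 180_000) + " ms");
    }
}
```

The timeout acts as an upper bound rather than a fixed delay, which is why a deployment with no traffic does not pay the full 3 minutes.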

Comment From: bclozel

@jcook793 See @paoven's comment - the app shuts down as soon as possible.

I'm closing this issue as a duplicate of #20995 - I don't know if we'll implement it, but in the meantime the preStop hook seems like the sensible solution here.

Comment From: lturcsanyi

The documentation is still wrong, because it states that during the graceful shutdown period the liveness probe should report LIVE state, but both endpoints are unreachable. docs

Comment From: bclozel

@lturcsanyi could you quote here exactly the section that states this?

Comment From: lturcsanyi

Sorry, I linked the wrong section. I meant the second table in the "Application lifecycle and Probes states" section: "When a Spring Boot application shuts down:". I would assume this means that the liveness probe still returns the live state.

Comment From: bclozel

Thanks for the feedback @lturcsanyi , I've created #24843