We use Flyway to perform database migrations on startup. We also use liveness and readiness probes with Kubernetes.
Now let's say we have a Flyway migration script that takes several minutes to execute (or at least longer than the liveness probe delay). During the startup of the Spring Boot application, the liveness probe returns status: 'DOWN' until the application is fully started.
As a consequence, Kubernetes declares the application unhealthy and enters a kill-and-restart loop. Killing the app rolls back the db migration.
After several kills and restarts, the application's rolling upgrade remains stuck in a failed state.
How to reproduce?
- create a new Spring Boot project with a db and Flyway migration enabled
- configure the liveness probe as follows:
management:
  health:
    livenessState:
      enabled: true
    readinessState:
      enabled: true
  server:
    port: 9080
    base-path: '/management'
  endpoint:
    health:
      probes:
        enabled: true
      group:
        readiness.include: db
- set a migration script that takes a long time to execute (e.g. with Postgres: select pg_sleep(60000))
- start the application
Expected:
the liveness probe should return status: UP and the readiness probe should return status: DOWN until the db migration script has completed and the application has fully started.
Current:
the liveness probe returns status: DOWN, which causes Kubernetes to kill and restart the application without giving it a chance to complete the migration.
The Kubernetes startup probe does not help here, since it could delay a valid and necessary kill-and-restart operation: a longer startup may have nothing to do with a slow db migration script, and the startup probe cannot infer whether the slow start is normal or not...
Comment From: Nowheresly
This issue may be related to https://github.com/spring-projects/spring-boot/issues/28432
Comment From: wilkinsona
the liveness probe returns status: 'DOWN' until the application is fully started.
This isn't the behaviour that I would expect to see. With the management server running on a separate port, it will be initialized in response to the main server's WebServerInitializedEvent. This event won't be published until after the main context has been refreshed and is live, that is to say after the database migration has completed. Therefore, while the database migration is in progress, I would expect the liveness probe to be unavailable.
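To make that ordering visible, a small diagnostic listener could log when each embedded server starts listening (this is just an illustrative sketch, not something the sample needs):

import org.springframework.boot.web.context.WebServerInitializedEvent;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;

@Component
public class WebServerStartupLogger {

    // Fired once per embedded server: the main one and, slightly later, the
    // "management" one. With a slow migration, neither event fires until the
    // migration has finished and the main context has been refreshed.
    @EventListener
    public void onWebServerInitialized(WebServerInitializedEvent event) {
        System.out.println("Server namespace '" + event.getApplicationContext().getServerNamespace()
                + "' listening on port " + event.getWebServer().getPort());
    }
}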
Can you please provide a sample that reproduces the down response while the migration is in progress?
Comment From: Nowheresly
Hi @wilkinsona Thanks for your quick reply!
I tried to reproduce the status: DOWN while the migration is ongoing with a simple demo project, but I was indeed not able to reproduce it. The call to the liveness probe fails with a connection error because the management port is not yet listening. I wonder why I don't seem to get this behaviour in my other project.
You can see my demo project here:
https://github.com/Nowheresly/spring-boot-issue-32282
Yet, the issue is still valid. With a down liveness probe, Kubernetes will kill the pod and thus the migration has no chance to complete.
Do you have any suggestions about how to properly handle this scenario?
Comment From: wilkinsona
I would use a startup probe. Your situation is exactly what they're intended for:
Startup probes are useful for Pods that have containers that take a long time to come into service. Rather than set a long liveness interval, you can configure a separate configuration for probing the container as it starts up, allowing a time longer than the liveness interval would allow.
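For illustration, such a probe set-up could look roughly like this (the path, period, and thresholds are only example values to adapt to the configuration above):

startupProbe:
  httpGet:
    path: /management/actuator/health/liveness
    port: 9080
  periodSeconds: 10
  failureThreshold: 60   # tolerates up to ~10 minutes of startup/migration time
livenessProbe:
  httpGet:
    path: /management/actuator/health/liveness
    port: 9080
  periodSeconds: 10
  failureThreshold: 3    # only takes effect once the startup probe has succeeded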
Comment From: Nowheresly
Well, as stated previously, I don't think the startup probe is of any help here. Our application usually does NOT start slowly, but it exceptionally could, because a new release may include a heavy Flyway migration script.
Using a startup probe with a high initialDelay could indeed fix the kill-and-restart loop described previously, but it would increase the time needed to detect genuinely wrong behaviour in all the normal cases, for example when there are no migration scripts to run.
I mean, we can indeed estimate how much time is required on average to load all the beans in an ApplicationContext on startup and set a startup probe accordingly. But as soon as we add Flyway or Liquibase scripts to our startup, the average time to start the application becomes unpredictable. It adds randomness to the startup time, which ends up in this kill-and-restart loop when initialDelay is set to the normal and usual values.
Maybe there's no solution to this problem, or maybe I am missing something, but I came to the conclusion that a clean way to fix it could be to return status UP for the startup and liveness probes during the Flyway migration, while returning status DOWN for the readiness probe.
I hope I was clear enough in my description. If you have any pointer or advice to share regarding this kind of setup, that would be greatly appreciated.
Comment From: philwebb
I'm not sure that there's much we can do out of the box to help with this situation. By default we only change the LivenessState to CORRECT when the ApplicationContext has actually started (see EventPublishingRunListener).
A Flyway migration is triggered by the FlywayMigrationInitializer, and happens in an InitializingBean callback. The context isn't considered started until all beans have been initialized.
I think for most users, this is a sensible default, but it sounds like in your case you want to change the LivenessState earlier if a database migration is running. This means that you're willing to accept the risk that if something goes wrong during the database migration your pod won't be killed.
I thought one way that you might be able to solve this is to implement your own FlywayMigrationStrategy bean. You could have one that injects the ApplicationContext and changes the LivenessState during the migration. Something like:
public void migrate(Flyway flyway) {
    try {
        // report the application as live while the migration runs, so the
        // liveness probe does not trigger a restart
        AvailabilityChangeEvent.publish(context, LivenessState.CORRECT);
        flyway.migrate();
    }
    finally {
        // revert to the pre-startup default; the regular startup flow publishes
        // CORRECT again once the ApplicationContext has fully started
        AvailabilityChangeEvent.publish(context, LivenessState.BROKEN);
    }
}
The problem is, I'm not sure that the actuator endpoint that responds to the probes will actually be up until the ApplicationContext has started. That means, even though the internal state of the probe is what you want, there's no way for Kubernetes to get it.
I honestly can't think of a good way to fix this with our current design if you want to use HTTP probes. Perhaps you could use a file-based approach instead of HTTP:
livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
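A rough sketch of how the application side of that could look, assuming the marker file is /tmp/healthy and that the listener is registered before the context is refreshed so it also sees events published from a custom migration strategy (class and file names are illustrative):

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.availability.AvailabilityChangeEvent;
import org.springframework.boot.availability.LivenessState;
import org.springframework.context.ApplicationListener;

@SpringBootApplication
public class DemoApplication {

    public static void main(String[] args) {
        SpringApplication application = new SpringApplication(DemoApplication.class);
        // added before run() so the listener also receives availability events
        // published while beans (e.g. the Flyway migration) are still initializing
        application.addListeners(new LivenessFileExporter(Path.of("/tmp/healthy")));
        application.run(args);
    }

    static class LivenessFileExporter implements ApplicationListener<AvailabilityChangeEvent<LivenessState>> {

        private final Path marker;

        LivenessFileExporter(Path marker) {
            this.marker = marker;
        }

        @Override
        public void onApplicationEvent(AvailabilityChangeEvent<LivenessState> event) {
            try {
                if (event.getState() == LivenessState.CORRECT) {
                    Files.writeString(this.marker, "ok");   // 'cat /tmp/healthy' now succeeds
                }
                else {
                    Files.deleteIfExists(this.marker);      // probe starts failing again
                }
            }
            catch (IOException ex) {
                throw new UncheckedIOException(ex);
            }
        }
    }
}

Combined with a FlywayMigrationStrategy like the one above that publishes CORRECT at the start of the migration, the marker file would exist while the migration runs, so the exec probe would not cause the pod to be killed.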
Comment From: Nowheresly
Hi @philwebb. Thank you for taking the time to provide a very clear answer!
By reading your comment, I realized that maybe this issue should have been a question instead of an issue... We have indeed considered implementing our own liveness probe to handle this case, more or less with the very same approach that you described.
Yet, we wondered whether our use case is really that specific... A continuously deployed Spring Boot app on Kubernetes, with startup/liveness/readiness probes defined using Actuator, and Flyway for db migrations. So we thought that maybe this use case is not so rare and that we are missing something.
Reading your answers, I understand it's not that easy to change the liveness probe behaviour, considering the potential impacts. So we have no choice but to implement our own liveness probe.
Right now I have no other idea, so I guess this issue can probably be closed...
Comment From: philwebb
This issue triggered some interesting discussion within the team, especially about potentially starting the management context (if actuator is on a different port) before the main context. I'll close this one for now, but we may well try to revisit our design sometime in the future.
Comment From: alexandru-lazarev
@Nowheresly Hi, so I am interested in what final solution you chose. I am now facing a similar design issue and am thinking of moving Flyway into a separate K8s init container or Job.
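For reference, a rough sketch of the init-container variant (image tag, connection details, secret names, and volume names are assumptions to adapt):

initContainers:
- name: flyway-migrate
  image: flyway/flyway:9        # official Flyway CLI image; pin a concrete version
  args: ["migrate"]
  env:
  - name: FLYWAY_URL
    value: jdbc:postgresql://db:5432/app
  - name: FLYWAY_USER
    valueFrom:
      secretKeyRef:
        name: db-credentials
        key: username
  - name: FLYWAY_PASSWORD
    valueFrom:
      secretKeyRef:
        name: db-credentials
        key: password
  volumeMounts:
  - name: migrations            # volume (or image layer) containing the SQL scripts
    mountPath: /flyway/sql

The main container then starts only after the migration has completed, so the liveness probe timing is no longer coupled to the migration time, at the cost of packaging the migration scripts separately from the application.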