Affects: Spring Boot 3.3.0, but I think every version supporting CRaC is affected


Consider the following simple application:

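import org.springframework.boot.autoconfigure.SpringBootApplication
import org.springframework.boot.runApplication
import org.springframework.scheduling.annotation.EnableScheduling
import org.springframework.scheduling.annotation.Scheduled
import org.springframework.web.bind.annotation.GetMapping
import org.springframework.web.bind.annotation.RestController
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicInteger

@SpringBootApplication
@EnableScheduling
class MyApp

fun main(args: Array<String>) {
    runApplication<MyApp>(*args)
}

@RestController
class SchedulingController {
    val data = AtomicInteger(0)

    // Increments and prints the counter once per second (fixed rate).
    @Scheduled(timeUnit = TimeUnit.SECONDS, fixedRate = 1L)
    fun increment() {
        println(data.incrementAndGet())
    }

    // Exposes the current counter value over HTTP.
    @GetMapping("/")
    fun data() = data.get()
}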

My steps are as follows:

  1. ./gradlew build
  2. Build with the following Dockerfile (docker build -t last_edit_pre .):
FROM bellsoft/liberica-runtime-container:jdk-crac-slim

ADD build/libs/last_edit-0.0.1-SNAPSHOT.jar /app/app.jar
WORKDIR /app
ENTRYPOINT java -XX:CRaCCheckpointTo=/app/checkpoint -jar /app/app.jar
  3. Run it with docker run --privileged -p 8081:8080 -it --name last_edit_pre last_edit_pre:latest and wait for some time (for example, until the count reaches 10)
  4. Create a snapshot with docker exec -it last_edit_pre jcmd 129 JDK.checkpoint
  5. Commit the snapshot to a new image with docker commit last_edit_pre last_edit_post
  6. Run the newly created image like this: docker run -it --rm --entrypoint java last_edit_post:latest -XX:CRaCRestoreFrom=/app/checkpoint

Here I observe an interesting behavior: after the restore, the counter very quickly catches up from the checkpoint moment to the current time, printing the missed increments in rapid succession. The later I restore from the snapshot, the more iterations it runs through in that burst.

This is potentially dangerous: if the scheduled operation is CPU-intensive or performs a dangerous operation, the burst can actually crash the application in a whole range of ways.

I do realize that sometimes this behavior might be required; in that case it should probably be controlled by an application property.

Comment From: sdeleuze

@asm0dey So please find below our findings and proposal.

First, be aware that only the x86 variant of bellsoft/liberica-runtime-container:jdk-crac-slim is available, so on my Mac M2 I used a modified version of https://github.com/sdeleuze/spring-boot-crac-demo to reproduce.

Second, the behavior you report is only visible in the on-demand checkpoint/restore mode (checkpointing a running application), not in the automatic checkpoint/restore at startup mode.

Third, if we take a step back, the behavior we see kind of makes sense given that fixedRate is described as "execute the annotated method with a fixed period between invocations", with the first invocations being performed before the checkpoint. Interestingly, fixedDelay works without such a side effect if you want a CRaC restoration to behave like just a faster startup, as its definition is "execute the annotated method with a fixed period between the end of the last invocation and the start of the next". Notice also that cron works as you would expect here, since cron expressions are also calculated after every task execution.
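To make the difference concrete, here is a minimal sketch of the two policies on a virtual clock measured in seconds. It is not Spring code and not how the scheduler is actually implemented; the function names and the simplified model (a pause between a checkpoint and a restore, tasks that take no time) are assumptions made purely for illustration.

```kotlin
// Simplified model of the two scheduling policies (an assumption for illustration,
// not Spring's implementation). All times are seconds on a virtual clock.

/**
 * Fixed rate: each run is due one period after the previous run's *scheduled* time.
 * Every slot that fell inside the pause is already overdue at resume time,
 * so the scheduler would fire them back to back.
 */
fun fixedRateRunsAfterPause(periodSec: Long, lastScheduledSec: Long, resumeSec: Long): Int {
    var due = lastScheduledSec + periodSec
    var backlog = 0
    while (due <= resumeSec) {
        backlog++
        due += periodSec
    }
    return backlog
}

/**
 * Fixed delay: the next run is due one delay after the previous run *finished*.
 * Only one run is ever pending, so at most one run is overdue at resume time;
 * the following run is scheduled relative to that run's completion.
 */
fun fixedDelayRunsAfterPause(delaySec: Long, lastFinishedSec: Long, resumeSec: Long): Int =
    if (lastFinishedSec + delaySec <= resumeSec) 1 else 0

// Example: checkpoint at t=10s, restore at t=70s, 1-second period/delay.
// fixedRateRunsAfterPause(1, 10, 70) returns 60
// fixedDelayRunsAfterPause(1, 10, 70) returns 1
```

In this model the size of the fixed-rate burst grows linearly with the time between checkpoint and restore, which matches the observation that restoring later produces a longer catch-up.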

As you mention yourself, sometimes the current behavior might be required and sometimes not, so I don't think we should change the default behavior. And since fixedDelay and cron work as expected with CRaC if you want a CRaC restoration to behave like just a faster startup, I would suggest turning this issue into a documentation one: add a scheduling section to the Spring CRaC refdoc that warns about this side effect of on-demand checkpoint when fixedRate is used, and recommends using fixedDelay or cron instead for that use case. Would that be ok from your POV?
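For illustration, the controller from the report could be adapted roughly as follows. This is only a sketch of the suggested workaround, not a tested fix; the class name DelayedSchedulingController and the once-per-second cron expression are assumptions chosen to mirror the original 1-second rate.

```kotlin
import org.springframework.scheduling.annotation.Scheduled
import org.springframework.web.bind.annotation.GetMapping
import org.springframework.web.bind.annotation.RestController
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical variant of the reporter's controller using fixedDelay.
@RestController
class DelayedSchedulingController {
    val data = AtomicInteger(0)

    // The next run is scheduled relative to the end of the previous run,
    // so missed slots should not pile up across a checkpoint/restore.
    @Scheduled(timeUnit = TimeUnit.SECONDS, fixedDelay = 1L)
    fun increment() {
        println(data.incrementAndGet())
    }

    // Alternative (assumed expression): a cron trigger firing every second.
    // Use it instead of the fixedDelay annotation above, not in addition to it.
    // @Scheduled(cron = "* * * * * *")

    @GetMapping("/")
    fun data() = data.get()
}
```

Note that fixedDelay measures from the end of each run, so for a task that takes noticeable time the effective period is the delay plus the run time, which differs from fixedRate.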

Comment From: asm0dey

@sdeleuze thank you for looking into it! Now that you have explained the intricacies of the behavior, it makes perfect sense! And now that I understand the behavior, I totally agree that it's just a matter of documentation.