Project CRaC introduces a mechanism for taking a JVM checkpoint snapshot (typically after startup) and then restoring from that checkpoint image for further deployment purposes, reducing the startup time.
Spring Boot on Tomcat is a target scenario for CRaC already. Spring applications are natural candidates for checkpoints after startup (plus some warming up through initial requests).
A couple of specific requirements need to be addressed: in particular the closing of file handles and network connections at checkpoint time plus subsequent restoring of those handles, as well as the refreshing of cached host metadata in a restored JVM. CRaC provides a Resource API for registering corresponding beforeCheckpoint/afterRestore callbacks.
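As a rough illustration of the close-at-checkpoint, reopen-at-restore requirement, here is a minimal sketch using only the JDK. The `CheckpointResource` interface is a hypothetical stand-in that mirrors the shape of the CRaC callback pair (the real `org.crac.Resource` methods also receive a context argument), and `AppendLog` is an invented example component, not anything from Spring or CRaC.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in for the CRaC callback pair, so the sketch compiles
// without the org.crac library on the classpath.
interface CheckpointResource {
    void beforeCheckpoint() throws Exception;
    void afterRestore() throws Exception;
}

// Invented example component that holds an open file handle.
class AppendLog implements CheckpointResource {
    private final Path path;
    private FileChannel channel;

    AppendLog(Path path) throws IOException {
        this.path = path;
        open();
    }

    private void open() throws IOException {
        channel = FileChannel.open(path, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    boolean isOpen() {
        return channel != null && channel.isOpen();
    }

    void append(String line) throws IOException {
        ByteBuffer buffer = ByteBuffer.wrap((line + "\n").getBytes(StandardCharsets.UTF_8));
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
    }

    // Release the file handle so that no open handle exists at checkpoint time.
    @Override
    public void beforeCheckpoint() throws IOException {
        channel.close();
    }

    // Re-acquire the handle in the restored process.
    @Override
    public void afterRestore() throws IOException {
        open();
    }
}

class AppendLogDemo {
    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("crac-demo", ".log");
        AppendLog log = new AppendLog(file);
        log.append("before checkpoint");
        log.beforeCheckpoint();   // simulated checkpoint
        log.afterRestore();       // simulated restore
        log.append("after restore");
        System.out.println(Files.readAllLines(file));
    }
}
```

With the actual CRaC API, such a component would be registered with the CRaC context instead of having its callbacks invoked by hand as the demo does.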
From the Spring Framework side, we intend to revisit our Lifecycle contract where the existing stop/start mechanism implies the suspension of application-internal async processing and messaging resources already. We could narrow those semantics so that stop/start becomes a good citizen in a checkpoint/restore scenario, implying CRaC-compatible handling of resources in Spring-managed beans. This can then be triggered through a single ConfigurableApplicationContext.stop/start call which propagates to all contained beans, e.g. as part of a central CRaC Resource adapter in Spring Boot.
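To make the propagation idea concrete, the following sketch models a central adapter that turns one checkpoint/restore pair into stop/start calls on every managed bean. Everything here is hypothetical: `SimpleLifecycle` stands in for Spring's Lifecycle contract, `LifecycleCheckpointAdapter` for the central adapter, and the reverse-order stop is a simplifying assumption (Spring itself orders lifecycle beans by phase). In a real application the adapter would presumably just delegate to ConfigurableApplicationContext.stop/start.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the stop/start part of Spring's Lifecycle contract.
interface SimpleLifecycle {
    void start();
    void stop();
}

// Hypothetical central adapter: a single checkpoint/restore pair fans out to all beans.
class LifecycleCheckpointAdapter {
    private final List<SimpleLifecycle> beans;

    LifecycleCheckpointAdapter(List<SimpleLifecycle> beans) {
        this.beans = List.copyOf(beans);
    }

    // Assumption: stop in reverse registration order so dependents stop first.
    void beforeCheckpoint() {
        for (int i = beans.size() - 1; i >= 0; i--) {
            beans.get(i).stop();
        }
    }

    // Restart in registration order once the JVM has been restored.
    void afterRestore() {
        for (SimpleLifecycle bean : beans) {
            bean.start();
        }
    }
}

// Invented bean that records the calls it receives.
class RecordingBean implements SimpleLifecycle {
    private final String name;
    private final List<String> events;

    RecordingBean(String name, List<String> events) {
        this.name = name;
        this.events = events;
    }

    @Override
    public void start() {
        events.add("start " + name);
    }

    @Override
    public void stop() {
        events.add("stop " + name);
    }
}

class LifecycleAdapterDemo {
    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        LifecycleCheckpointAdapter adapter = new LifecycleCheckpointAdapter(List.of(
                new RecordingBean("connectionPool", events),
                new RecordingBean("messageListener", events)));
        adapter.beforeCheckpoint();
        adapter.afterRestore();
        events.forEach(System.out::println);
    }
}
```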
Comment From: tzolov
I've been testing CRaC in the context of Spring and Spring Integration.
For the tests I've put together a generic CRaCAdapter auto-configuration that internally leverages ConfigurableApplicationContext.stop/start, and built a CRaC container image with Ubuntu 22.04 and the latest CRaC JVM preinstalled.
A pre-built version of the image is also available at: tzolov/java_17_crac:latest.
Then I tried the CRaCAdapter with a few existing Spring Integration (SI) samples:
- file-split-ftp
The run instructions show how to run the application, create a checkpoint and then re-run from the restored checkpoint.
It appears to work as expected. Apart from the embedded Tomcat issue, it works fine when Tomcat is replaced by Jetty.
- kafka-dsl
Repeating the same test with this long-running application reveals an important limitation of the current CRaC implementation: it does not provide any mechanism to coordinate multiple threads.
As a result, when restoring from a checkpoint, CRaC will start the main thread before the Resource afterRestore methods have completed. For the kafka-dsl sample, the restored application starts trying to send Kafka messages before afterRestore has completed, i.e. before the Spring context has restarted and the Kafka connections have been re-established. As expected, this fails.
I started a related discussion on the CRaC mailing list (here is a sample crac-demo to illustrate the issue). Radim's and Dan's responses are very interesting, though a bit beyond my depth.
As a result of that discussion, this PR has been submitted: https://github.com/openjdk/crac/pull/58
The RCULock is an option to try to ensure safe checkpoint creation/restoration, but it still requires application modifications and is not without performance cost.
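To illustrate the kind of application-level change this coordination problem calls for, here is a minimal sketch of a gate that holds worker threads back until the restore callbacks have finished. This is not the RCULock from the PR above; `RestoreGate` and its method names are hypothetical, built on a plain Java monitor. It only blocks new work while the gate is closed and does not wait for work that is already in flight when the checkpoint starts.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical gate: closed before the checkpoint, reopened at the end of afterRestore.
class RestoreGate {
    private boolean open = true;

    // Call from beforeCheckpoint: new work must wait from now on.
    synchronized void close() {
        open = false;
    }

    // Call as the last step of afterRestore: release all waiting workers.
    synchronized void open() {
        open = true;
        notifyAll();
    }

    // Call from worker threads before touching restored resources.
    synchronized void awaitOpen() throws InterruptedException {
        while (!open) {
            wait();
        }
    }
}

class RestoreGateDemo {
    public static void main(String[] args) throws Exception {
        RestoreGate gate = new RestoreGate();
        List<String> events = new CopyOnWriteArrayList<>();

        gate.close(); // simulated beforeCheckpoint

        // Stands in for the application thread that sends messages.
        Thread sender = new Thread(() -> {
            try {
                gate.awaitOpen();
                events.add("message sent");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        sender.start();

        // Simulated afterRestore: reconnect first, then open the gate.
        events.add("connections re-established");
        gate.open();

        sender.join();
        events.forEach(System.out::println);
    }
}
```

Every guarded call pays for a monitor acquisition, which is one simple form of the performance cost mentioned above.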
Comment From: rishiraj88
Thanks, @tzolov , for the descriptive comment. It's quite comprehensive and useful when perused.