Project CRaC introduces a mechanism for taking a JVM checkpoint snapshot (typically after startup) and then restoring from that checkpoint image for further deployment purposes, reducing the startup time.
Spring Boot on Tomcat is a target scenario for CRaC already. Spring applications are natural candidates for checkpoints after startup (plus some warming up through initial requests).
A couple of specific requirements need to be addressed: in particular the closing of file handles and network connections at checkpoint time plus subsequent restoring of those handles, as well as the refreshing of cached host metadata in a restored JVM. CRaC provides a Resource API for registering corresponding beforeCheckpoint/afterRestore callbacks.
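As a rough illustration of the callback pair described above, here is a minimal sketch of a component that releases a file handle before a checkpoint and reacquires it after restore. To stay runnable on a stock JDK it declares its own stand-in interface, CheckpointAware, instead of importing CRaC's Resource type. The real callbacks also receive a context argument, and registration goes through CRaC's global context. The names CheckpointSketch, CheckpointAware and LogFileHolder are hypothetical and exist only for this example.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class CheckpointSketch {

    /** Hypothetical stand-in for CRaC's Resource interface, simplified to no-arg callbacks. */
    public interface CheckpointAware {
        void beforeCheckpoint() throws Exception;
        void afterRestore() throws Exception;
    }

    /** Holds an open file handle that has to be released before a checkpoint is taken. */
    public static class LogFileHolder implements CheckpointAware {
        private final Path path;
        private FileChannel channel;

        public LogFileHolder(Path path) throws IOException {
            this.path = path;
            open();
        }

        private void open() throws IOException {
            channel = FileChannel.open(path, StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        }

        public boolean isOpen() {
            return channel.isOpen();
        }

        @Override
        public void beforeCheckpoint() throws IOException {
            channel.close(); // release the handle so it is not open at checkpoint time
        }

        @Override
        public void afterRestore() throws IOException {
            open(); // reacquire the handle in the restored JVM
        }
    }

    public static void main(String[] args) throws Exception {
        LogFileHolder holder = new LogFileHolder(Files.createTempFile("crac-sketch", ".log"));
        holder.beforeCheckpoint();
        System.out.println("open at checkpoint time: " + holder.isOpen());
        holder.afterRestore();
        System.out.println("open after restore: " + holder.isOpen());
    }
}
```

With the actual CRaC API, the same close/reopen logic would live in a Resource implementation registered with the runtime. The sketch only shows the shape of the contract.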
From the Spring Framework side, we intend to revisit our Lifecycle contract where the existing stop/start mechanism implies the suspension of application-internal async processing and messaging resources already. We could narrow those semantics so that stop/start becomes a good citizen in a checkpoint/restore scenario, implying CRaC-compatible handling of resources in Spring-managed beans. This can then be triggered through a single ConfigurableApplicationContext.stop/start call which propagates to all contained beans, e.g. as part of a central CRaC Resource adapter in Spring Boot.
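The idea of a central adapter could look roughly like the following sketch, in which a single pair of checkpoint callbacks is translated into one stop/start call that fans out to all contained beans. Everything here is a stand-in under stated assumptions: the Lifecycle interface only mirrors the shape of Spring's contract, and MiniContext, RecordingBean and CheckpointAdapter are hypothetical names, not Spring or CRaC types. Stopping in reverse registration order is also an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class LifecycleBridgeSketch {

    /** Stand-in with the same shape as Spring's Lifecycle contract. */
    public interface Lifecycle {
        void start();
        void stop();
        boolean isRunning();
    }

    /** Hypothetical bean that records its transitions in a shared log. */
    public static class RecordingBean implements Lifecycle {
        private final String name;
        private final List<String> log;
        private boolean running;

        public RecordingBean(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override public void start() { running = true; log.add("start " + name); }
        @Override public void stop() { running = false; log.add("stop " + name); }
        @Override public boolean isRunning() { return running; }
    }

    /** Hypothetical minimal context: one stop/start call reaches every registered bean. */
    public static class MiniContext implements Lifecycle {
        private final List<Lifecycle> beans = new ArrayList<>();
        private boolean running;

        public void register(Lifecycle bean) { beans.add(bean); }

        @Override public void start() {
            for (Lifecycle bean : beans) bean.start();
            running = true;
        }

        @Override public void stop() {
            // Reverse registration order is an assumption of this sketch.
            for (int i = beans.size() - 1; i >= 0; i--) beans.get(i).stop();
            running = false;
        }

        @Override public boolean isRunning() { return running; }
    }

    /** Hypothetical central adapter translating checkpoint callbacks into stop/start. */
    public static class CheckpointAdapter {
        private final Lifecycle context;

        public CheckpointAdapter(Lifecycle context) { this.context = context; }

        public void beforeCheckpoint() { context.stop(); }
        public void afterRestore() { context.start(); }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        MiniContext context = new MiniContext();
        context.register(new RecordingBean("scheduler", log));
        context.register(new RecordingBean("listener", log));
        context.start();

        CheckpointAdapter adapter = new CheckpointAdapter(context);
        adapter.beforeCheckpoint();
        adapter.afterRestore();
        System.out.println(log);
    }
}
```

The point of the design is that individual beans never see the checkpoint machinery; they only need stop/start semantics that actually release and reacquire their resources.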
Comment From: tzolov
I've been testing CRaC in the context of Spring and Spring Integration.
For the tests I've put together a generic CRaCAdapter auto-configuration that internally leverages ConfigurableApplicationContext.stop/start, and built a CRaC container image with Ubuntu 22.04 and the latest CRaC JVM preinstalled.
A pre-built version of the image is also available at: tzolov/java_17_crac:latest.
Then I've tried the CRaCAdapter with a few existing SI samples:
- file-split-ftp
The run instructions show how to run the application, create a checkpoint and then re-run from the restored checkpoint.
It appears to work as expected. Apart from the embedded Tomcat issue, it works fine when Tomcat is replaced by Jetty.
- kafka-dsl
Repeating the same test with this long-running application reveals an important limitation of the current CRaC implementation: it does not provide any mechanism to coordinate multiple threads.
As a result, when restoring from a checkpoint, CRaC will start the main thread before the Resource afterRestore methods have completed. For the kafka-dsl sample, the restored application will start trying to send Kafka messages before afterRestore has completed, i.e. the Spring context hasn't started yet and Kafka connections haven't been re-established. As expected, this fails.
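To make the ordering problem concrete, here is a minimal sketch of one possible application-level workaround: a gate that is closed before the checkpoint and opened only once the restore work has finished, so that a sender thread resuming early blocks instead of using a connection that is not there yet. This is an assumption-laden illustration, not part of CRaC or Spring; RestoreGate and senderSawDisconnected are hypothetical names, and the AtomicBoolean merely simulates a Kafka connection.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class RestoreGateSketch {

    /** Hypothetical gate that application threads consult before touching restored resources. */
    public static class RestoreGate {
        // A latch with count 0 never blocks, so the gate starts out open.
        private volatile CountDownLatch latch = new CountDownLatch(0);

        public void close() { latch = new CountDownLatch(1); }

        public void open() { latch.countDown(); }

        public boolean awaitOpen(long timeout, TimeUnit unit) throws InterruptedException {
            return latch.await(timeout, unit);
        }
    }

    /** Simulates one checkpoint/restore cycle; returns true if the sender saw a missing connection. */
    public static boolean senderSawDisconnected() throws InterruptedException {
        RestoreGate gate = new RestoreGate();
        AtomicBoolean connected = new AtomicBoolean(true); // simulated connection state
        AtomicBoolean sawDisconnected = new AtomicBoolean(false);

        // "beforeCheckpoint": close the gate first, then drop the connection.
        gate.close();
        connected.set(false);

        // After "restore", the application thread resumes right away.
        Thread sender = new Thread(() -> {
            try {
                boolean opened = gate.awaitOpen(5, TimeUnit.SECONDS);
                if (!opened || !connected.get()) {
                    sawDisconnected.set(true);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                sawDisconnected.set(true);
            }
        });
        sender.start();

        // "afterRestore": slow reconnect work, then open the gate.
        Thread.sleep(100);
        connected.set(true);
        gate.open();

        sender.join();
        return sawDisconnected.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("sender saw missing connection: " + senderSawDisconnected());
    }
}
```

Note that such a gate only holds back threads that check it, and it does not wait for operations already in flight when the checkpoint begins, so every affected code path in the application has to be modified to use it.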
I started a related discussion on the CRaC mailing list (here is a sample crac-demo to illustrate the issue). Radim's and Dan's responses are very interesting, though a bit beyond my depth.
Also, as a result of the discussion, this PR has been submitted: https://github.com/openjdk/crac/pull/58
The RCULock is an option to try to ensure safe checkpoint creation/restoration, but it still imposes application modifications and is not without performance cost.
Comment From: rishiraj88
Thanks, @tzolov, for the descriptive comment. It's quite comprehensive and useful when perused.