Spring Compatibility with JVM checkpoint restore (OpenJDK's Project CRaC)

Project CRaC introduces a mechanism for taking a JVM checkpoint snapshot (typically after startup) and then restoring from that checkpoint image for further deployment purposes, reducing the startup time.

Spring Boot on Tomcat is a target scenario for CRaC already. Spring applications are natural candidates for checkpoints after startup (plus some warming up through initial requests).

A couple of specific requirements need to be addressed: in particular the closing of file handles and network connections at checkpoint time plus subsequent restoring of those handles, as well as the refreshing of cached host metadata in a restored JVM. CRaC provides a Resource API for registering corresponding beforeCheckpoint/afterRestore callbacks.
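As a rough illustration of the callback pair described above, here is a minimal sketch of a bean that releases a file handle before a checkpoint and reacquires it after restore. `CheckpointCallbacks` and `LogFileHolder` are hypothetical names used only for this example. The interface is a local stand-in so the snippet runs without the CRaC library on the classpath. The actual CRaC Resource callbacks also take a context argument, and resources are registered with a CRaC context rather than invoked by hand.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Local stand-in for the beforeCheckpoint/afterRestore callback pair.
// Assumption: simplified signatures, no context argument.
interface CheckpointCallbacks {
    void beforeCheckpoint() throws Exception;
    void afterRestore() throws Exception;
}

// Hypothetical bean owning a file handle that must not be open in the image.
class LogFileHolder implements CheckpointCallbacks {
    private final Path path;
    private FileChannel channel;

    LogFileHolder(Path path) throws IOException {
        this.path = path;
        this.channel = open();
    }

    private FileChannel open() throws IOException {
        return FileChannel.open(path, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    boolean isOpen() {
        return channel != null && channel.isOpen();
    }

    @Override
    public void beforeCheckpoint() throws IOException {
        // Release the handle so the checkpoint image holds no open file.
        if (channel != null) {
            channel.close();
            channel = null;
        }
    }

    @Override
    public void afterRestore() throws IOException {
        // Reacquire the handle in the restored JVM.
        channel = open();
    }
}

class ResourceSketch {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("crac-sketch", ".log");
        LogFileHolder holder = new LogFileHolder(tmp);
        holder.beforeCheckpoint();
        System.out.println("open at checkpoint time: " + holder.isOpen());
        holder.afterRestore();
        System.out.println("open after restore: " + holder.isOpen());
        holder.beforeCheckpoint(); // close again before deleting the temp file
        Files.delete(tmp);
    }
}
```

The same close-then-reopen shape would apply to network connections; refreshing cached host metadata would go into the restore callback.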

From the Spring Framework side, we intend to revisit our Lifecycle contract where the existing stop/start mechanism implies the suspension of application-internal async processing and messaging resources already. We could narrow those semantics so that stop/start becomes a good citizen in a checkpoint/restore scenario, implying CRaC-compatible handling of resources in Spring-managed beans. This can then be triggered through a single ConfigurableApplicationContext.stop/start call which propagates to all contained beans, e.g. as part of a central CRaC Resource adapter in Spring Boot.
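To make the intended propagation concrete, the following sketch shows one adapter translating checkpoint/restore callbacks into a single stop/start call on a context, which then fans out to its beans. Everything here is a stand-in written for this example: `SketchLifecycle` only mirrors the start/stop/isRunning shape of Spring's Lifecycle, `SketchContext` is not ConfigurableApplicationContext, and stopping in reverse registration order is a simplifying assumption rather than Spring's actual ordering rules.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in mirroring the start/stop/isRunning shape of a lifecycle contract.
interface SketchLifecycle {
    void start();
    void stop();
    boolean isRunning();
}

// Hypothetical bean that records its transitions in a shared event list.
class RecordingBean implements SketchLifecycle {
    private final String name;
    private final List<String> events;
    private boolean running;

    RecordingBean(String name, List<String> events) {
        this.name = name;
        this.events = events;
    }

    @Override public void start() { running = true; events.add("start:" + name); }
    @Override public void stop() { running = false; events.add("stop:" + name); }
    @Override public boolean isRunning() { return running; }
}

// Stand-in context: propagates start/stop to all registered beans.
class SketchContext implements SketchLifecycle {
    private final List<SketchLifecycle> beans = new ArrayList<>();

    void register(SketchLifecycle bean) { beans.add(bean); }

    @Override
    public void start() {
        for (SketchLifecycle bean : beans) {
            if (!bean.isRunning()) bean.start();
        }
    }

    @Override
    public void stop() {
        // Assumption: stop in reverse registration order.
        for (int i = beans.size() - 1; i >= 0; i--) {
            if (beans.get(i).isRunning()) beans.get(i).stop();
        }
    }

    @Override
    public boolean isRunning() {
        return beans.stream().anyMatch(SketchLifecycle::isRunning);
    }
}

// Central adapter: one checkpoint callback maps to one context-level call.
class ContextCheckpointAdapter {
    private final SketchLifecycle context;

    ContextCheckpointAdapter(SketchLifecycle context) { this.context = context; }

    void beforeCheckpoint() { context.stop(); }
    void afterRestore() { context.start(); }
}

class LifecycleSketch {
    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        SketchContext context = new SketchContext();
        context.register(new RecordingBean("a", events));
        context.register(new RecordingBean("b", events));
        context.start();
        ContextCheckpointAdapter adapter = new ContextCheckpointAdapter(context);
        adapter.beforeCheckpoint();
        adapter.afterRestore();
        System.out.println(events);
    }
}
```

The point of the design is that individual beans only need to honor stop/start correctly; they do not have to know about checkpoints at all.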

Comment From: tzolov

I've been testing CRaC in the context of Spring and Spring Integration.

For the tests I've put together a generic CRaCAdapter autoconfiguration that internally leverages ConfigurableApplicationContext.stop/start, and built a CRaC container image with Ubuntu 22.04 and the latest CRaC JVM preinstalled. A pre-built version of the image is also available at: tzolov/java_17_crac:latest.

Then I tried the CRaCAdapter with a few existing SI samples:

- file-split-ftp

The run instructions show how to run the application, create a checkpoint and then re-run from the restored checkpoint.

It appears to work as expected. Apart from the embedded Tomcat issue, it works fine when Tomcat is replaced by Jetty.

- kafka-dsl

Repeating the same test with this long-running application reveals an important limitation of the current CRaC implementation: CRaC does not provide any mechanism to coordinate multiple threads. As a result, when restoring from a checkpoint, CRaC will start the main thread before the Resource afterRestore methods have completed. For the kafka-dsl sample, the restored application starts trying to send Kafka messages before afterRestore has completed, i.e. while the Spring context hasn't started yet and the Kafka connections haven't been reestablished. As expected, this fails.
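One way to picture the kind of application-level change this forces is a gate that sender threads must pass and that stays closed until the restore callback has finished. The sketch below is only an illustration under that assumption. `RestoreGate` and `GatedSender` are hypothetical names, the "connection" is a boolean flag rather than a Kafka client, and the gate does not wait for sends that are already in progress when it closes.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Reusable gate: threads block in awaitOpen() while the gate is closed.
class RestoreGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition opened = lock.newCondition();
    private boolean open = true;

    void close() {
        lock.lock();
        try { open = false; } finally { lock.unlock(); }
    }

    void open() {
        lock.lock();
        try { open = true; opened.signalAll(); } finally { lock.unlock(); }
    }

    void awaitOpen() throws InterruptedException {
        lock.lock();
        try {
            while (!open) opened.await();
        } finally {
            lock.unlock();
        }
    }
}

// Hypothetical sender whose connection is torn down around a checkpoint.
class GatedSender {
    private final RestoreGate gate;
    private final List<String> sent = new CopyOnWriteArrayList<>();
    private volatile boolean connected = true;

    GatedSender(RestoreGate gate) { this.gate = gate; }

    void beforeCheckpoint() {
        gate.close();        // stop new sends first
        connected = false;   // then drop the (simulated) connection
    }

    void afterRestore() {
        connected = true;    // reestablish the (simulated) connection first
        gate.open();         // only then let waiting senders proceed
    }

    void send(String message) throws InterruptedException {
        gate.awaitOpen();
        if (!connected) throw new IllegalStateException("not connected");
        sent.add(message);
    }

    List<String> sent() { return sent; }
}

class GateSketch {
    public static void main(String[] args) throws InterruptedException {
        GatedSender sender = new GatedSender(new RestoreGate());
        sender.beforeCheckpoint();
        Thread worker = new Thread(() -> {
            try {
                sender.send("hello");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();          // blocks at the gate instead of failing
        sender.afterRestore();
        worker.join();
        System.out.println(sender.sent());
    }
}
```

Every send pays for the lock acquisition, which is a small-scale version of the trade-off between safety and performance in such schemes.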

I started a related discussion on the CRaC mailing list (here is a sample crac-demo to illustrate the issue). Radim's and Dan's responses are very interesting, though a bit beyond my depth.

Also, as a result of the discussion, this PR has been submitted: https://github.com/openjdk/crac/pull/58 The RCULock is an option for ensuring safe checkpoint creation/restoration, but it still requires application modifications and is not without performance cost.

Comment From: rishiraj88

Thanks, @tzolov , for the descriptive comment. It's quite comprehensive and useful when perused.