SpringBoot Improve CassandraHealthIndicator with more robust mechanism

The current CassandraHealthIndicator examines the health of the underlying cluster by executing a simple SELECT release_version FROM system.local query with the consistency level set to ONE. This works as a very simple health indicator, but it can be unreliable in a cluster with more than one node. In an N-node cluster, the current health check returns UP as long as at least one node is up and running. In the worst case, N - 1 nodes may be down, and the health check will still report the status as UP.

The CqlSession keeps the state of the cluster on the client side and makes it available via the getMetadata() method. I propose improving the CassandraHealthIndicator to report its status based on the status of all nodes in the cluster. Additionally, the indicator can report useful per-node status info such as distance, open_connections, version, etc. The improved algorithm would be very similar to https://github.com/micronaut-projects/micronaut-cassandra/pull/62/files#diff-d63dab53c3ed54a82678650761baaf5aR55-R104.
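To make the idea concrete, here is a rough JDK-only sketch of aggregating per-node state into an overall status plus per-node details. NodeInfo and NodeState are hypothetical stand-ins for the driver's node metadata, not the driver's API, and the sample values are placeholders. The status policy (UP when at least one node is up, with the details exposing the rest) is only one possible choice, assumed here for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NodeHealthSketch {

    // Hypothetical stand-in for the node state kept in the client-side metadata.
    enum NodeState { UP, DOWN, UNKNOWN }

    // Hypothetical stand-in for one node's metadata entry.
    static final class NodeInfo {
        final String name;
        final NodeState state;
        final String distance;
        final int openConnections;
        final String version;

        NodeInfo(String name, NodeState state, String distance, int openConnections, String version) {
            this.name = name;
            this.state = state;
            this.distance = distance;
            this.openConnections = openConnections;
            this.version = version;
        }
    }

    // Counts the nodes currently reported as UP.
    static long upCount(List<NodeInfo> nodes) {
        return nodes.stream().filter(n -> n.state == NodeState.UP).count();
    }

    // Assumed policy: UP when at least one node is UP, DOWN otherwise.
    static String overallStatus(List<NodeInfo> nodes) {
        return upCount(nodes) > 0 ? "UP" : "DOWN";
    }

    // Builds a per-node detail map keyed by node name, preserving input order.
    static Map<String, Map<String, Object>> details(List<NodeInfo> nodes) {
        Map<String, Map<String, Object>> result = new LinkedHashMap<>();
        for (NodeInfo n : nodes) {
            Map<String, Object> d = new LinkedHashMap<>();
            d.put("state", n.state.name());
            d.put("distance", n.distance);
            d.put("open_connections", n.openConnections);
            d.put("version", n.version);
            result.put(n.name, d);
        }
        return result;
    }

    public static void main(String[] args) {
        List<NodeInfo> nodes = List.of(
                new NodeInfo("node1", NodeState.UP, "LOCAL", 2, "3.11.6"),
                new NodeInfo("node2", NodeState.DOWN, "LOCAL", 0, "3.11.6"));
        System.out.println(overallStatus(nodes) + " (" + upCount(nodes) + "/" + nodes.size() + " nodes up)");
        System.out.println(details(nodes));
    }
}
```

With the up-count and per-node details in the report, a cluster with N - 1 nodes down would no longer look identical to a fully healthy one, even if the top-level status stays UP.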

Comment From: snicoll

Thanks for the suggestion. Our health indicators do not include many details (usually a couple), so I am not keen to add that many.

Paging @adutra for some insights on the current query.

Comment From: adutra

@snicoll the changes @tomekl007 is proposing here are indeed part of our effort to improve Cassandra health checks in some popular frameworks.

Using SELECT ... FROM system.local queries to determine cluster health is common, but it has some drawbacks: since this is a true CQL query, its execution is a blocking call, requires network I/O, and can time out (and a timeout doesn't mean the cluster is down). It also queries a random node; because of that, the returned information may vary, as is the case with the current Spring Boot implementation, which returns the server's version.

Timeouts in particular are worrisome. If an application wants to enforce aggressive SLAs, it's probably going to set a very low general timeout for CQL queries, thus exposing the health checks to frequent failures due to timeouts.
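As a rough illustration of that failure mode, the JDK-only sketch below simulates a blocking check whose backend answers correctly but more slowly than the configured timeout. slowQuery is a hypothetical stand-in for the CQL query, not driver code, and the latencies, timeouts, and version string are arbitrary placeholders.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutCheckSketch {

    // Hypothetical stand-in for a health-check query that takes latencyMillis to answer.
    static CompletableFuture<String> slowQuery(long latencyMillis) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(latencyMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "3.11.6"; // placeholder version string
        });
    }

    // Blocks on the query; a timeout is reported as DOWN even though the query would have succeeded.
    static String queryBasedStatus(long latencyMillis, long timeoutMillis) {
        try {
            slowQuery(latencyMillis).get(timeoutMillis, TimeUnit.MILLISECONDS);
            return "UP";
        } catch (TimeoutException | ExecutionException e) {
            return "DOWN";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "DOWN";
        }
    }

    public static void main(String[] args) {
        // Same healthy backend, two different timeout settings.
        System.out.println("generous timeout:   " + queryBasedStatus(10, 2000));
        System.out.println("aggressive timeout: " + queryBasedStatus(500, 50));
    }
}
```

The backend is equally healthy in both calls; only the timeout differs, which is why a timeout alone is a weak signal that the cluster is down.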

This query could be conveniently replaced with an inspection of the driver's metadata. The driver by default maintains an up-to-date snapshot of the cluster's topology and keeps a list of nodes with their states; this list is updated asynchronously. Using this information to determine the cluster's health would have the advantage of not requiring any network I/O and would not be a blocking call.

We would like to propose a PR for this. If you don't want any details in the health report, the implementation could be as simple as this:

Collection<Node> nodes = this.session.getMetadata().getNodes().values();
boolean atLeastOneUp = nodes.stream().map(Node::getState).anyMatch(state -> state == NodeState.UP);
if (atLeastOneUp) builder.up(); else builder.down();

The above would return UP if there is at least one node UP in the cluster. This is functionally equivalent to executing a SELECT ... FROM system.local query.

The changes should be straightforward for CassandraDriverHealthIndicator. However, we would need to figure out how to do it for CassandraHealthIndicator, since it doesn't have immediate access to the session.

Comment From: adutra

(The issue title mentions "with more detailed information" but it should rather indicate "with more robust mechanism" btw)

Comment From: tomekl007

PR created: https://github.com/spring-projects/spring-boot/pull/23041

Comment From: snicoll

Closing in favour of PR #23041