SpringBoot Log failing calls to health indicators

This is similar to #22509: the goal is to improve root cause analysis when probe endpoints return a non-200 response.

I am migrating k8s HTTP probes to use the readiness and liveness health group endpoints (/actuator/health/[readiness|liveness]). When these endpoints return a non-UP status (a response other than 200), k8s stops routing traffic to the pod or shuts it down. When that happens, the k8s HTTP probe records only the returned HTTP status as the reason for the probe failure. This makes it hard to investigate later WHY the readiness/liveness probe returned a non-200 response. Even if k8s could record the body of the probe response, it would be nicer to have that information in the application log.

I wrote this implementation for our services to log information when a health endpoint returns a non-UP response.

@Slf4j
public class LoggingHealthEndpointWebExtension extends HealthEndpointWebExtension {

    public LoggingHealthEndpointWebExtension(HealthContributorRegistry registry, HealthEndpointGroups groups) {
        super(registry, groups);
    }

    @Override
    public WebEndpointResponse<HealthComponent> health(ApiVersion apiVersion, SecurityContext securityContext,
            boolean showAll, String... path) {
        WebEndpointResponse<HealthComponent> response = super.health(apiVersion, securityContext, showAll, path);
        HealthComponent health = response.getBody();
        if (health == null) {
            return response;
        }

        Status status = health.getStatus();
        if (!Status.UP.equals(status)) {
            Map<String, HealthComponent> components = new TreeMap<>();
            if (health instanceof CompositeHealth) {
                Map<String, HealthComponent> details = ((CompositeHealth) health).getComponents();
                if (details != null) {
                    components.putAll(details);
                }
            }
            log.warn("Health endpoints {} returned {}. components={}", path, status, components);
        }

        return response;
    }

}

If HealthEndpointSupport had a logging capability (or HealthEndpointWebExtension and ReactiveHealthEndpointWebExtension, for web only), we wouldn't need this custom implementation.

Something like:

boolean enableLogging;

if(this.enableLogging && health.getStatus() != Status.UP) {
  log.warn(...);
}

Comment From: wilkinsona

Each indicator that subclasses AbstractHealthIndicator should already log a warning when a health check fails:

https://github.com/spring-projects/spring-boot/blob/c626e947cbb3a829449cdbdf18d2b7c47459f900/spring-boot-project/spring-boot-actuator/src/main/java/org/springframework/boot/actuate/health/AbstractHealthIndicator.java#L84-L90

Does this existing logging not meet your needs? I don't think we should duplicate it, particularly in a manner that is web-specific.

Comment From: ttddyy

@wilkinsona Thanks for the pointer.

The logging in the code above only happens when the actual health check logic throws an exception. Some health check implementations throw an Exception and let this code set the DOWN status, but others construct the health status in their own logic without throwing an exception.

My intention is to log whenever the (aggregated) health check returns a non-UP state (DOWN, OUT_OF_SERVICE, UNKNOWN). This is because the response of the liveness health group (/actuator/health/liveness) will trigger a restart of the application. When that happens, I would like to go back to the application log and find out which health indicator returned a non-UP status, along with details if available.

Comment From: bclozel

After discussing with the team, we've decided to log failing indicators individually at the WARN level in all cases, i.e. not just when an exception is thrown.

We've also decided to log ApplicationAvailability state changes in a separate issue, see #23098.

Comment From: snicoll

I was tempted to harmonize the logging so that it's applied the exact same way, but while we have the Exception handy in the current case, the programmatic case gives us only a String representation of the exception, or no exception at all. It would be nice if we could provide the stack trace as well when it is configured programmatically, but that would mean keeping the Exception reference in the builder and providing an accessor of some kind.

Comment From: snicoll

We've discussed a few options to get back the exception if one has been configured. One option is to expose an exception in the resulting Health. Another option is to set a Throwable for the error attribute rather than its string representation and use a Jackson serializer to serialize the exception the way withException translates it to a string.

With access to the exception, we could have a common code path that logs a warning consistently.

Comment From: jackhammer2k

I accidentally implemented ttddyy's proposal in #24345 and was pointed to this thread by philwebb.

I still prefer to log failed health checks in WebEndpoints because it will definitely apply to all health checks, not only the ones inheriting from AbstractHealthIndicator. Before this thread I did not know that this class existed at all, so all our custom health checks directly implement HealthIndicator. I bet we are not the only ones. ;)

In addition I still see AbstractHealthIndicator only logging in case of an exception in master. I would also expect the log level to differ between the case where the health check itself generates an error and the case where it purposely returns "not up".

Comment From: snicoll

> In addition I still see AbstractHealthIndicator only logging in case of an exception in master.

Yes, that is to be expected. That's what this issue is going to fix and it is open.

Comment From: jackhammer2k

> After discussing with the team, we've decided to log failing indicators individually at the WARN level in all cases, i.e. not just when an exception is thrown.

I've checked #6c8c850 and maybe I'm wrong, but it does still only log if the health check execution throws an exception.

if (ex != null && this.logger.isWarnEnabled()) {

Because a builder is passed into the method which can be used to directly set the status to DOWN (the intuitive way), I would assume that almost nobody throws an exception to report DOWN without any issue in the assessment of the state (as mentioned in the Javadoc):

     * @throws Exception any {@link Exception} that should create a {@link Status#DOWN}
     * system status.

Comment From: snicoll

> I've checked #6c8c850 and maybe I'm wrong, but it does still only log if the health check execution throws an exception.

I don't think it does. The code has been harmonized to check for the presence of an exception in the builder, and one is set either if you pass it to the builder or if you throw an exception. There are a number of tests in this commit that prove the behavior change that this issue addresses.
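To make the three cases concrete, here is a minimal, self-contained sketch of the behavior as described in this thread. It is a simplified model, not the Spring Boot source: `Builder`, `Check`, `health` and the `warnings` list are hypothetical stand-ins for `Health.Builder`, `doHealthCheck`, `AbstractHealthIndicator.health()` and the logger.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in model of the behavior discussed above; not Spring Boot source. */
public class HealthLoggingSketch {

    /** Hypothetical stand-in for Health.Builder: a status plus an optional exception. */
    static final class Builder {
        String status = "UNKNOWN";
        Throwable exception; // stays null unless down(Throwable) is used

        Builder up() { this.status = "UP"; return this; }
        Builder down() { this.status = "DOWN"; return this; }
        Builder down(Throwable ex) { this.exception = ex; return down(); }
    }

    /** Hypothetical stand-in for the doHealthCheck(Health.Builder) callback. */
    interface Check {
        void run(Builder builder) throws Exception;
    }

    /** Captured warnings, standing in for the real logger. */
    static final List<String> warnings = new ArrayList<>();

    /** Runs a check and warns only when an exception is attached to the result. */
    static String health(String name, Check check) {
        Builder builder = new Builder();
        try {
            check.run(builder);
        } catch (Exception ex) {
            builder.down(ex); // a thrown exception becomes DOWN with the exception attached
        }
        if (builder.exception != null) {
            warnings.add(name + " health check failed: " + builder.exception.getMessage());
        }
        return builder.status;
    }

    public static void main(String[] args) {
        health("statusOnly", b -> b.down());                                               // no warning
        health("withException", b -> b.down(new IllegalStateException("db unreachable"))); // warning
        health("thrown", b -> { throw new IllegalStateException("timeout"); });            // warning
        warnings.forEach(System.out::println);
    }
}
```

In this model all three checks report DOWN, but only the two that carry an exception produce a warning. The status-only case stays silent.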

Comment From: jackhammer2k

Sorry, I did assume that the following test in AbstractHealthIndicatorTests would succeed:

    @Test
    void healthCheckWhenDownDoesNotLogHealthCheckFailedMessage(CapturedOutput output) {
        Health health = new TestHealthIndicator("Test message",
                (builder) -> builder.down().withDetail("reason", "expected").build()).health();
        assertThat(health.getStatus()).isEqualTo(Status.DOWN);
        assertThat(output).contains("reason").contains("expected");
    }

but it looks like the design decision was taken in favor of "an exception has to be set" instead of "status == DOWN is sufficient". That's now clear to me; however, when using Health it's not obvious that the favored way of reporting a DOWN health status is by setting an exception in the builder.

Comment From: snicoll

The purpose of this issue is to log failing calls to a health indicator so that you have access to the call stack. If you choose to mark a service as down using only the status, I don't think we should be logging anything.

Comment From: bbakerman

We are using org.springframework.boot:spring-boot-actuator:3.2.7, and you cannot replace the ReactiveHealthEndpointWebExtension as shown in other answers. We tried this and it failed with:

Caused by: java.lang.IllegalStateException: Found multiple extensions for the endpoint bean healthEndpoint (loggingHealthEndpointWebExtension, reactiveHealthEndpointWebExtension)
    at org.springframework.boot.actuate.endpoint.annotation.EndpointDiscoverer.convertToEndpoint(EndpointDiscoverer.java:198)
    at org.springframework.boot.actuate.endpoint.annotation.EndpointDiscoverer.convertToEndpoints(EndpointDiscoverer.java:182)

The current Spring code only logs if the HealthIndicator's result contains an exception. If it is just DOWN with details, say, nothing is logged.

The PR https://github.com/spring-projects/spring-boot/pull/33774 shows how it only logs when there is an exception present.

BUT if you have 3 checks and one is DOWN, how do you know which one is down from the logs? Sure, the returned HTTP data has that, but server-side teams typically support their code from logs.

So I wrote the code outlined here to get around this:

https://stackoverflow.com/questions/54977273/enable-logging-in-spring-boot-actuator-health-check-api/79187607#79187607
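
Independently of the linked answer, the "which of the 3 checks is DOWN" question can be illustrated with a small sketch. It is an illustration only and uses plain strings instead of Spring's types: the `failing` helper, the class name and the component names are all hypothetical. In a real application the map would presumably be derived from `CompositeHealth.getComponents()`, as in the first snippet of this issue.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Illustration only: plain strings stand in for Spring's Status and HealthComponent. */
public class FailingComponentsSketch {

    /** Returns the non-UP entries, sorted by component name for stable log output. */
    static Map<String, String> failing(Map<String, String> statusByComponent) {
        Map<String, String> failing = new TreeMap<>();
        statusByComponent.forEach((name, status) -> {
            if (!"UP".equals(status)) {
                failing.put(name, status);
            }
        });
        return failing;
    }

    public static void main(String[] args) {
        // Hypothetical component names; three checks, one of them DOWN.
        Map<String, String> components = new LinkedHashMap<>();
        components.put("db", "UP");
        components.put("redis", "DOWN");
        components.put("diskSpace", "UP");
        // prints: Health group liveness is not UP. failing={redis=DOWN}
        System.out.println("Health group liveness is not UP. failing=" + failing(components));
    }
}
```

Logging only the failing entries keeps the line short enough to grep for while still naming the component that caused the probe failure.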