Spring Cloud Netflix Discovery Client OUT_OF_SERVICE on startup kubernetes and eureka

Describe the bug

Setup

Spring Boot 2.3.7, Spring Cloud Hoxton SR8, Java 11

Observation

Caution: this could be a race condition, and it might be hard to reproduce sometimes.

Application States

The application starts in the "Starting" state.
The application then changes state to "UP".
The application then briefly changes state to "OUT_OF_SERVICE"; this is reported by the readiness state health probe.
The application reports that state to service-registry-eureka and registers for the first time.
The application changes state to "UP" and tries to register again with the Service-Registry.
But the Service-Registry continues to think the application is "OUT_OF_SERVICE".
The application status is "UP" and it sends heartbeats with "UP" status to Eureka.
But the Service-Registry still continues to think the application is "OUT_OF_SERVICE".

Environment

This happens only when the app is deployed in Kubernetes. We were able to overcome this issue by setting the following environment variable:

- name: "SPRING_MAIN_CLOUD-PLATFORM"
  value: "NONE"

Comment From: spencergibb

There is an EurekaHealthCheckHandler which maps the Boot health status to a Eureka InstanceStatus. With the readiness probe in Boot, this is likely what is causing this. Do you have eureka.client.healthcheck.enabled=true?

Comment From: sabareeshkkanan

I have it enabled. I also tried disabling it, but no luck.

Comment From: dagerber

I have exactly the same issue in two different projects. Each of these projects has this problem with just one microservice out of ten.

It happened after we introduced management.endpoint.health.probes.enabled=true. Our first attempt was to set eureka.client.healthcheck.enabled=true, which didn't help.

The difference between the microservices that always work and the ones that never get out of the OUT_OF_SERVICE state is that, during startup, the failing ones run a HealthIndicator that takes a bit longer (1..3s). If I disable this HealthIndicator, the issue does not occur.

The microservices that don't work first send an OUT_OF_SERVICE. So far so good. The problem is that Eureka discards all of the following heartbeats with state UP, and the service stays in OUT_OF_SERVICE forever.

We could avoid the problem by setting eureka.instance.initial-status=starting. Then the Eureka client does not send OUT_OF_SERVICE during startup, and the problem does not occur even with the long-running HealthIndicator enabled.

But I think there is a bug that should be fixed in eureka-core with the LeaseExistsRule, which shouldn't filter out UP.

see also attached log registry-service-edited.log

Comment From: Lins-bot

We had the same issue. Removing management.endpoint.health.probes.enabled=true in a non-k8s environment solved this issue.

Comment From: caszhou

After debugging the program, I found that the probe settings cause this issue, whether it's in a K8S environment or not.

When I turn off the probe settings, it works both inside and outside a K8S environment.

The cause is that ReadinessStateHealthIndicator returns 'OUT_OF_SERVICE' the first time and always returns 'UP' after that, but the Eureka server still displays 'OUT_OF_SERVICE'.

ReadinessStateHealthIndicator returns 'OUT_OF_SERVICE' because InstanceInfoReplicator calls the getStatus method before Tomcat has started.
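
To make the ordering problem concrete, here is a minimal sketch in plain Java. It does not use Spring or Eureka; the class ReadinessPollSketch and its enums are hypothetical stand-ins, and the REFUSING_TRAFFIC to OUT_OF_SERVICE mapping is the one quoted later in this thread. It only shows that a poll made before the application is marked ready yields OUT_OF_SERVICE, while any later poll yields UP.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in types; not the Spring Boot or Eureka classes.
final class ReadinessPollSketch {
    enum Readiness { REFUSING_TRAFFIC, ACCEPTING_TRAFFIC }
    enum Health { UP, OUT_OF_SERVICE }

    // Assumption: readiness starts as REFUSING_TRAFFIC until the app is ready.
    private final AtomicReference<Readiness> state =
            new AtomicReference<>(Readiness.REFUSING_TRAFFIC);

    /** Called once startup has finished (for example after the web server is up). */
    void markReady() {
        state.set(Readiness.ACCEPTING_TRAFFIC);
    }

    /** What a status poll sees at this moment. */
    Health poll() {
        return state.get() == Readiness.ACCEPTING_TRAFFIC
                ? Health.UP
                : Health.OUT_OF_SERVICE;
    }

    public static void main(String[] args) {
        ReadinessPollSketch app = new ReadinessPollSketch();
        System.out.println("poll before ready: " + app.poll());
        app.markReady();
        System.out.println("poll after ready:  " + app.poll());
    }
}
```

If the first poll is the one that gets registered, the registry's first view of the instance is OUT_OF_SERVICE, which matches the sequence described in the original report.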

Comment From: JokaZhao

We had the same issue +1

Comment From: OlgaMaciaszek

I agree with @dagerber that the root cause seems to be in Netflix/eureka. Have created an issue there.

Comment From: troshko111

OUT_OF_SERVICE is not a client status and should not be reported by the client; this is the root cause. Until the client is ready to take traffic, it should report STARTING. Out of service is a special server-side override which by design forces Eureka to ignore the client-reported status (whatever it is). The use case for out of service is when, say, you deploy in AWS and your AWS ASG is marked as disabled: Eureka (server) detects this and marks all instances OUT_OF_SERVICE regardless of what they are reporting themselves (because the entire ASG is supposed to be disabled).

You need to change your registration / probe integration to report STARTING until it's ready.

Valid transitions are

STARTING -> UP
UP <-> DOWN

and Any <-> OUT_OF_SERVICE, but it must never be reported by the client and is applied by the server based on the compute group status (like ASG in AWS).
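
The client-side part of these rules can be sketched as a small check. This is only an illustration under stated assumptions: ClientTransitions and its nested Status enum are hypothetical names, not Eureka types, and treating a repeated identical status as allowed is an assumption that the comment above does not state.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper; Status is a stand-in, not Eureka's InstanceStatus.
final class ClientTransitions {
    enum Status { STARTING, UP, DOWN, OUT_OF_SERVICE }

    // Statuses a client may report about itself, per the comment above.
    private static final Set<Status> CLIENT_REPORTABLE =
            EnumSet.of(Status.STARTING, Status.UP, Status.DOWN);

    private ClientTransitions() {}

    /** True if a client may move from one self-reported status to another. */
    static boolean isAllowed(Status from, Status to) {
        if (!CLIENT_REPORTABLE.contains(from) || !CLIENT_REPORTABLE.contains(to)) {
            return false; // OUT_OF_SERVICE is applied by the server only
        }
        if (from == to) {
            return true; // assumption: repeating the same status is a no-op
        }
        if (from == Status.STARTING) {
            return to == Status.UP; // STARTING -> UP
        }
        return to != Status.STARTING; // UP <-> DOWN
    }

    public static void main(String[] args) {
        System.out.println(isAllowed(Status.STARTING, Status.UP));
        System.out.println(isAllowed(Status.UP, Status.OUT_OF_SERVICE));
    }
}
```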

Comment From: mxalis

The current implementation of EurekaHealthCheckHandler maps the Spring Boot Status.OUT_OF_SERVICE to a Eureka InstanceStatus.OUT_OF_SERVICE. If eureka.client.healthcheck.enabled = true, then the OUT_OF_SERVICE status is reported to the Eureka server. Then I suppose if the Spring Boot service recovers from the OUT_OF_SERVICE state back to UP, the Eureka server will still keep reporting it as OUT_OF_SERVICE in perpetuity?

Comment From: troshko111

EurekaHealthCheckHandler maps the Spring Boot Status.OUT_OF_SERVICE to a Eureka InstanceStatus.OUT_OF_SERVICE.

This is a problem, you want it to map in a way which only enables these transitions:

STARTING -> UP
UP <-> DOWN

I suggest that when the instance starts, it's in STARTING state until it passes the check (stays in STARTING for as long as it needs, whether it's booting or failing the check), then it transitions to UP, from there it can flip between UP and DOWN depending on the check results, but it should never report OUT_OF_SERVICE itself, as this was designed as a server-side override status, not a real instance status.
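
One way to read this suggestion is as a small stateful mapper. The sketch below is plain Java with hypothetical names (StartupAwareStatusMapper, BootStatus, RegistryStatus); it is not the EurekaHealthCheckHandler implementation. It assumes that anything other than a passing check counts as "not passing", including UNKNOWN, and it deliberately has no OUT_OF_SERVICE value on the output side.

```java
// Hypothetical sketch; the enums are stand-ins, not Spring Boot or Eureka types.
final class StartupAwareStatusMapper {
    enum BootStatus { UP, DOWN, OUT_OF_SERVICE, UNKNOWN }

    // Deliberately no OUT_OF_SERVICE: the client never reports it.
    enum RegistryStatus { STARTING, UP, DOWN }

    // Becomes true after the first passing health check.
    private boolean hasBeenUp = false;

    /** Maps a health check result to the status the client would report. */
    synchronized RegistryStatus map(BootStatus health) {
        if (health == BootStatus.UP) {
            hasBeenUp = true;
            return RegistryStatus.UP;
        }
        // Not passing: STARTING until the first UP, DOWN afterwards.
        return hasBeenUp ? RegistryStatus.DOWN : RegistryStatus.STARTING;
    }

    public static void main(String[] args) {
        StartupAwareStatusMapper mapper = new StartupAwareStatusMapper();
        System.out.println(mapper.map(BootStatus.OUT_OF_SERVICE));
        System.out.println(mapper.map(BootStatus.UP));
        System.out.println(mapper.map(BootStatus.OUT_OF_SERVICE));
    }
}
```

With this shape, a readiness probe that briefly yields OUT_OF_SERVICE during startup is reported as STARTING, so only the STARTING -> UP and UP <-> DOWN transitions can ever be produced.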

Comment From: sabareeshkkanan

@OlgaMaciaszek any update on this? It appears the issue that was opened has been closed, blaming it on the client.

Comment From: jim-olsen

For anyone else who encounters this, I found that setting the initial status did not in fact work in the case of a delayed startup. As a workaround, I created and added this class, and it fixed it by forcing the state to down initially, thus breaking the out of service cycle and allowing the transition back to up:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.boot.context.event.ApplicationReadyEvent;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;

/**
 * There is a bug in the eureka health check indicator that this class works around. Currently, if there is a noticeable
 * delay in starting the application, the eureka health check returns OUT_OF_SERVICE initially. This is a problem as
 * when a service is in this state it will ignore all 'up' reports. The fix here is to initially report us as down,
 * which will override the out of service state, and allow all subsequent transitions to up to occur successfully.
 */
@Component
public class EurekaFix implements HealthIndicator {
    private static final Logger LOG = LoggerFactory.getLogger(EurekaFix.class);

    // volatile: set by the event listener, read by whichever thread runs the health check
    private volatile boolean applicationIsUp = false;

    /**
     * When we receive notification that the application has started, report that we are now in an up state
     */
    @EventListener(ApplicationReadyEvent.class)
    public void onStartup() {
        this.applicationIsUp = true;
        LOG.warn("Application has started, reporting to eureka that application is now available");
    }

    /**
     * Force ourselves into a down state while the application is starting, and transition us to an up state once we
     * have started. Down should override out of service.
     * @return down if we are not yet fully started, otherwise put us in an up state
     */
    @Override
    public Health health() {
        if (!applicationIsUp) {
            LOG.warn("Reporting application as down to eureka as application has not yet started");
            return Health.down().build();
        }

        return Health.up().build();
    }
}
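
Why reporting DOWN first can help is easier to see with a toy model. The sketch below is plain Java and is not Eureka's rule chain; ToyRegistry is a hypothetical class that models only two behaviors reported in this thread: an UP report is ignored while the stored status is OUT_OF_SERVICE, and a DOWN report replaces the stored status.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only; not Eureka server code. All names are hypothetical.
final class ToyRegistry {
    enum Status { STARTING, UP, DOWN, OUT_OF_SERVICE }

    private final Map<String, Status> stored = new HashMap<>();

    /** Applies a client-reported status and returns what the registry now stores. */
    Status report(String instanceId, Status reported) {
        Status existing = stored.get(instanceId);
        // Behavior reported in the thread: UP does not replace OUT_OF_SERVICE.
        boolean ignored = existing == Status.OUT_OF_SERVICE && reported == Status.UP;
        if (!ignored) {
            stored.put(instanceId, reported);
        }
        return stored.get(instanceId);
    }

    public static void main(String[] args) {
        ToyRegistry registry = new ToyRegistry();
        System.out.println(registry.report("app-1", Status.OUT_OF_SERVICE));
        System.out.println(registry.report("app-1", Status.UP));   // stays stuck
        System.out.println(registry.report("app-1", Status.DOWN)); // breaks the cycle
        System.out.println(registry.report("app-1", Status.UP));   // now accepted
    }
}
```

In this model an instance that never reports OUT_OF_SERVICE in the first place does not get stuck at all, which is the same goal the workaround reaches by reporting DOWN during startup.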

Hope this helps someone else out who finds this bug through google like I did.

Comment From: moksamedia

@jim-olsen Thanks! That helped me a ton.

Comment From: lekko

The ReadinessStateHealthIndicator is constructed with

statusMappings.add(ReadinessState.REFUSING_TRAFFIC, Status.OUT_OF_SERVICE)

But the first check may always get REFUSING_TRAFFIC, which gives us OUT_OF_SERVICE. Our clients should never report OUT_OF_SERVICE. This is reserved as a server-side status, so once a client reports OUT_OF_SERVICE, the Eureka registry stops listening for updates.

Since Spring Boot 2.3.2, a k8s environment auto-configures all probe checks through "AvailabilityProbesAutoConfiguration". The same probes can also be enabled explicitly by setting:

management.endpoint.health.probes.enabled=true
management.health.livenessstate.enabled=true
management.health.readinessstate.enabled=true

@jim-olsen that code didn't work in my case. I found two ways to fix this problem:

Set management.health.readinessstate.enabled=false.
Keep the default readinessstate settings, and override the ReadinessStateHealthIndicator class like this:

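package org.springframework.boot.actuate.availability;

import org.springframework.boot.actuate.health.Status;
import org.springframework.boot.availability.ApplicationAvailability;
import org.springframework.boot.availability.AvailabilityState;
import org.springframework.boot.availability.ReadinessState;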

public class ReadinessStateHealthIndicator extends AvailabilityStateHealthIndicator {

    public ReadinessStateHealthIndicator(ApplicationAvailability availability) {
        super(availability, ReadinessState.class, (statusMappings) -> {
            statusMappings.add(ReadinessState.ACCEPTING_TRAFFIC, Status.UP);
            statusMappings.add(ReadinessState.REFUSING_TRAFFIC, Status.DOWN);
        });
    }

    @Override
    protected AvailabilityState getState(ApplicationAvailability applicationAvailability) {
        return applicationAvailability.getReadinessState();
    }

}

Comment From: OlgaMaciaszek

Fixed.

Comment From: DidierLoiseau

@OlgaMaciaszek Could there be something wrong with this issue’s milestone? You indicated 3.1.3 but it was already released end of May. Was it 3.1.4 instead? This would match with Spring Cloud 2021.0.4.

Same goes for #4099, I guess.

Comment From: OlgaMaciaszek

Yes, thanks for reporting this.