Netflix's original version of the Eureka Server refuses to answer clients for a configurable period of time if it starts with an empty registry. This happens:
- for a standalone server
- for a clustered server when it fails to transfer registry information from peers
This behaviour is controlled by the waitTimeInMsWhenSyncEmpty property, which defaults to 5 minutes.
However, that feature seems to be broken in the SpringCloud integrated version. The server answers queries immediately after startup, even with an empty registry. As a consequence, a server restart may cause clients to clear their local caches on their next refresh...
The method PeerAwareInstanceRegistry.shouldAllowAccess() is responsible for enforcing the warmup period if the registry started empty, causing queries to be denied. The interesting part of the code is:
public boolean shouldAllowAccess(boolean remoteRegionRequired) {
    if (this.peerInstancesTransferEmptyOnStartup) {
        if (!(System.currentTimeMillis() > this.startupTime
                + EUREKA_SERVER_CONFIG.getWaitTimeInMsWhenSyncEmpty())) {
            return false;
        }
    }
    // (rest of the method omitted)
The variable startupTime is initially at zero and is initialised only after all attempts to synchronise with peers are over (which may take up to 2.5 minutes to complete: 5 attempts with 30s delay in between). This part of the initialisation is actually done in a separate thread (see EurekaServerInitializerConfiguration.start()) - so the Spring application context finishes its refresh before the Eureka server is fully started.
So startupTime stays at zero for at least the first 2.5 minutes. During that time the check in shouldAllowAccess() compares the current time against 0 + waitTimeInMsWhenSyncEmpty, which is always in the past, so traffic is let in and client requests are accepted. There is no warmup period anymore.
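To make the timing argument concrete, here is a minimal, self-contained sketch of the gate. It is a simplified model, not the actual Eureka class: the class name WarmupGateSketch, the markStarted() helper and the explicit nowMs parameter are assumptions introduced only for illustration.

```java
// Simplified model of the warm-up gate described above (hypothetical class,
// not the real PeerAwareInstanceRegistry).
final class WarmupGateSketch {
    private final long waitTimeInMsWhenSyncEmpty;
    private final boolean peerInstancesTransferEmptyOnStartup;
    // Stays at 0 until the (asynchronous) initialisation has finished.
    private long startupTime = 0L;

    WarmupGateSketch(long waitTimeInMsWhenSyncEmpty, boolean startedEmpty) {
        this.waitTimeInMsWhenSyncEmpty = waitTimeInMsWhenSyncEmpty;
        this.peerInstancesTransferEmptyOnStartup = startedEmpty;
    }

    // Hypothetical helper standing in for the end of the peer sync attempts.
    void markStarted(long nowMs) {
        this.startupTime = nowMs;
    }

    // Same comparison as the snippet above, with the clock passed in explicitly.
    boolean shouldAllowAccess(long nowMs) {
        if (peerInstancesTransferEmptyOnStartup) {
            if (!(nowMs > startupTime + waitTimeInMsWhenSyncEmpty)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        WarmupGateSketch gate = new WarmupGateSketch(5 * 60 * 1000L, true);
        // startupTime is still 0, so the gate is open although the registry is empty.
        System.out.println("before init finished: " + gate.shouldAllowAccess(now));
        gate.markStarted(now);
        // Only now does the warm-up period start to be enforced.
        System.out.println("1s after init: " + gate.shouldAllowAccess(now + 1_000L));
    }
}
```

In this model the gate only closes once startupTime has been set, which matches the window described above in which queries are answered from an empty registry.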
Conclusion: the restart of a standalone Eureka server (SpringCloud version) may cause clients to clear their local caches.
Comment From: dsyer
Are you proposing a change? The waitTimeInMsWhenSyncEmpty is definitely there and has the same default value.
Comment From: brenuart
- Netflix's Eureka server enforces a warmup period in both standalone and clustered mode.
- SpringCloud's version does not enforce this warmup when configured standalone, but only when clustered.
So there is a difference in behaviour between the two. This change is caused by the asynchronous nature of the server bootstrap in SpringCloud. I believe it should be fixed. So yes, this issue is about proposing a change... ;-)
Comment From: dsyer
What kind of change can we make that doesn't leave a bad experience for users getting started with a standalone server? Having the process running but the service down for 5 minutes is not a good experience (hence the async startup). A concrete proposal for what should change really would be appreciated (just looking for help here).
Comment From: brenuart
We made (quite heavy) changes to the bootstrap to avoid the issue. We are still not 100% sure they don't introduce other issues though... I don't have much time right now to describe what we did, but will try to give you a short description this evening.
Comment From: william-tran
I ran into the same issue. My solution was to disable org.springframework.cloud.netflix.eureka.server.EurekaServerInitializerConfiguration.RegistryInstanceProxyInitializer by destroying that bean, and to implement my own listener that just calls safeInit(). I'm using eureka strictly standalone with self preservation mode disabled, which allowed me to make those changes.
One of the reasons for this is the async initialization of eureka with respect to the application context: https://github.com/spring-cloud/spring-cloud-netflix/blob/1.0.3.RELEASE/spring-cloud-netflix-eureka-server/src/main/java/org/springframework/cloud/netflix/eureka/server/EurekaServerInitializerConfiguration.java#L135
The other reason is that we tell eureka that it has > 0 peers even though it has none in our case: https://github.com/spring-cloud/spring-cloud-netflix/blob/1.0.3.RELEASE/spring-cloud-netflix-eureka-server/src/main/java/org/springframework/cloud/netflix/eureka/server/EurekaServerInitializerConfiguration.java#L313
So when initialization calls openForTraffic we are forcing a > 0 count parameter, which means peerInstancesTransferEmptyOnStartup will always be false https://github.com/Netflix/eureka/blob/v1.1.147/eureka-core/src/main/java/com/netflix/eureka/PeerAwareInstanceRegistry.java#L298-L300
and this block is never entered https://github.com/Netflix/eureka/blob/v1.1.147/eureka-core/src/main/java/com/netflix/eureka/PeerAwareInstanceRegistry.java#L392-L397
which means waitTimeInMsWhenSyncEmpty has no effect, and it was designed so clients would not get partial/empty registry info until the server has had enough time to build the registry.
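The interaction between the forced count and the flag can be sketched as follows. This is a reduced model under stated assumptions: the class OpenForTrafficSketch and the warmupApplies() accessor are hypothetical, and the initial value of true for the flag is an assumption about the upstream default.

```java
// Reduced model of how the count passed to openForTraffic affects the warm-up flag
// (hypothetical class, not the real PeerAwareInstanceRegistry).
final class OpenForTrafficSketch {
    // Assumed to start as true, i.e. "registry came up empty".
    private boolean peerInstancesTransferEmptyOnStartup = true;

    // A positive count is read as "instances were transferred from peers",
    // which switches the warm-up check off.
    void openForTraffic(int count) {
        if (count > 0) {
            this.peerInstancesTransferEmptyOnStartup = false;
        }
    }

    // Hypothetical accessor: true means the warm-up block would still be entered.
    boolean warmupApplies() {
        return peerInstancesTransferEmptyOnStartup;
    }

    public static void main(String[] args) {
        OpenForTrafficSketch forced = new OpenForTrafficSketch();
        forced.openForTraffic(1); // count forced to > 0, as described above
        System.out.println("forced count 1, warm-up applies: " + forced.warmupApplies());

        OpenForTrafficSketch empty = new OpenForTrafficSketch();
        empty.openForTraffic(0); // standalone server that really started empty
        System.out.println("count 0, warm-up applies: " + empty.warmupApplies());
    }
}
```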
The changes I made do allow waitTimeInMsWhenSyncEmpty to control the "warm up time". Maybe we get rid of TrafficOpener? For standalone instances, the threshold will still be reevaluated when com.netflix.eureka.PeerAwareInstanceRegistry.updateRenewalThreshold() gets called by the TimerTask
Comment From: spencergibb
@william-tran I'm working on upgrading eureka. #594 makes the values you're talking about configurable.
Comment From: spencergibb
@william-tran @brenuart with 1e716ff2c20185421556ef6919a759768dfb38fc there are some configurable values.
@Value("${eureka.server.expectedNumberOfRenewsPerMin:1}")
@Value("${eureka.server.defaultOpenForTrafficCount:1}")
We tend to optimize for standalone mode and let users configure for peering.
Are these enough? Thoughts?
Comment From: brenuart
@spencergibb After a first (quick) look at Brixton.M3, it looks like the two properties you mentioned above have no effect: their values seem to be unconditionally overridden at line https://github.com/spring-cloud/spring-cloud-netflix/blob/master/spring-cloud-netflix-eureka-server/src/main/java/org/springframework/cloud/netflix/eureka/server/EurekaServerConfiguration.java#L129
Comment From: spencergibb
@brenuart dang it. Nice catch. Those aren't supposed to be there :-(
Comment From: spencergibb
@brenuart those have been removed here 415cd1d83c3fcbc3a216ec8a34aab34e1055f8ed. Thanks again.
Comment From: brenuart
A few words about the latest changes...
When started empty (i.e. no peers or failed to transfer data from them), the Eureka server is supposed to refuse queries, but accept new registrations (including renewals), for waitTimeInMsWhenSyncEmpty. This behaviour is referred to as the warmup time.
Unfortunately, the latest changes do not fix that yet. To fix it we simply overrode the SpringCloud InstanceRegistry as follows:
public FixedInstanceRegistry(
        EurekaServerConfig serverConfig, EurekaClientConfig clientConfig,
        ServerCodecs serverCodecs, EurekaClient eurekaClient)
{
    // Keep Netflix defaults for expectedNumberOfRenewsPerMin and defaultOpenForTrafficCount
    super(serverConfig, clientConfig, serverCodecs, eurekaClient, 0, 0);
}

@Override
public void register(InstanceInfo info, int leaseDuration, boolean isReplication) {
    super.register(info, leaseDuration, isReplication);
    try {
        readLock.lock();
        if (this.expectedNumberOfRenewsPerMin == 0) {
            synchronized (lock) {
                // Adjust value ourselves since PeerAwareInstanceRegistryImpl won't do it if
                // expectedNumberOfRenewsPerMin==0 - sounds like a bug when Eureka is
                // started empty without peers (not a frequent deployment scenario at Netflix)
                this.expectedNumberOfRenewsPerMin = 2;
                this.numberOfRenewsPerMinThreshold =
                        (int) (2 * serverConfig.getRenewalPercentThreshold());
            }
            // New registration should also count for one renewal
            super.renew(info.getAppName(), info.getId(), isReplication);
        }
    }
    finally {
        readLock.unlock();
    }
}
As you can see, we don't make use of the new properties. Setting them to a value other than 0 at startup will even break things like Eureka's self preservation mode (expected number of renewals won't match the number of registered instances).
In my view, the problem has its roots in Netflix's implementation of AbstractInstanceRegistry.register(). If you dig into the code, you will notice expectedNumberOfRenewsThreshold is updated only if expectedNumberOfRenewsPerMin > 0, which doesn't make much sense if the registry started empty. This piece of code seems to be a copy/paste of what happens in the cancel() method. The problem appears only for standalone registries started empty without any peers. That is not a standard deployment scenario at Netflix, which is probably why this problem did not show up earlier.
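The effect of that guard can be illustrated with a small stand-alone model. Everything here is a sketch under assumptions: the class RenewalThresholdSketch and its two methods are hypothetical, the "+2 expected renewals per registered instance" increment is an assumption about the upstream logic, and 0.85 is used as an assumed value for the renewal percent threshold.

```java
// Stand-alone model of the renewal threshold bookkeeping (hypothetical class).
final class RenewalThresholdSketch {
    int expectedNumberOfRenewsPerMin = 0;
    int numberOfRenewsPerMinThreshold = 0;
    private final double renewalPercentThreshold;

    RenewalThresholdSketch(double renewalPercentThreshold) {
        this.renewalPercentThreshold = renewalPercentThreshold;
    }

    // Guarded update as described above: nothing happens while the counter is 0.
    // The "+2 per registration" step is an assumption made for this sketch.
    void registerGuarded() {
        if (expectedNumberOfRenewsPerMin > 0) {
            expectedNumberOfRenewsPerMin += 2;
            numberOfRenewsPerMinThreshold =
                    (int) (expectedNumberOfRenewsPerMin * renewalPercentThreshold);
        }
    }

    // Mirrors the workaround: run the guarded update first, then seed the
    // counters if the registry is still at 0 afterwards.
    void registerWithSeed() {
        registerGuarded();
        if (expectedNumberOfRenewsPerMin == 0) {
            expectedNumberOfRenewsPerMin = 2;
            numberOfRenewsPerMinThreshold = (int) (2 * renewalPercentThreshold);
        }
    }

    public static void main(String[] args) {
        RenewalThresholdSketch stuck = new RenewalThresholdSketch(0.85);
        stuck.registerGuarded();
        stuck.registerGuarded();
        System.out.println("guarded only: expected=" + stuck.expectedNumberOfRenewsPerMin
                + " threshold=" + stuck.numberOfRenewsPerMinThreshold);

        RenewalThresholdSketch seeded = new RenewalThresholdSketch(0.85);
        seeded.registerWithSeed();
        seeded.registerWithSeed();
        System.out.println("with seed: expected=" + seeded.expectedNumberOfRenewsPerMin
                + " threshold=" + seeded.numberOfRenewsPerMinThreshold);
    }
}
```

In this model a registry that starts at zero never leaves zero with the guarded update alone, while seeding on the first registration lets later registrations move the threshold.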
Comment From: william-tran
I set up a load testing app that registers itself with eureka, and continuously loops through fetching the registry from the eureka server and checking to see if its local cache gets emptied or if it still contains an entry for itself. On the latest snapshot, by default, warmup time is not observed, and the load test app sees its local cache (of itself) emptied when I restart the server. Setting eureka.server.defaultOpenForTrafficCount=0 enables the warmup time, and the load testing app never gets its local cache emptied once the server comes back online.
Comment From: spencergibb
@william-tran trying to catch up after being out. What else do we need to do for your requirements?
Comment From: william-tran
I think we're good here on the warmup time, it's definitely observed in my test, but @brenuart is right in how expectedNumberOfRenewsThreshold will always be at 0 and never get updated for a standalone server. The only way for the threshold to be updated is if eureka.client.fetchRegistry=true and, in the case of a standalone server, eureka.client.serviceUrl.defaultZone=http://127.0.0.1:${server.port}/eureka/, so that updateRenewalThreshold() can fetch the registry (from itself over loopback) and get a count > 0. I'd regard that as a different bug (and on Eureka's side) however.
Comment From: spencergibb
@brenuart or @william-tran any recommended changes on our side? Is this still a valid issue?
Comment From: william-tran
"Setting eureka.server.defaultOpenForTrafficCount=0 enables the warmup time"
That still holds, but other than this thread there isn't anything documenting how that property affects warmup behaviour, so documenting it is the only change I'd recommend.
Comment From: alceil
Hey @OlgaMaciaszek, I am Ashish Tom, a contributor to OpenForce 2022. I would like to work on this issue. I will open a PR as soon as I have resolved it. Can you also guide me on how to get started and give me some pointers? Thank you
Comment From: OlgaMaciaszek
Hi, @alceil, have assigned it to you. If you need any help, please contact the mentors in the spring cloud channel in the OpenForce discord.
Comment From: alceil
Thanks @OlgaMaciaszek
Comment From: OlgaMaciaszek
@alceil Please read carefully through the discussion above. As you will see in the comments, there are certain behaviours that might not be obvious to our users. Specifically, this comment contains useful information that should be added to our docs. You should add it to https://github.com/spring-cloud/spring-cloud-netflix/blob/main/docs/src/main/asciidoc/spring-cloud-netflix.adoc. Possibly the best place for it will be the How to Run a Eureka Server section.
Comment From: alceil
Got it @OlgaMaciaszek
Comment From: alceil
I have raised a PR. Can you please review it, @OlgaMaciaszek?