We use spring-cloud-netflix-eureka-server version 1.4.4.RELEASE. We have 300+ microservices (1500+ instances) in our production environment. We run 4 Eureka instances, which occasionally report a Read timed out exception when an instance replicates data to its peer nodes. The endpoint involved is '/eureka/peerreplication/batch'.
Exception: eureka.cluster.ReplicationTaskProcessor: Network level connection to peer 10.54.54.54; retrying after delay. com.sun.jersey.api.client.ClientHandlerException: java.net.SocketTimeoutException: Read timed out
Analysis: https://github.com/Netflix/eureka/blob/v1.7.2/eureka-core/src/main/java/com/netflix/eureka/resources/PeerReplicationResource.java
@Path("batch")
@POST
public Response batchReplication(ReplicationList replicationList) {
    try {
        ReplicationListResponse batchResponse = new ReplicationListResponse();
        for (ReplicationInstance instanceInfo : replicationList.getReplicationList()) {
            try {
                batchResponse.addResponse(dispatch(instanceInfo));
            } catch (Exception e) {
                batchResponse.addResponse(new ReplicationInstanceResponse(Status.INTERNAL_SERVER_ERROR.getStatusCode(), null));
                logger.error("{} request processing failed for batch item {}/{}",
                        instanceInfo.getAction(), instanceInfo.getAppName(), instanceInfo.getId(), e);
            }
        }
        return Response.ok(batchResponse).build();
    } catch (Throwable e) {
        logger.error("Cannot execute batch Request", e);
        return Response.status(Status.INTERNAL_SERVER_ERROR).build();
    }
}
https://github.com/spring-cloud/spring-cloud-netflix/blob/v1.4.4.RELEASE/spring-cloud-netflix-eureka-server/src/main/java/org/springframework/cloud/netflix/eureka/server/InstanceRegistry.java
@Override
public boolean renew(final String appName, final String serverId,
        boolean isReplication) {
    log("renew " + appName + " serverId " + serverId + ", isReplication {}"
            + isReplication);
    List<Application> applications = getSortedApplications();
    for (Application input : applications) {
        if (input.getName().equals(appName)) {
            InstanceInfo instance = null;
            for (InstanceInfo info : input.getInstances()) {
                if (info.getId().equals(serverId)) {
                    instance = info;
                    break;
                }
            }
            publishEvent(new EurekaInstanceRenewedEvent(this, appName, serverId,
                    instance, isReplication));
            break;
        }
    }
    return super.renew(appName, serverId, isReplication);
}
When a replication batch contains more than 200 instances, the '/eureka/peerreplication/batch' call easily takes over 200 ms. The batch items are dispatched one after another, and each renew calls getSortedApplications(), which takes about 1 ms to execute, so 200 items add up to roughly 200 ms. Our temporary solution: when isReplication is true, getSortedApplications() is not executed and the EurekaInstanceRenewedEvent is not published.
@Override
public boolean renew(final String appName, final String serverId,
        boolean isReplication) {
    log("renew " + appName + " serverId " + serverId + ", isReplication {}"
            + isReplication);
    if (!isReplication) {
        List<Application> applications = getSortedApplications();
        for (Application input : applications) {
            if (input.getName().equals(appName)) {
                InstanceInfo instance = null;
                for (InstanceInfo info : input.getInstances()) {
                    if (info.getId().equals(serverId)) {
                        instance = info;
                        break;
                    }
                }
                publishEvent(new EurekaInstanceRenewedEvent(this, appName, serverId,
                        instance, isReplication));
                break;
            }
        }
    }
    return super.renew(appName, serverId, isReplication);
}
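One direction that might keep the event for replicated renews is to look the instance up directly instead of copying and sorting every application on each call. The following is only a minimal sketch of that idea using stand-in types. The class RenewLookupSketch, its nested-map registry, and both method names are hypothetical and are not the real Eureka or Spring Cloud classes. It assumes the registry can be addressed by application name and instance id, and it has not been tested against Eureka itself.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a service registry; NOT the real Eureka classes.
public class RenewLookupSketch {

    // appName -> (instanceId -> instance payload). Assumed structure for illustration.
    private final Map<String, Map<String, String>> registry = new ConcurrentHashMap<>();

    public void register(String appName, String instanceId, String payload) {
        registry.computeIfAbsent(appName, k -> new ConcurrentHashMap<>())
                .put(instanceId, payload);
    }

    // Same shape as the renew() above: copy and sort all application names,
    // then scan for the one we want. The cost grows with the number of
    // applications and is paid once per renewed instance.
    public String findBySortedScan(String appName, String instanceId) {
        List<String> names = new ArrayList<>(registry.keySet());
        names.sort(Comparator.naturalOrder());
        for (String name : names) {
            if (name.equals(appName)) {
                return registry.get(name).get(instanceId);
            }
        }
        return null;
    }

    // Direct lookup: two hash lookups, independent of how many applications exist.
    public String findDirect(String appName, String instanceId) {
        Map<String, String> instances = registry.get(appName);
        return instances == null ? null : instances.get(instanceId);
    }

    public static void main(String[] args) {
        RenewLookupSketch sketch = new RenewLookupSketch();
        // Roughly the scale described above: 300 applications, 5 instances each.
        for (int app = 0; app < 300; app++) {
            for (int i = 0; i < 5; i++) {
                sketch.register("APP-" + app, "instance-" + i, "payload-" + app + "-" + i);
            }
        }
        // Both strategies resolve the same instance; only the work per call differs.
        System.out.println(sketch.findBySortedScan("APP-42", "instance-3"));
        System.out.println(sketch.findDirect("APP-42", "instance-3"));
    }
}
```

If a direct lookup like this is possible in the real registry, the event could still be published for replicated renews without the per-item sort, at the cost of depending on the registry's internal layout.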
Do you have a better way? Thank you!
Comment From: marcingrzejszczak
Can you please check the latest 1.4.7.RELEASE version and see if the problem persists? BTW, the 1.4.x branch will soon no longer be supported, so we suggest that you upgrade to the latest stable release.
Comment From: qinxiongzhou
After checking the code, the 1.4.7.RELEASE version and the v2.2.0.M1 version also have this problem.
Comment From: qinxiongzhou
@spencergibb Please help me to take a look.
Comment From: spencergibb
Closing this due to inactivity. Please re-open if there's more to discuss.