I am having problems with a eureka client no longer being able to renew with the eureka servers.

Sorry for the long post, but I want to include as much detail as possible. The short version is that the eureka servers are returning errors and, as a result, the eureka clients leave sockets open and can no longer communicate with the servers.

I am using spring-cloud 1.0.0-release on a redhat box running oracle java version "1.7.0_45".

I have a cluster of 2 eureka servers and several clients. Occasionally (I am not sure what the trigger is), a client tries to send a heartbeat to the server and the server fails with a null pointer exception while it is updating its statistics:

(this is from the eureka server log)
2015-05-21 15:30:18,160 ServoPollScheduler-2 WARN  PollRunnable - - - - failed to send metrics to spring-boot

java.lang.NullPointerException: null

The server then sends back an error message to the client:

{"timestamp":1432236643203,"status":500,"error":"Internal Server Error","exception":"java.lang.NullPointerException","message":"No message available","path":"/eureka/apps/BASEPRICE-SERVICE/lcomqnasv15.xxx.com"}

There are 2 threads involved here, and that might be part of the problem. A different thread also seems to be hitting the primary eureka server.

(this is from the eureka client log)
2015-05-21 15:30:43,206 pool-2-thread-1 WARN  DiscoveryClient - - - - Can't get a response from http://lcomqnasv10:8761/eureka/apps/BASEPRICE-SERVICE/lcomqnasv15.xxx.com
java.lang.RuntimeException: Bad status: 500
        at com.netflix.discovery.DiscoveryClient.makeRemoteCall(DiscoveryClient.java:1155) [06f98804e83cf4a94380b46591b976b1d17c36b8-eureka-client-1.1.147.jar:1.1.147]
        at com.netflix.discovery.DiscoveryClient.makeRemoteCall(DiscoveryClient.java:1060) [06f98804e83cf4a94380b46591b976b1d17c36b8-eureka-client-1.1.147.jar:1.1.147]
        at com.netflix.discovery.DiscoveryClient.access$500(DiscoveryClient.java:105) [06f98804e83cf4a94380b46591b976b1d17c36b8-eureka-client-1.1.147.jar:1.1.147]
        at com.netflix.discovery.DiscoveryClient$HeartbeatThread.run(DiscoveryClient.java:1583) [06f98804e83cf4a94380b46591b976b1d17c36b8-eureka-client-1.1.147.jar:1.1.147]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_45]
        at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_45]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_45]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_45]
        at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]
2015-05-21 15:30:43,207 pool-2-thread-1 WARN  DiscoveryClient - - - - Trying backup: http://lcomqnasv09:8761/eureka/
2015-05-21 15:30:43,678 pool-3-thread-1 WARN  DiscoveryClient - - - - Can't get a response from http://lcomqnasv10:8761/eureka/apps/delta
com.sun.jersey.api.client.ClientHandlerException: java.util.zip.ZipException: Not in GZIP format
        at com.sun.jersey.api.client.filter.GZIPContentEncodingFilter.handle(GZIPContentEncodingFilter.java:131) ~[jersey-client-1.13.jar!/:1.13]
        at com.netflix.discovery.EurekaIdentityHeaderFilter.handle(EurekaIdentityHeaderFilter.java:28) ~[06f98804e83cf4a94380b46591b976b1d17c36b8-eureka-client-1.1.147.jar:1.1.147]
        at com.sun.jersey.api.client.Client.handle(Client.java:648) ~[jersey-client-1.13.jar!/:1.13]
        at com.sun.jersey.api.client.WebResource.handle(WebResource.java:680) ~[jersey-client-1.13.jar!/:1.13]
        at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74) ~[jersey-client-1.13.jar!/:1.13]
        at com.sun.jersey.api.client.WebResource$Builder.get(WebResource.java:507) ~[jersey-client-1.13.jar!/:1.13]
        at com.netflix.discovery.DiscoveryClient.getUrl(DiscoveryClient.java:1567) [06f98804e83cf4a94380b46591b976b1d17c36b8-eureka-client-1.1.147.jar:1.1.147]
        at com.netflix.discovery.DiscoveryClient.makeRemoteCall(DiscoveryClient.java:1122) [06f98804e83cf4a94380b46591b976b1d17c36b8-eureka-client-1.1.147.jar:1.1.147]
        at com.netflix.discovery.DiscoveryClient.makeRemoteCall(DiscoveryClient.java:1060) [06f98804e83cf4a94380b46591b976b1d17c36b8-eureka-client-1.1.147.jar:1.1.147]
        at com.netflix.discovery.DiscoveryClient.getAndUpdateDelta(DiscoveryClient.java:869) [06f98804e83cf4a94380b46591b976b1d17c36b8-eureka-client-1.1.147.jar:1.1.147]
        at com.netflix.discovery.DiscoveryClient.fetchRegistry(DiscoveryClient.java:748) [06f98804e83cf4a94380b46591b976b1d17c36b8-eureka-client-1.1.147.jar:1.1.147]
        at com.netflix.discovery.DiscoveryClient.access$1400(DiscoveryClient.java:105) [06f98804e83cf4a94380b46591b976b1d17c36b8-eureka-client-1.1.147.jar:1.1.147]
        at com.netflix.discovery.DiscoveryClient$CacheRefreshThread.run(DiscoveryClient.java:1723) [06f98804e83cf4a94380b46591b976b1d17c36b8-eureka-client-1.1.147.jar:1.1.147]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_45]
        at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_45]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_45]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_45]
        at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]
Caused by: java.util.zip.ZipException: Not in GZIP format
        at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164) ~[na:1.7.0_45]
        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:78) ~[na:1.7.0_45]
        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:90) ~[na:1.7.0_45]
        at com.sun.jersey.api.client.filter.GZIPContentEncodingFilter.handle(GZIPContentEncodingFilter.java:129) ~[jersey-client-1.13.jar!/:1.13]
        ... 17 common frames omitted
2015-05-21 15:30:43,679 pool-3-thread-1 WARN  DiscoveryClient - - - - Trying backup: http://lcomqnasv09:8761/eureka/

The response not being in gzip format seemed weird, so I ran a packet capture and sure enough, it's true. The response header coming from the second call is:

HTTP/1.1 500 Internal Server Error
Server: Apache-Coyote/1.1
Content-Encoding: gzip
Content-Type: application/json;charset=UTF-8
Transfer-Encoding: chunked
Date: Thu, 21 May 2015 19:30:43 GMT
Connection: close

but the payload itself is not gzipped (I can send over the packet capture if it would help).
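The client-side symptom can be reproduced in isolation. The sketch below is a minimal reproduction under stated assumptions: the class and method names are hypothetical, and the JSON body is a shortened stand-in for the error response quoted earlier. It hands a plain JSON payload to `java.util.zip.GZIPInputStream`, which is the constructor that `GZIPContentEncodingFilter` reaches in the stack trace above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration; not part of eureka or Jersey.
final class GzipMismatchDemo {

    // Tries to open the body as a gzip stream, as a client would when the
    // response claims Content-Encoding: gzip. Returns null if the gzip header
    // is valid, otherwise the message of the exception raised while reading it.
    static String gunzipError(byte[] body) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Plain JSON, not compressed, like the payload seen in the packet capture.
        byte[] plainJson = "{\"status\":500,\"error\":\"Internal Server Error\"}"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(gunzipError(plainJson));
    }
}
```

The failure happens while the gzip header is read, before any of the body is consumed.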

Somewhere in these errors, the connection to the primary eureka server is not closed completely. When the primary server errors out the first time, we see the normal socket shutdown: the server sends a FIN to the client and the client responds with a FIN-ACK.

For the gzip error, the socket shutdown is not complete: the server sends a FIN, but the client never sends a FIN-ACK. This leaves the socket in a CLOSE_WAIT state on the client.
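A minimal loopback sketch of that state, assuming nothing about the eureka client internals (the class and method names are hypothetical): once the peer has closed its side, the local side sees end-of-stream, and the operating system keeps the local socket in CLOSE_WAIT until the local side closes it as well.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo class; it only uses loopback sockets from the JDK.
final class HalfCloseDemo {

    // Connects to a local server that closes the connection immediately and
    // returns what the client reads afterwards (-1 means end-of-stream).
    static int readAfterPeerClose() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            client.setSoTimeout(2000);      // do not hang if something goes wrong
            server.accept().close();        // the peer closes first and sends its FIN
            int result = client.getInputStream().read();
            // At this point the client socket is half-closed. It stays in
            // CLOSE_WAIT until close() runs, here via try-with-resources.
            return result;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAfterPeerClose());
    }
}
```

If the code path that handles a failed response never reaches `close()`, the socket stays in CLOSE_WAIT, which matches what netstat reports on the client.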

If I run a simple: netstat -ant |grep CLOSE_WAIT | wc -l

That number keeps increasing until it hits 100 (50 going to each of the eureka servers). At 50 open connections per eureka server, the client can no longer talk to the server (the client can't get a socket from the pool):

2015-05-21 23:59:26,788 pool-2-thread-1 WARN  DiscoveryClient - - - - Can't get a response from http://lcomqnasv10:8761/eureka/apps/BASEPRICE-SERVICE/lcomqnasv15.xxx.com
com.sun.jersey.api.client.ClientHandlerException: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
        at com.sun.jersey.client.apache4.ApacheHttpClient4Handler.handle(ApacheHttpClient4Handler.java:184) ~[jersey-apache-client4-1.11.jar!/:1.11]
        at com.sun.jersey.api.client.filter.GZIPContentEncodingFilter.handle(GZIPContentEncodingFilter.java:120) ~[jersey-client-1.13.jar!/:1.13]
        at com.netflix.discovery.EurekaIdentityHeaderFilter.handle(EurekaIdentityHeaderFilter.java:28) ~[06f98804e83cf4a94380b46591b976b1d17c36b8-eureka-client-1.1.147.jar:1.1.147]
        at com.sun.jersey.api.client.Client.handle(Client.java:648) ~[jersey-client-1.13.jar!/:1.13]
        at com.sun.jersey.api.client.WebResource.handle(WebResource.java:680) ~[jersey-client-1.13.jar!/:1.13]
        at com.sun.jersey.api.client.WebResource.put(WebResource.java:211) ~[jersey-client-1.13.jar!/:1.13]
        at com.netflix.discovery.DiscoveryClient.makeRemoteCall(DiscoveryClient.java:1097) [06f98804e83cf4a94380b46591b976b1d17c36b8-eureka-client-1.1.147.jar:1.1.147]
        at com.netflix.discovery.DiscoveryClient.makeRemoteCall(DiscoveryClient.java:1060) [06f98804e83cf4a94380b46591b976b1d17c36b8-eureka-client-1.1.147.jar:1.1.147]
        at com.netflix.discovery.DiscoveryClient.access$500(DiscoveryClient.java:105) [06f98804e83cf4a94380b46591b976b1d17c36b8-eureka-client-1.1.147.jar:1.1.147]
        at com.netflix.discovery.DiscoveryClient$HeartbeatThread.run(DiscoveryClient.java:1583) [06f98804e83cf4a94380b46591b976b1d17c36b8-eureka-client-1.1.147.jar:1.1.147]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_45]
        at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_45]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_45]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_45]
        at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]
Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
        at org.apache.http.impl.conn.tsccm.ConnPoolByRoute.getEntryBlocking(ConnPoolByRoute.java:412) ~[httpclient-4.3.6.jar!/:4.3.6]
        at com.netflix.http4.NamedConnectionPool.getEntryBlocking(NamedConnectionPool.java:141) ~[ribbon-httpclient-2.0-RC13.jar!/:na]
        at org.apache.http.impl.conn.tsccm.ConnPoolByRoute$1.getPoolEntry(ConnPoolByRoute.java:298) ~[httpclient-4.3.6.jar!/:4.3.6]
        at org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager$1.getConnection(ThreadSafeClientConnManager.java:238) ~[httpclient-4.3.6.jar!/:4.3.6]
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423) ~[httpclient-4.3.6.jar!/:4.3.6]
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863) ~[httpclient-4.3.6.jar!/:4.3.6]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:115) ~[httpclient-4.3.6.jar!/:4.3.6]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57) ~[httpclient-4.3.6.jar!/:4.3.6]
        at com.sun.jersey.client.apache4.ApacheHttpClient4Handler.handle(ApacheHttpClient4Handler.java:170) ~[jersey-apache-client4-1.11.jar!/:1.11]
        ... 14 common frames omitted
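The exhaustion can be modeled with a counting semaphore. This is a simplified sketch, not the HttpClient pool implementation: the class and method names are hypothetical, the limit of 50 is taken from the netstat count above, and where the real pool blocks and then times out, the model simply refuses the lease.

```java
import java.util.concurrent.Semaphore;

// Simplified model of a per-route connection pool; names are hypothetical.
final class LeakyPoolModel {

    // Runs `attempts` requests that all fail while the response is decoded,
    // and returns how many of them managed to lease a connection.
    static int leasedCount(int poolSize, int attempts, boolean releaseOnFailure) {
        Semaphore pool = new Semaphore(poolSize);
        int leased = 0;
        for (int i = 0; i < attempts; i++) {
            if (!pool.tryAcquire()) {
                continue; // the real pool would wait here and then time out
            }
            leased++;
            try {
                // Simulated decoding failure on the response.
                throw new IllegalStateException("Not in GZIP format");
            } catch (IllegalStateException e) {
                if (releaseOnFailure) {
                    pool.release(); // the connection goes back to the pool
                }
                // Otherwise the lease is never returned, as with the leaked sockets.
            }
        }
        return leased;
    }

    public static void main(String[] args) {
        System.out.println(leasedCount(50, 60, false)); // leaking error path
        System.out.println(leasedCount(50, 60, true));  // error path that releases
    }
}
```

With the leaking error path only the first 50 requests get a connection; when the error path releases the lease, all 60 do.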

Comment From: kkalmbach

Looking at it a little closer, one thread is doing a renewal and one thread is doing a get of the deltas. The server is generating a null pointer exception for both calls.

The client handles the error response for the renewal call just fine and moves on. The error response for the delta call reports the same null pointer exception, but with a bad Content-Encoding header, and that is the thread that leaves the connection open.

In both calls, the eureka server tried to close the connection.

I guess there are 3 things that are coming together to create problems for me:

1. The null pointer exception on the server.
2. The wrong Content-Encoding being generated on the server.
3. The client leaving a half-closed connection when these errors occur.

Comment From: dsyer

Thanks for the analysis. Can you point out the cause of the NPE? Is it easy to fix?

Comment From: kkalmbach

I think the NPE was from servo at PollRunnable, in this code (Line 81 in version 0.7.4 of servo):

    for (MetricObserver o : observers) {
        try {
            o.update(metrics);
        } catch (Throwable t) {
            LOGGER.warn("failed to send metrics to " + o.getName(), t);
        }
    }



We upgraded from Spring-cloud 1.0.0 to 1.0.1 and the problem went away. I think we can mark this as fixed.

Comment From: vijaymanda

I am also facing the same issue (the Eureka client leaving sockets open in the CLOSE_WAIT state). Could you please provide a solution for closing these eureka connections that are stuck in CLOSE_WAIT?

Comment From: dsyer

I think if you are still experiencing this or a similar problem it’s probably best to open a new issue.