Describe the bug
Background
The scenario is:
- using a Feign client that is configured by name (e.g.
@FeignClient(name = "downstream-service")) rather than by URL - using Spring-Cloud Loadbalancer with retry enabled (i.e. we're using the
RetryableFeignBlockingLoadBalancerClient) - the downstream service instance(s) are taking longer to respond than the configured Feign client timeout used by the client application.
Correct behaviour
- Occurs when
spring.cloud.loadbalancer.retry.max-retries-on-same-service-instanceis set to 0 - The client application retries according to the configuration of
max-retries-on-next-service-instance, and then at the end logs an error like this:
14:28:45.068 [http-nio-1234-exec-2] ERROR o.a.c.c.C.[.[.[.[dispatcherServlet] - Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is feign.RetryableException: Read timed out executing GET http://downstream-service/api/something] with root cause
java.net.SocketTimeoutException: Read timed out
at java.base/sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:283)
...
at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:529)
at feign.Client$Default.convertResponse(Client.java:110)
at feign.Client$Default.execute(Client.java:106)
at org.springframework.cloud.openfeign.loadbalancer.LoadBalancerUtils.executeWithLoadBalancerLifecycleProcessing(LoadBalancerUtils.java:57)
at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.lambda$execute$2(RetryableFeignBlockingLoadBalancerClient.java:168)
at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:329)
at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:225)
at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.execute(RetryableFeignBlockingLoadBalancerClient.java:114)
at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:121)
at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:91)
at feign.ReflectiveFeign$FeignInvocationHandler.invoke(ReflectiveFeign.java:100)
at jdk.proxy2/jdk.proxy2.$Proxy107.getMeSomething(Unknown Source)
at com.oliverlockwood.retrybug.ClientController.hitTheDownstreamService(ClientController.java:20)
...
which, in my opinion, is perfectly reasonable.
Incorrect behaviour (i.e. the bug)
- Occurs when
spring.cloud.loadbalancer.retry.max-retries-on-same-service-instanceis set to 1 or higher - The client application retries according to the configuration of
max-retries-on-same-service-instance, andmax-retries-on-next-service-instance, and then at the end logs an error like this:
4:30:46.549 [http-nio-1234-exec-2] ERROR o.a.c.c.C.[.[.[.[dispatcherServlet] - Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is feign.RetryableException: downstream-service executing GET http://downstream-service/api/something] with root cause
java.net.UnknownHostException: downstream-service
at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:567)
...
at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:529)
at feign.Client$Default.convertResponse(Client.java:110)
at feign.Client$Default.execute(Client.java:106)
at org.springframework.cloud.openfeign.loadbalancer.LoadBalancerUtils.executeWithLoadBalancerLifecycleProcessing(LoadBalancerUtils.java:57)
at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.lambda$execute$2(RetryableFeignBlockingLoadBalancerClient.java:168)
at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:329)
at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:225)
at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.execute(RetryableFeignBlockingLoadBalancerClient.java:114)
at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:121)
at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:91)
at feign.ReflectiveFeign$FeignInvocationHandler.invoke(ReflectiveFeign.java:100)
at jdk.proxy2/jdk.proxy2.$Proxy107.getMeSomething(Unknown Source)
at com.oliverlockwood.retrybug.ClientController.hitTheDownstreamService(ClientController.java:20)
...
which, in my opinion, is not acceptable. The actual underlying issue is lost, and in its place there's now a spurious, misleading UnknownHostException, because for some reason the implementation tries to hit the literal URL http://downstream-service/api/something, abandoning the whole concept of service discovery.
Workaround
Always set spring.cloud.loadbalancer.retry.max-retries-on-same-service-instance to 0.
Affected versions
I have explicitly proven that this bug reproduces on the following versions:
- Spring-Boot 3.0.3 & Spring-Cloud 2022.0.1
- Spring-Boot 2.7.9 & Spring-Cloud 2021.0.6
- Spring-Boot 2.7.7 & Spring-Cloud 2021.0.5
- Spring-Boot 2.6.2 & Spring-Cloud 2021.0.0
However, I suspect this bug has existed ever since the historical migration from Netflix Ribbon Loadbalancer to Spring-Cloud Loadbalancer.
Note: This is effectively a re-raise of #650, which was opened in December 2021, but closed by the maintainers because a detailed reproduction example was not provided at that time.
Sample
This is demonstrated in https://github.com/oliverlockwood/spring-cloud-openfeign-retry-sameserver-bug. Follow the README therein.
Comment From: OlgaMaciaszek
Thanks, @oliverlockwood . We'll take a look at it.
Comment From: apikozh
Heh! @oliverlockwood, overtook a little 😄 I was in process of writing the same bug request in parallel with you. Found it about a week ago.
The problem is caused by the fact that serviceInstance from the retry context is not set to retrievedServiceInstance, and logic assumes that retrieved service instance is null (absent) and uses original request with originalUri. But originalUri contains FeignClient.url or "http://" + FeignClient.name + ":80", if FeignClient.url is not set. Part of RetryableFeignBlockingLoadBalancerClient
if (context instanceof LoadBalancedRetryContext) {
LoadBalancedRetryContext lbContext = (LoadBalancedRetryContext) context;
ServiceInstance serviceInstance = lbContext.getServiceInstance();
retrievedServiceInstance = serviceInstance; // Should be added!
if (serviceInstance == null) {
...
}
if (retrievedServiceInstance == null) {
...
I'll also leave my sample here, it might be useful: loadbalancer-retry-test.zip
Comment From: oliverlockwood
@apikozh great minds :)
If you've got a proposed fix, why not raise it as a PR?
Comment From: apikozh
Possible fix in #841, but not tested it yet
Comment From: oliverlockwood
For what it's worth, I manually built the fix in #841 locally, and ran my example application using the versions installed from this, and it does appear to me to fix the issue.
I end up with this error log after the retries:
21:11:31.218 [http-nio-1234-exec-2] ERROR o.a.c.c.C.[.[.[.[dispatcherServlet] - Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is feign.RetryableException: Read timed out executing GET http://downstream-service/api/something] with root cause
java.net.SocketTimeoutException: Read timed out
at java.base/sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:283)
...
at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:529)
at feign.Client$Default.convertResponse(Client.java:110)
at feign.Client$Default.execute(Client.java:106)
at org.springframework.cloud.openfeign.loadbalancer.LoadBalancerUtils.executeWithLoadBalancerLifecycleProcessing(LoadBalancerUtils.java:57)
at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.lambda$execute$2(RetryableFeignBlockingLoadBalancerClient.java:168)
at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:329)
at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:225)
at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.execute(RetryableFeignBlockingLoadBalancerClient.java:114)
at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:121)
at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:91)
at feign.ReflectiveFeign$FeignInvocationHandler.invoke(ReflectiveFeign.java:100)
at jdk.proxy2/jdk.proxy2.$Proxy107.getMeSomething(Unknown Source)
at com.oliverlockwood.retrybug.ClientController.hitTheDownstreamService(ClientController.java:20)
...
which looks as it should do, correctly exposing the underlying error.
Comment From: OlgaMaciaszek
Hi @oliverlockwood - thanks for providing the sample and verifying the PR. Makes sense.