Hello Spring Cloud Team,
I wanted to raise an issue here on what seems to be a bug.
Setup: Spring Boot 2.5.1 + Ilford 2020.0.3 + Spring Cloud Config Server (the issue reproduces 100% of the time, even without a Spring Cloud Config Client) + Vault server backend to protect the secrets + Actuator + Spring Boot Admin.
Issue, actual: On each call to the /health endpoint (and other actuator endpoints), Spring Cloud Config Server makes (unnecessary) calls to the Vault server.
Expected: I think Spring Cloud Config Server should not make calls to the Vault server for /health and other non-configuration endpoints.
The call to Vault should only happen when a Spring Cloud Config Client registers itself with the server to retrieve its config/secrets, or when a /refresh endpoint is invoked, not on every request.
Details: We noticed this issue when our Vault instance was brought down. During the investigation, we observed more than 2000 requests per minute against the Vault instance. We believe that every time /health is called against Spring Cloud Config Server (by Kubernetes health probes and by other apps), and every time a Spring Cloud Config Client reports itself to Spring Boot Admin, Spring Cloud Config Server unnecessarily floods the Vault server.
Logs:
2021-06-20 13:23:10.731 DEBUG 3518 --- [nio-8989-exec-6] o.a.coyote.http11.Http11InputBuffer : Received [GET /health HTTP/1.1
2021-06-20 13:23:10.731 DEBUG 3518 --- [nio-8989-exec-6] o.a.c.authenticator.AuthenticatorBase : Security checking request GET /health
2021-06-20 13:23:10.731 DEBUG 3518 --- [nio-8989-exec-6] org.apache.catalina.realm.RealmBase : No applicable constraints defined
2021-06-20 13:23:10.731 DEBUG 3518 --- [nio-8989-exec-6] o.a.c.authenticator.AuthenticatorBase : Not subject to any constraint
2021-06-20 13:23:10.731 DEBUG 3518 --- [nio-8989-exec-6] org.apache.tomcat.util.http.Parameters : Set encoding to UTF-8
2021-06-20 13:23:10.731 DEBUG 3518 --- [nio-8989-exec-6] o.s.web.servlet.DispatcherServlet : GET "/health", parameters={}
2021-06-20 13:23:10.732 DEBUG 3518 --- [nio-8989-exec-6] s.b.a.e.w.s.WebMvcEndpointHandlerMapping : Mapped to Actuator web endpoint 'health'
2021-06-20 13:23:10.733 DEBUG 3518 --- [nio-8989-exec-6] o.s.web.client.RestTemplate : HTTP GET https://vault.com:443/path/local,vault/data/app
2021-06-20 13:23:10.733 DEBUG 3518 --- [nio-8989-exec-6] o.s.web.client.RestTemplate : Accept=[application/json, application/*+json
2021-06-20 13:23:10.733 DEBUG 3518 --- [nio-8989-exec-6] h.i.c.PoolingHttpClientConnectionManager : Connection request: [route: {s}->https://vault.com:443][total available: 1; route allocated: 1 of 2; total allocated: 1 of 20]
2021-06-20 13:23:10.735 DEBUG 3518 --- [nio-8989-exec-6] o.a.http.impl.execchain.MainClientExec : Executing request GET /path/local,vault/data/app HTTP/1.1
2021-06-20 13:23:10.735 DEBUG 3518 --- [nio-8989-exec-6] o.a.http.impl.execchain.MainClientExec : Target auth state: UNCHALLENGED
2021-06-20 13:23:10.735 DEBUG 3518 --- [nio-8989-exec-6] o.a.http.impl.execchain.MainClientExec : Proxy auth state: UNCHALLENGED
2021-06-20 13:23:10.735 DEBUG 3518 --- [nio-8989-exec-6] org.apache.http.headers : http-outgoing-0 >> GET /path/local,vault/data/app HTTP/1.1
2021-06-20 13:23:10.735 DEBUG 3518 --- [nio-8989-exec-6] org.apache.http.headers : http-outgoing-0 >> Accept: application/json, application/*+json
2021-06-20 13:23:10.735 DEBUG 3518 --- [nio-8989-exec-6] org.apache.http.headers : http-outgoing-0 >> X-Vault-Token: token
2021-06-20 13:23:10.735 DEBUG 3518 --- [nio-8989-exec-6] org.apache.http.headers : http-outgoing-0 >> Host: vault.com:443
2021-06-20 13:23:10.735 DEBUG 3518 --- [nio-8989-exec-6] org.apache.http.headers : http-outgoing-0 >> Connection: Keep-Alive
2021-06-20 13:23:10.735 DEBUG 3518 --- [nio-8989-exec-6] org.apache.http.headers : http-outgoing-0 >> User-Agent: Apache-HttpClient/4.5.13 (Java/11.0.6)
2021-06-20 13:23:10.735 DEBUG 3518 --- [nio-8989-exec-6] org.apache.http.headers : http-outgoing-0 >> Accept-Encoding: gzip,deflate
2021-06-20 13:23:10.735 DEBUG 3518 --- [nio-8989-exec-6] org.apache.http.wire : http-outgoing-0 >> "GET /path/local,vault/data/app HTTP/1.1[\r][\n]"
2021-06-20 13:23:10.735 DEBUG 3518 --- [nio-8989-exec-6] org.apache.http.wire : http-outgoing-0 >> "Accept: application/json, application/*+json[\r][\n]"
2021-06-20 13:23:10.735 DEBUG 3518 --- [nio-8989-exec-6] org.apache.http.wire : http-outgoing-0 >> "X-Vault-Token: token[\r][\n]"
2021-06-20 13:23:10.735 DEBUG 3518 --- [nio-8989-exec-6] org.apache.http.wire : http-outgoing-0 >> "Host: vault.com:443[\r][\n]"
2021-06-20 13:23:10.735 DEBUG 3518 --- [nio-8989-exec-6] org.apache.http.wire : http-outgoing-0 >> "Connection: Keep-Alive[\r][\n]"
2021-06-20 13:23:10.903 DEBUG 3518 --- [nio-8989-exec-6] o.s.web.client.RestTemplate : HTTP GET https://vault.com:443/path/local,vault/data/application
Reproducible project: Here is a link to a simple project (only 4 files) that reproduces the issue 100% of the time.
https://github.com/patpatpat123/vaulconfigserverissue
Spring Cloud Config Server Team, could you please help check why the server makes all those calls to Vault, especially upon /health or /instances invocation? Is there a workaround to reduce these unnecessary calls?
Thank you
Comment From: ryanjbaxter
I don't think I see a bug here. The config server is only healthy if it can reach the backend repository so it needs to check that it can do so.
You can reduce the calls by configuring the health check in k8s.
If you want you can disable the health indicator by setting management.health.config.enabled=false
Comment From: patpatpat123
Hello Ryan,
Thank you for the response.
It is not only the /health endpoint. For many of the actuator endpoints, it also makes HTTP requests to Vault.
I say "requests" in the plural because it is not a single HTTP request to Vault but many: it sends requests to the /app endpoint, the /application endpoint of Vault, and a few more.
Moreover, this also happens for all the Spring Boot Admin client apps: when they "report" themselves to the server, they invoke the /instances endpoint, which also triggers the calls to Vault.
And this scales with the number of apps registered, not just with the Kubernetes liveness probe. If I have 1 client app, it periodically sends requests to Vault; if I have 10 clients, that is multiplied by 10, etc.
Comment From: ryanjbaxter
The only thing this repo has any control over is the health indicator for the config server and client.
As you can see, we are just calling the findOne method on the EnvironmentRepository
https://github.com/spring-cloud/spring-cloud-config/blob/e645c802157caa88a8ed50ecb9067335f0c7522f/spring-cloud-config-server/src/main/java/org/springframework/cloud/config/server/config/ConfigServerHealthIndicator.java#L72
For Vault this will eventually make this call https://github.com/spring-cloud/spring-cloud-config/blob/e645c802157caa88a8ed50ecb9067335f0c7522f/spring-cloud-config-server/src/main/java/org/springframework/cloud/config/server/environment/VaultEnvironmentRepository.java#L103
Again, outside of the health indicator we have no control over what happens. If there are going to be multiple requests to /health on the config server, you will have to take that into account when considering the requests made to your Vault backend.
We have a refreshRate property when using Git that has helped people concerned about requests made to the /health endpoint using that backend, but as far as I know Spring Cloud Vault does not have something similar. @mp911de, correct me if I am wrong here
Comment From: patpatpat123
Thank you Ryan for the explanations.
I believe there might be a pattern here, where Spring Cloud Config Clients, as they scale up in the number of different services, or number of instances of one same service, put pressure on Spring Cloud Config Server /health, and indirectly, on its back end.
While Spring Cloud Config Server is a microservice and can itself scale, or use internal features to handle the load and pressure, that is not always the case for the back end storage.
May I ask if it is possible to consider a mechanism to control the load and pressure on the back end storage, agnostic of the actual implementation, i.e. a solution that works for Git, Vault, MySQL, a file, etc.?
Thank you for your consideration, and looking forward to reading your comment, as well as @mp911de's opinion.
Comment From: mp911de
Working backward, if an application experiences a high load on the /health endpoint, this isn't something Spring can cater for, but rather the rate at which these endpoints are hit should be reduced.
"I believe there might be a pattern here, where Spring Cloud Config Clients, as they scale up in the number of different services, or number of instances of one same service, put pressure on Spring Cloud Config Server /health, and indirectly, on its back end."
This is generally true for anything involved in the health check.
"I don't think I see a bug here. The config server is only healthy if it can reach the backend repository so it needs to check that it can do so."
I agree with Ryan's perspective, as Vault is an online backend in contrast to Git which has offline storage. However, depending on what we want to achieve, it makes sense to revisit the current arrangement.
Right now, the health check reads secrets from Vault to perform the health check, which isn't ideal. First, the Vault audit log gets populated with each health check. Second, reading from Vault involves potential interaction with Vault backend resources (database, AWS S3, …).
By the way the integration is designed, Vault data is fetched eagerly. Having a means to express that the health check should only target the Vault status endpoint would be the ideal approach.
Applying refreshRate leads to eventual consistency, as outcomes would be cached and the health check would no longer report the current state but rather some sort of cached view.
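To make the trade-off concrete, here is a minimal, self-contained sketch of a time-based cache around a backend check. It is not the actual refreshRate implementation; `CachedHealthCheck`, its injectable clock, and the `backendCheck` supplier are hypothetical names used only for illustration.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

/** Hypothetical wrapper that reuses the last backend result for ttlMillis. */
final class CachedHealthCheck {
    private final BooleanSupplier backendCheck; // stands in for a call that reaches the backend
    private final LongSupplier clockMillis;     // injectable clock, so the behavior is easy to test
    private final long ttlMillis;
    private boolean hasValue;
    private boolean lastResult;
    private long lastCheckedAt;

    CachedHealthCheck(BooleanSupplier backendCheck, LongSupplier clockMillis, long ttlMillis) {
        this.backendCheck = backendCheck;
        this.clockMillis = clockMillis;
        this.ttlMillis = ttlMillis;
    }

    /** Calls the backend only when no result is cached or the cached one has expired. */
    synchronized boolean isHealthy() {
        long now = clockMillis.getAsLong();
        if (!hasValue || now - lastCheckedAt >= ttlMillis) {
            lastResult = backendCheck.getAsBoolean();
            lastCheckedAt = now;
            hasValue = true;
        }
        return lastResult;
    }
}
```

Within the TTL window the wrapper keeps answering with the last observed result, so a backend that went down a moment ago is still reported as healthy until the entry expires. That is the "cached view" described above.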
"The config server is only healthy if it can reach the backend repository"
How about refining "healthy if it can reach the backend repository" in the sense of ensuring that Config Server uses the most lightweight approach to ensure reachability?
Comment From: patpatpat123
Thank you Mark for your addition here.
I totally agree with your overall analysis, especially the part where the health check is actually reading all secrets.
Maybe an approach where the health endpoint is really just checking Vault's health (which makes sense) but not reading everything?
And maybe an approach where we can control the load on the backend storage, without skew from caches? Something like "for every 10 health checks to Spring Cloud Config Server (configurable), 1 health check to the back end storage, no cache".
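A rough sketch of that "1 in N" idea, only to illustrate it: `SampledBackendProbe` and its parameters are hypothetical and not part of Spring Cloud Config. To stay cache-free, calls that are not sampled return an empty result ("not checked") instead of replaying an earlier answer.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

/** Hypothetical gate: forwards only every Nth health check to the backend. */
final class SampledBackendProbe {
    private final BooleanSupplier backendCheck; // stands in for the real backend call
    private final int every;                    // the configurable N
    private final AtomicLong counter = new AtomicLong();

    SampledBackendProbe(BooleanSupplier backendCheck, int every) {
        if (every < 1) {
            throw new IllegalArgumentException("every must be >= 1");
        }
        this.backendCheck = backendCheck;
        this.every = every;
    }

    /** Returns the backend result on sampled calls, or empty when the call was skipped. */
    Optional<Boolean> probe() {
        long n = counter.getAndIncrement();
        if (n % every != 0) {
            return Optional.empty(); // skipped: no backend call and no cached answer
        }
        return Optional.of(backendCheck.getAsBoolean());
    }
}
```

With `every = 10`, the first call and every tenth call after it reach the backend. How a skipped call should be reported by /health (for example as "unknown" rather than "up") is an open design question this sketch does not settle.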
Again, thank you both for considering this post.
Comment From: ryanjbaxter
Thanks @mp911de for your insight on this.
@patpatpat123 one thing we did not talk about is configuring the TTL for the health endpoint. Configuring this might help: https://docs.spring.io/spring-boot/docs/current/reference/html/actuator.html#actuator.endpoints.caching
After reading Mark's response I had a thought of maybe adding a health method to the EnvironmentRepository interface; that way, if the backend repository has the concept of a health endpoint, we can query that. By default in the EnvironmentRepository interface we can have the implementation call the findOne method like the health endpoint does today. However, if the specific implementation has a concept of health that we can query, we could leverage that instead. I see for Vault there is one: https://www.vaultproject.io/api/system/health
(@mp911de not sure if Spring (Cloud) Vault has a way of querying that)
@spencergibb not sure if you have any thoughts here.
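A self-contained sketch of that idea, under stated assumptions: the types below are simplified stand-ins rather than the real Spring Cloud Config classes, and the `health` method, `FileLikeRepository`, and `VaultLikeRepository` are hypothetical.

```java
/** Minimal stand-in for the config server's Environment type (illustration only). */
final class Environment {
    final String name;

    Environment(String name) {
        this.name = name;
    }
}

/** Simplified stand-in for EnvironmentRepository; health() is the hypothetical addition. */
interface EnvironmentRepository {
    Environment findOne(String application, String profile, String label);

    /** Default: behave like today's health check and do a full read via findOne. */
    default boolean health(String application, String profile, String label) {
        try {
            return findOne(application, profile, label) != null;
        } catch (RuntimeException ex) {
            return false;
        }
    }
}

/** Hypothetical backend without a status endpoint: inherits the default. */
final class FileLikeRepository implements EnvironmentRepository {
    int reads;

    @Override
    public Environment findOne(String application, String profile, String label) {
        reads++;
        return new Environment(application);
    }
}

/** Hypothetical Vault-like backend: overrides health() with a cheap status probe. */
final class VaultLikeRepository implements EnvironmentRepository {
    int secretReads;
    int statusProbes;

    @Override
    public Environment findOne(String application, String profile, String label) {
        secretReads++;
        return new Environment(application);
    }

    @Override
    public boolean health(String application, String profile, String label) {
        statusProbes++; // stands in for a call to the backend's status endpoint
        return true;
    }
}
```

Existing repositories would keep their current behavior through the default method, while a backend with its own notion of health could answer without reading any secrets.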
Comment From: mp911de
Spring Vault provides health check methods, see VaultSysOperations.health().
Comment From: spencergibb
I think that this makes sense. It would require some new API, since the health check simply exercises the repository API.
Maybe an interface an environment repository could implement and the health indicator could check for.
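One way this opt-in variant could look, as a sketch only: `BackendHealthAware`, `RepositoryHealthCheck`, and `StatusAwareRepository` are hypothetical names, `EnvironmentRepository` is reduced to a stand-in that returns a plain String, and the "app"/"default" arguments in the fallback are placeholders.

```java
/** Simplified stand-in for EnvironmentRepository (returns a plain String for brevity). */
interface EnvironmentRepository {
    String findOne(String application, String profile, String label);
}

/** Hypothetical opt-in interface for backends that offer a cheap reachability probe. */
interface BackendHealthAware {
    boolean isBackendReachable();
}

/** Sketch of a health indicator that prefers the cheap probe when one is offered. */
final class RepositoryHealthCheck {
    private final EnvironmentRepository repository;

    RepositoryHealthCheck(EnvironmentRepository repository) {
        this.repository = repository;
    }

    boolean isUp() {
        if (repository instanceof BackendHealthAware) {
            return ((BackendHealthAware) repository).isBackendReachable();
        }
        try {
            // Fallback: exercise the repository API, as the health check does today.
            return repository.findOne("app", "default", null) != null;
        } catch (RuntimeException ex) {
            return false;
        }
    }
}

/** Hypothetical repository that opts in to the lightweight probe. */
final class StatusAwareRepository implements EnvironmentRepository, BackendHealthAware {
    int reads;
    int probes;

    @Override
    public String findOne(String application, String profile, String label) {
        reads++;
        return "env:" + application;
    }

    @Override
    public boolean isBackendReachable() {
        probes++; // stands in for a call to the backend's status endpoint
        return true;
    }
}
```

Compared with adding a method to the repository interface itself, this keeps the existing interface untouched and lets the indicator fall back to findOne for repositories that do not opt in.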
Comment From: ryanjbaxter
@spencergibb that is exactly what I was thinking
Comment From: ryanjbaxter
@spencergibb I'm thinking more about this enhancement, and the one stumbling block would be what is returned from this new health check API.
Today we get the Environment and return information about the application name, profile, labels... Those are not a big deal, but we also return information about the PropertySources. That information we wouldn't have unless we got the Environment, and I think that would completely defeat the purpose of a "lighter weight" health check.
Changing the values of what's returned could break someone, so should we wait for Jubilee?