Tested on Spring Boot 2.5.5.
I've tried various ways of constructing the uri for reactive WebClient requests, all of them have the same effect - when query parameters are used in the uri, spring boot actuator does not send uri tag in published metrics. For urls with path params metrics work correctly.
Steps to reproduce:
1. clone this repo, import as mvn project: https://github.com/mbanaszkiewicz-vonage/query-param-metric-repro
2. start telegraf container for metrics collection: docker-compose -f docker-compose.telegraf.yml up
. Logs will be available in container's file system /tmp/metrics.out or on console which was used to run the compose command.
3. run com.test.App.java, it will perform 2 requests via reactive WebClient to itself, which will fail with 404s. App logs will show request urls:
- http://localhost:8080/service/endpoint?sampleParam=paramValue
- http://localhost:8080/service/samplePath/endpoint
4. inspect telegraf logs to see the issue, depending on when metric is published you may have to wait a couple of seconds for them to appear. There will be 2 entries of http_client_requests
metric, one that used query param sampleParam=paramValue
does not have the uri tag.
metric for http://localhost:8080/service/endpoint?sampleParam=paramValue
telegraf | http_client_requests,clientName=localhost,cloud=local,environment=local,host=localhost,method=GET,metric_type=timing,outcome=CLIENT_ERROR,status=404 sum=372.653185,upper=372.653185,lower=372.653185,count=1i,90_percentile=372.653185,mean=372.653185,stddev=0 1634712852000000000
metric for http://localhost:8080/service/samplePath/endpoint
telegraf | http_client_requests,clientName=localhost,cloud=local,environment=local,host=localhost,method=GET,metric_type=timing,outcome=CLIENT_ERROR,status=404,uri=/service/{id}/endpoint lower=409.083434,count=1i,90_percentile=409.083434,mean=409.083434,stddev=0,sum=409.083434,upper=409.083434 1634712852000000000
Comment From: wilkinsona
I suspect that this is a problem with the exporting of the metrics to Telegraf over the StatsD protocol. If I expose Boot's /actuator/metrics
endpoint, I can see that both uri
tags have been recorded:
$ http :8080/actuator/metrics/http.client.requests
HTTP/1.1 200 OK
Content-Length: 500
Content-Type: application/vnd.spring-boot.actuator.v3+json
{
"availableTags": [
{
"tag": "method",
"values": [
"GET"
]
},
{
"tag": "clientName",
"values": [
"localhost"
]
},
{
"tag": "uri",
"values": [
"/service/endpoint?sampleParam={paramValue}",
"/service/{id}/endpoint"
]
},
{
"tag": "outcome",
"values": [
"CLIENT_ERROR"
]
},
{
"tag": "status",
"values": [
"404"
]
}
],
"baseUnit": "milliseconds",
"description": "Timer of WebClient operation",
"measurements": [
{
"statistic": "COUNT",
"value": 2.0
},
{
"statistic": "TOTAL_TIME",
"value": 453.630277
},
{
"statistic": "MAX",
"value": 0.0
}
],
"name": "http.client.requests"
}
Comment From: mbanaszkiewicz-vonage
@wilkinsona thanks, shall I close this issue & create a new one for https://github.com/micrometer-metrics/micrometer/ statsd registry or you want to do some additional checking before that?
Comment From: wilkinsona
Micrometer appears to be correctly forming the lines to send to Telegraf. Checking with the debugger, they look something like this:
http_client_requests,clientName=localhost,method=GET,outcome=CLIENT_ERROR,status=404,uri=/service/endpoint?sampleParam={paramValue}:27047.394905|ms
http_client_requests,clientName=localhost,method=GET,outcome=CLIENT_ERROR,status=404,uri=/service/{id}/endpoint:32710.089413|ms
However, Telegraf logs them like this:
telegraf | http_client_requests,clientName=localhost,cloud=local,environment=local,host=localhost,method=GET,metric_type=timing,outcome=CLIENT_ERROR,status=404 stddev=0,sum=27047.394905,upper=27047.394905,lower=27047.394905,count=1i,90_percentile=27047.394905,mean=27047.394905 1634731633000000000
telegraf | http_client_requests,clientName=localhost,cloud=local,environment=local,host=localhost,method=GET,metric_type=timing,outcome=CLIENT_ERROR,status=404,uri=/service/{id}/endpoint stddev=0,sum=32710.089413,upper=32710.089413,lower=32710.089413,count=1i,90_percentile=32710.089413,mean=32710.089413 1634731663000000000
The docs for the line protocol don't seem to prohibit ?
in a tag's value so I'm not quite sure what's happening here. It may be an omission in the documentation and it needs to be escaped/replaced, or it may be a bug in Telegraf.
@mbanaszkiewicz-vonage I think it would be worth raising this with the Micrometer team. Even if it is a bug in Telegraf, they may want to work around it. Thanks very much for taking the time to create such an easy-to-use sample. It made reproducing and diagnosing this much quicker than it otherwise would have been.
Comment From: mbanaszkiewicz-vonage
@wilkinsona thanks a lot for confirming & checking. Will forward this to Micrometer.