Redis Introduce cpu_usage under CLUSTER SLOT-STATS, to query slot level cpu usage for Redis cluster

This issue is created to decouple the cpu_usage implementation from the ongoing discussion of slot-level memory metrics in Introduce slot level metrics to Redis cluster #10472.

What are we introducing

High level changes

cpu_usage will be tracked and exposed under the to-be-implemented CLUSTER SLOT-STATS command.

cpu_usage is derived from the already tracked duration value, which is also used to generate the INFO commandstats section.

With the introduction of cpu_usage, Redis cluster users can identify hot slots, and the hot shards that own them, based on each slot's cpu_usage.

Low level changes

The updated CLUSTER SLOT-STATS response is shown below.

127.0.0.1:6379> CLUSTER SLOT-STATS ORDERBY CPU_USAGE LIMIT 2 DESC
1) (integer) 16381
2) 1) "key_count"
   2) (integer) 2
3) 1) "cpu_usage"
   2) (integer) 1000
4) (integer) 0
5) 1) "key_count"
   2) (integer) 3
6) 1) "cpu_usage"
   2) (integer) 987
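
The ORDERBY / LIMIT / DESC semantics in the example above can be pictured as a top-N selection over per-slot counters. The following is only a rough Python sketch under stated assumptions: `slot_stats` and `top_slots` are hypothetical names, slot 5 is an invented extra entry for illustration, and none of this reflects how the command is actually implemented.

```python
import heapq

def top_slots(slot_stats, metric, limit, descending=True):
    """Return up to `limit` (slot, stats) pairs ordered by `metric`."""
    pick = heapq.nlargest if descending else heapq.nsmallest
    return pick(limit, slot_stats.items(), key=lambda item: item[1][metric])

# Hypothetical per-slot counters. Slots 16381 and 0 mirror the sample
# response; slot 5 is made up to show that LIMIT drops it.
slot_stats = {
    0: {"key_count": 3, "cpu_usage": 987},
    5: {"key_count": 1, "cpu_usage": 12},
    16381: {"key_count": 2, "cpu_usage": 1000},
}

for slot, stats in top_slots(slot_stats, "cpu_usage", limit=2):
    print(slot, stats)
```

Using a heap-based selection avoids fully sorting all slots when LIMIT is small, though a plain sort would work just as well for this purpose.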

Implementation details

The details below are based on the most recent update from the previous thread.

How is it accumulated?

For its initial release, we can leverage CPU time as a proxy unit for CPU utilization. There's already an existing measurement, named duration under call(), which is used to aggregate for an existing counter commandstats. The same value can simply be aggregated under slot level context.
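
As a rough illustration of the idea, the sketch below times a command once and adds the same duration to both a per-command counter and a per-slot counter. It is a Python model, not Redis code: `call`, `command_usec`, and `slot_usec` are hypothetical names, and the microsecond unit is assumed from the usec field of INFO commandstats. The slot mapping follows the Redis Cluster rule of CRC16 (XMODEM) modulo 16384 with hash tag support.

```python
import binascii
import time
from collections import defaultdict

NUM_SLOTS = 16384

def key_slot(key: bytes) -> int:
    """Map a key to its cluster slot, honoring {hash tags}."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end > start + 1:  # only a non-empty tag is hashed on its own
            key = key[start + 1:end]
    # binascii.crc_hqx with an initial value of 0 is CRC16/XMODEM.
    return binascii.crc_hqx(key, 0) % NUM_SLOTS

command_usec = defaultdict(int)  # per-command total, as in commandstats
slot_usec = defaultdict(int)     # hypothetical per-slot total

def call(name, key, handler):
    """Run a command handler and aggregate its duration in both contexts."""
    start = time.monotonic_ns()
    result = handler()
    duration_usec = (time.monotonic_ns() - start) // 1000
    command_usec[name] += duration_usec
    slot_usec[key_slot(key)] += duration_usec
    return result

call("GET", b"foo", lambda: "bar")
print(dict(command_usec), dict(slot_usec))
```

The point of the sketch is that no second timer is needed: the single measured duration is simply added to one more counter, keyed by slot.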

How is it reset?

For its initial release, the accumulated value is reset upon either: 1. a slot ownership change (the slot is either removed or newly added), or 2. the CONFIG RESETSTAT command, which already exists and is documented.
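
The two reset triggers can be modeled as follows. This is a minimal Python sketch with hypothetical names (`SlotCpuStats`, `on_slot_ownership_change`, `reset_all`); it only illustrates the intended behavior and assumes nothing about the real data structures.

```python
class SlotCpuStats:
    """Hypothetical per-slot accumulator with the two reset triggers."""

    def __init__(self):
        self.usec = {}  # slot -> accumulated duration

    def add(self, slot, duration_usec):
        self.usec[slot] = self.usec.get(slot, 0) + duration_usec

    def on_slot_ownership_change(self, slot):
        # Trigger 1: the slot was removed from, or newly added to, this node.
        self.usec.pop(slot, None)

    def reset_all(self):
        # Trigger 2: models the effect of CONFIG RESETSTAT.
        self.usec.clear()

stats = SlotCpuStats()
stats.add(0, 987)
stats.add(16381, 1000)
stats.on_slot_ownership_change(0)  # only slot 0 is cleared
stats.reset_all()                  # everything is cleared
```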

For future iterations, we could use a trailing average as a better alternative to an explicit reset. Even better, we could make the reset mechanism configurable, similar to the maxmemory-policy config.
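
One possible reading of a trailing average is a fixed-size window over periodic samples, so that old activity ages out without any explicit reset. The Python sketch below assumes this interpretation; the class name, the window size, and the sampling interval are all hypothetical and are not part of the proposal.

```python
from collections import deque

class TrailingAverage:
    """Average of the most recent `window` samples for one slot."""

    def __init__(self, window):
        # A bounded deque drops the oldest sample automatically.
        self.samples = deque(maxlen=window)

    def record(self, usec):
        self.samples.append(usec)

    def value(self):
        if not self.samples:
            return 0.0
        return sum(self.samples) / len(self.samples)

avg = TrailingAverage(window=3)
for sample in (100, 200, 300, 400):  # e.g. CPU time per sampling interval
    avg.record(sample)
print(avg.value())  # averages only the last three samples
```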

Comment From: madolson

> For its initial release, we can leverage CPU time as a proxy unit for CPU utilization. There's already an existing measurement, named duration under call(), which is used to aggregate for an existing counter commandstats. The same value can simply be aggregated under slot level context.

'Initial release' implies that some future release might change how CPU usage is measured, and I'm not sure we should do that. I'm okay with the cpu usage permanently being the same as the cpu usage indicated in cluster metrics.

Comment From: kyle-yh-kim

Understood. I'd like to tackle this from a different angle: the naming convention.

Instead of using cpu_usage, which encapsulates broad utilization of the CPU, perhaps it is better to rename it to cpu_time.

This way, we achieve the following: 1. Extensibility towards future CPU metrics, since new metrics can be created under the cpu_{insert_your_metric_name} namespace without backward compatibility concerns. 2. No ambiguity about what cpu usage is measured by.