The problem/use-case that the feature addresses

We recently had an accident in our production environment where we got the parameter time unit wrong, mistaking milliseconds for seconds.This accident caused the service of the entire cluster to be abnormal. I collected the time-related configurations in redis.conf and found that most configurations use seconds as the unit, but others use milliseconds as the unit.

second examples :

  • timeout
  • tcp-keepalive
  • save
  • repl-diskless-sync-delay
  • repl-timeout

milliseconds examples :

  • lua-time-limit
  • busy-reply-threshold
  • cluster-node-timeout
  • latency-monitor-threshold

minute:

  • lfu-decay-time

Description of the feature

  1. All time configuration parameters in redis.conf support time unit, such as 15000ms, 15s, 1min
  2. A bool type configuration parameter time-configs-fool-proof that requires all time-related configuration to have units (except when the value is 0)

Alternatives you've considered

To be honest, configuration changes are usually made through a web management system, which can avoid most of these problems, but manual configuration changes are still inevitable for maintenance. Configurations of memory are always safe because everyone knows that the default is bytes.

Comment From: oranagra

so that means to support an optional suffix like we have for memory related configs, but also mark each of these with a default unit in case the suffix is missing (for backwards compatibility). in addition it might also mean we wanna support decimal point (like 0.5s), and in any case, it means the actual variables in the server struct needs to be converted to milliseconds (or even microseconds), and all the code that refers to them updated. besides, some configs may not be able to support fragments of seconds or minutes (when they're matched to something with 1 second frequency, or encoded in a certain way, like maybe the LFU decay), so for these we'll at least need to define a minimum value in the config system.

sounds like a relatively big and complicated project compared to the problem it comes to solve.

Comment From: judeng

sounds like a relatively big and complicated project compared to the problem it comes to solve.

@oranagra Thank you for your reply! I basically agree with you. We only need to solve the problem of user mistyping, so it is best to limit the modification to config.c.

IMO, I don't think we need to keep the behavior consistent with the memory related configs We can limit the time units that different parameters can use. For example, cluster-node-timeout, the only valid suffix is ms.

Comment From: oranagra

so you suggest that the server variables remain the same as they are now (each in different unit), and that the config infra knows the units of each of them. it'll allow a suffix to be added but prevent using things that don't make sense, like using ms for a config that nattily uses seconds. we won't support decimal points, and will have a way to force all configs use a suffix.

i.e. lua-time-limit will be set suing 5000, 5000ms, or 5s but repl-timeout won't support using 60000ms, only 60s. or it can support 60000ms, but not 60001ms.

still sounds overly complicated compared to the problem it comes to solve IMHO

Comment From: judeng

IMO it's worth the effort, considering how easily misconfigured timing can lead to accidents. In our accident, the loss was very large. I‘m sure other users will encounter similar problems in the future. It is impossible for users to distinguish the difference between milliseconds and seconds every time.

we just need to implement then fool-proof code, for simplicity, every time-related configuration will have only one legal suffix. i.e. lua-time-limitcould be set 5000, 5000ms, but 5s is not legal

and thanks for the pr you shared, I just need to add a few flags and judgment logic.

#define OCTAL_CONFIG (1<<2)
/*new flag, We need three flags because we don't want to change other parts except the config,and time-related configurations don't need to be converted to values in the same unit */
#define TIME_MILLISECOND_CONFIG (1<<3)
#define TIME_SECOND_CONFIG (1<<4)
#define TIME_MINUTE_CONFIG (1<<5)

createLongLongConfig("cluster-node-timeout", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.cluster_node_timeout, 15000, TIME_MILLISECOND_CONFIG, NULL, NULL),
static int numericParseString(standardConfig *config, sds value, const char **err, long long *res) {
...
    if (config->data.numeric.flags & TIME_MILLISECOND_CONFIG) {
        //if the time unit in the string is milliseconds
    }
...
}

@oranagra you are the expert on config, please point out my mistakes, thank you my friend

Comment From: oranagra

i'm still not convinced redis should head this way, specifically since this will be an optional feature (in order to avoid a breaking change), which will make it less useful. Human error can happen in many places, for example specifying wrong time unit for volatile keys, but obviously also in many other (not time related) places. i'm not convinced this specific one needs to be handled with such care, and as i said, since it won't be the default, i don't think it'll have high value.

Comment From: Jeremy-Run

Adding more comment information to the configuration file should solve this error, right? unless you deliberately misconfigured

Comment From: soloestoy

Interesting, and I'd love to give some other examples about time. Besides the time-related configurations, some other commands have different unit about timeout parameter:

Keyspace related commands:

  • B[L|R]POP timeout's unit is second
  • BLMPOP timeout's unit is second
  • BLMOVE timeout's unit is second
  • BRPOPLPUSH timeout's unit is second
  • BZPOP[MIN|MAX] timeout's unit is second
  • BZMPOP timeout's unit is second
  • XREAD timeout's unit is millisecond
  • XREADGROUP timeout's unit is millisecond

And as I know some users set very small timeout for XREAD like 1 millisecond, because they misunderstand the unit, and it doesn't make sense since clients will be unblocked very soon.

Admin commands:

  • CLIENT PAUSE timeout's unit is millisecond
  • WAIT timeout's unit is millisecond

From users' perspective timeout with unit is better.

Comment From: judeng

Adding more comment information to the configuration file should solve this error, right? unless you deliberately misconfigured

@Jeremy-Run According to my limited observation, an interesting phenomenon is that rookie SRE will read the comment information again and again to avoid such low-level mistakes, only senior SREs will make such low-level mistakes :)

Comment From: judeng

i'm still not convinced redis should head this way, specifically since this will be an optional feature (in order to avoid a breaking change), which will make it less useful.

@oranagra IMO something is better than nothing, my proposal may be too complicated, I'd like to wait for a better solution in our community with patience.

Human error can happen in many places, for example specifying wrong time unit for volatile keys, but obviously also in many other (not time related) places. i'm not convinced this specific one needs to be handled with such care, and as i said, since it won't be the default, i don't think it'll have high value.

I agree with most of your points, this problem is really not unbearable, there are many other ways to avoid accident, such as the audit log for the config command.