Redis [BUG] repl-timeout being reached unnecessarily when master behind Kubernetes proxy

The repl-timeout value is used as the timeout for the connection between master and replicas as well as for the replication bulk transfer. For connecting, the default of 60 seconds is very high, and I would set it lower to avoid waiting out the full timeout when a socket refuses to die. The operation will be retried anyway, so there is no need to wait that long. On most Linux distributions the TCP timeout will be 130 seconds, meaning a Redis instance with a higher repl-timeout could be locked up for up to that interval.

I faced this issue while testing Redis replicas attempting to SYNC with masters over Kubernetes ClusterIPs that temporarily had no endpoints (for example, when there is no master on a node for just a few seconds). In this situation we see a log like this in the replica:

1:S 01 May 2022 19:19:09.964 * Connecting to MASTER myredis-master.default.svc.cluster.local:6379
1:S 01 May 2022 19:19:09.968 * MASTER <-> REPLICA sync started
1:S 01 May 2022 19:20:10.248 # Timeout connecting to the MASTER...
1:S 01 May 2022 19:20:10.248 * Reconnecting to MASTER myredis-master.default.svc.cluster.local:6379 after failure
1:S 01 May 2022 19:20:10.253 * MASTER <-> REPLICA sync started
1:S 01 May 2022 19:20:10.253 * Non blocking connect for SYNC fired the event.
1:S 01 May 2022 19:20:10.254 * Master replied to PING, replication can continue...

The interval between MASTER <-> REPLICA sync started and Timeout connecting to the MASTER... is basically the interval set in repl-timeout. When reduced to a few seconds I would get a fast successful reconnection.
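To make that interval concrete, here is a minimal sketch that measures it from the two relevant log lines quoted above. The helper names `parse_ts` and `connect_stalls` are made up for this illustration and are not part of Redis or any library.

```python
from datetime import datetime

# Two lines copied from the replica log above.
LOG = """\
1:S 01 May 2022 19:19:09.968 * MASTER <-> REPLICA sync started
1:S 01 May 2022 19:20:10.248 # Timeout connecting to the MASTER...
"""

def parse_ts(line):
    # Fields 1..4 of each log line are "DD Mon YYYY HH:MM:SS.mmm".
    parts = line.split()
    return datetime.strptime(" ".join(parts[1:5]), "%d %b %Y %H:%M:%S.%f")

def connect_stalls(log):
    """Seconds between each 'sync started' and the connect timeout that follows it."""
    stalls, started = [], None
    for line in log.splitlines():
        if "MASTER <-> REPLICA sync started" in line:
            started = parse_ts(line)
        elif "Timeout connecting to the MASTER" in line and started is not None:
            stalls.append((parse_ts(line) - started).total_seconds())
            started = None
    return stalls

print(connect_stalls(LOG))  # one stall of roughly 60 seconds
```

Run against a longer log, the same function gives one value per timed-out attempt, which makes it easy to compare against the configured repl-timeout.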

Possible improvement

Investigate whether there is an issue on the Redis side and whether the value of 60s is still relevant or is just there for historical reasons.

Make clear why the value is as high as 60 seconds and whether it can be set much lower for establishing the connection between master and replica without harming, for example, the replication bulk transfer behavior.

From redis.conf documentation:

# The following option sets the replication timeout for:
#
# 1) Bulk transfer I/O during SYNC, from the point of view of replica.
# 2) Master timeout from the point of view of replicas (data, pings).
# 3) Replica timeout from the point of view of masters (REPLCONF ACK pings).

Comment From: oranagra

repl-timeout tends to be set to higher values as dataset grows to allow the bulk transfer to succeed.

Why is that? i'm not aware of such practice. What was the master doing during that time? Maybe there's a bug we're overlooking..

in full disk-based replication, it should have been sending newlines while it prepares the rdb file (grep for is_presync).
in full diskless, we should be locked inside rdb parsing so we're not reaching replicationCron and not checking that timeout.
in diskless master and disk-based replica (current default), we update repl_transfer_lastio just before writing any buffer we read into the temp rdb file (grep for "write(server.repl_transfer_fd").
in disk-based master and diskless replica we do what i mentioned in [1] (very top of readSyncBulkPayload).

Comment From: eduardobr

@oranagra I'm not sure how obsolete it is, but: https://redis.com/blog/top-redis-headaches-for-devops-replication-timeouts/

Comment From: oranagra

i'm not sure why it was written, it could be outdated, or a misunderstanding. let's look at the facts, not docs.

you did show a log so i assume you reproduced some problem, let's focus on that.. can you tell me which (diskless) configuration you had, how you reproduced it, and what was the master doing at that time?

Comment From: eduardobr

@oranagra I tried all possible combinations of repl-diskless-sync yes|no + repl-diskless-load swapdb|disabled and I get the Timeout connecting to the MASTER... in all of them in a given setup.

It's not quick to reproduce because we need to put master behind a Kubernetes network service (kube-proxy):
- Create a Kubernetes cluster
- Define 2 pods: 1 redis master and 1 replica connected
- Create a network service for master of type ClusterIP
- Connect replica to the IP of the ClusterIP

When the replica is connected to the master directly (through a headless service), we have no issue with a socket that refuses to die, and the replica keeps retrying the connection every second. Reconnection is faster in this case.

The issue arises when you connect to a ClusterIP, which I suspect doesn't drop the connection if no endpoint is available. Even if an endpoint becomes available after the connection attempt has started, the attempt still times out on the Redis side. It makes sense that it needs to retry to find the new endpoint, but it shouldn't be stuck until the timeout.
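The difference between a connection that is dropped right away and one that hangs can be observed with a small probe. This is only a sketch using Python's standard `socket` module, not how Redis itself connects; `probe_master` and its return values are names invented for this example.

```python
import socket

def probe_master(host, port, connect_timeout):
    """Classify a single TCP connection attempt.

    Returns 'connected', 'refused' (failed immediately, so a retry can
    happen right away), 'timeout' (nothing answered within
    connect_timeout seconds) or 'unreachable' (any other socket error).
    """
    try:
        with socket.create_connection((host, port), timeout=connect_timeout):
            return "connected"
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        return "unreachable"
```

Calling something like `probe_master("myredis-master.default.svc.cluster.local", 6379, 5)` from inside the replica pod while the service has no endpoints would show which of the cases a given cluster falls into.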

For completeness, if I simply get rid of master and leave proxy without endpoints I get a sequence of timeouts (repl-timeout=15 here):

1:S 02 May 2022 19:17:07.571 * Connecting to MASTER myredis-master.default.svc.cluster.local:6379
1:S 02 May 2022 19:17:07.584 * MASTER <-> REPLICA sync started
1:S 02 May 2022 19:17:23.651 # Timeout connecting to the MASTER...
1:S 02 May 2022 19:17:23.651 * Reconnecting to MASTER myredis-master.default.svc.cluster.local:6379 after failure
1:S 02 May 2022 19:17:23.662 * MASTER <-> REPLICA sync started
1:S 02 May 2022 19:17:39.734 # Timeout connecting to the MASTER...
1:S 02 May 2022 19:17:39.734 * Reconnecting to MASTER myredis-master.default.svc.cluster.local:6379 after failure
1:S 02 May 2022 19:17:39.760 * MASTER <-> REPLICA sync started
1:S 02 May 2022 19:17:55.836 # Timeout connecting to the MASTER...
1:S 02 May 2022 19:17:55.837 * Reconnecting to MASTER myredis-master.default.svc.cluster.local:6379 after failure
1:S 02 May 2022 19:17:55.848 * MASTER <-> REPLICA sync started
1:S 02 May 2022 19:17:55.859 # Error condition on socket for SYNC: Network is unreachable
1:S 02 May 2022 19:17:56.853 * Connecting to MASTER myredis-master.default.svc.cluster.local:6379
1:S 02 May 2022 19:17:56.863 * MASTER <-> REPLICA sync started
1:S 02 May 2022 19:18:12.926 # Timeout connecting to the MASTER...
1:S 02 May 2022 19:18:12.926 * Reconnecting to MASTER myredis-master.default.svc.cluster.local:6379 after failure
1:S 02 May 2022 19:18:12.940 * MASTER <-> REPLICA sync started
1:S 02 May 2022 19:18:15.992 # Error condition on socket for SYNC: Network is unreachable
1:S 02 May 2022 19:18:16.956 * Connecting to MASTER myredis-master.default.svc.cluster.local:6379
1:S 02 May 2022 19:18:16.969 * MASTER <-> REPLICA sync started
1:S 02 May 2022 19:18:32.029 # Timeout connecting to the MASTER...
1:S 02 May 2022 19:18:32.030 * Reconnecting to MASTER myredis-master.default.svc.cluster.local:6379 after failure
1:S 02 May 2022 19:18:32.092 * MASTER <-> REPLICA sync started
1:S 02 May 2022 19:18:48.159 # Timeout connecting to the MASTER...
1:S 02 May 2022 19:18:48.159 * Reconnecting to MASTER myredis-master.default.svc.cluster.local:6379 after failure
1:S 02 May 2022 19:18:48.180 * MASTER <-> REPLICA sync started
1:S 02 May 2022 19:19:04.249 # Timeout connecting to the MASTER...
1:S 02 May 2022 19:19:04.250 * Reconnecting to MASTER myredis-master.default.svc.cluster.local:6379 after failure
1:S 02 May 2022 19:19:04.259 * MASTER <-> REPLICA sync started
1:S 02 May 2022 19:19:20.328 # Timeout connecting to the MASTER...
1:S 02 May 2022 19:19:20.328 * Reconnecting to MASTER myredis-master.default.svc.cluster.local:6379 after failure
1:S 02 May 2022 19:19:20.340 * MASTER <-> REPLICA sync started
1:S 02 May 2022 19:19:35.800 # Error condition on socket for SYNC: Network is unreachable
1:S 02 May 2022 19:19:36.409 * Connecting to MASTER myredis-master.default.svc.cluster.local:6379
1:S 02 May 2022 19:19:36.502 * MASTER <-> REPLICA sync started
1:S 02 May 2022 19:19:36.512 # Error condition on socket for SYNC: Network is unreachable
1:S 02 May 2022 19:19:37.506 * Connecting to MASTER myredis-master.default.svc.cluster.local:6379
1:S 02 May 2022 19:19:37.520 * MASTER <-> REPLICA sync started
1:S 02 May 2022 19:19:53.601 # Timeout connecting to the MASTER...
1:S 02 May 2022 19:19:53.601 * Reconnecting to MASTER myredis-master.default.svc.cluster.local:6379 after failure
1:S 02 May 2022 19:19:53.611 * MASTER <-> REPLICA sync started

As I mentioned, if I set a very high repl-timeout value in Redis, the connection times out after 130 seconds on my machine, which I found to be a common TCP timeout interval on Linux.
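One plausible source of a figure near 130 seconds is the kernel's SYN retransmission schedule. The sketch below assumes the usual Linux defaults of a 1-second initial retransmission timeout and `net.ipv4.tcp_syn_retries = 6`; these values are assumptions and were not measured in this thread, and the function name is invented for illustration.

```python
def syn_connect_timeout(initial_rto=1.0, syn_retries=6):
    """Seconds until connect() gives up when no SYN is ever answered.

    The first SYN waits initial_rto seconds, and each of the
    syn_retries retransmissions waits twice as long as the previous one.
    """
    return sum(initial_rto * 2 ** attempt for attempt in range(syn_retries + 1))

# 1 + 2 + 4 + 8 + 16 + 32 + 64
print(syn_connect_timeout())  # 127.0
```

Under those assumptions the connect attempt fails after 127 seconds, which is close to the 130 seconds observed here, but the thread does not confirm that this is the actual cause.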

As for repl-timeout for transfers, I've spent a good amount of time trying to reproduce the problem, even in older Redis 2.x, and it seems that the article is really misleading. I wonder how it ended up being published on redis.com.

Now I want to understand why the default repl-timeout value is so high, because I could just use very low values to solve this issue if that doesn't cause harm in other areas (like the bulk transfer).

(Editing description to be more precise)

Comment From: oranagra

So correct me if I'm wrong, the long timeout doesn't happen because of a big dataset, but rather due to a broken configuration (it's not a timing issue, this connection will time out even if we give it a week).

I don't know anything about the history of this, or why it is set to a default of 60. A value of 6 sounds just as good to me.

@yossigo maybe you can shed some light on that.

Comment From: eduardobr

@oranagra Exactly, the whole scenario here has always been about establishing connection from replica to master. The story about dataset size was from the misleading page (which made me fear lowering it).

Comment From: yossigo

@oranagra I think that page is misleading indeed. Apparently, in very old versions masters did not send empty lines, so replicas would drop if repl-timeout was shorter than the time it takes to BGSAVE, but that has been irrelevant for a decade.

Comment From: eduardobr

I've made extra tests in different environments (different OS and Kubernetes distributions) and found 3 behaviors:
- Socket takes 130 seconds to die from OS/proxy side (Ubuntu and Raspberry OS, K3s), times out on repl-timeout=60
- Socket takes 20 seconds to die from OS/proxy side (Docker Kubernetes on Windows), logically times out before repl-timeout=60
- Socket is well behaved and redis keeps retrying every 1 second (Azure Kubernetes Service)
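The first two behaviors follow one rule: a stuck attempt ends at whichever limit is reached first, repl-timeout or the time the OS/proxy takes to kill the socket. A minimal sketch of that rule, using the numbers reported above (the function name `attempt_duration` is invented for this example):

```python
def attempt_duration(repl_timeout, socket_death):
    """Seconds one stuck connection attempt lasts before the replica retries.

    socket_death is how long the OS/proxy takes to fail the socket on its
    own; whichever limit is reached first ends the attempt.
    """
    return min(repl_timeout, socket_death)

print(attempt_duration(60, 130))  # 60: repl-timeout fires first
print(attempt_duration(60, 20))   # 20: the socket dies first
print(attempt_duration(15, 130))  # 15: a lower repl-timeout shortens the stall
```

In the third environment the socket fails immediately, so neither limit comes into play and the replica simply retries every second.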

I would conclude it's not critical but that we need to revisit the default value. I'd rather use a number like 15 in my setups (just to be higher than repl-ping-replica-period=10):

# It is important to make sure that this value is greater than the value
# specified for repl-ping-replica-period otherwise a timeout will be detected
# every time there is low traffic between the master and the replica. The default
# value is 60 seconds.
#
# repl-timeout 60
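The constraint in that comment can be checked mechanically before lowering the value. The sketch below is a hypothetical helper, not a Redis tool: it reads redis.conf-style text and falls back to the defaults mentioned in this thread (repl-timeout 60, repl-ping-replica-period 10) when a directive is absent.

```python
def parse_conf(text):
    """Collect 'directive value' pairs, skipping blank lines and comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        settings[key] = value.strip()
    return settings

def repl_timeout_is_safe(text, default_timeout=60, default_ping=10):
    """True when repl-timeout is greater than repl-ping-replica-period."""
    conf = parse_conf(text)
    timeout = int(conf.get("repl-timeout", default_timeout))
    ping = int(conf.get("repl-ping-replica-period", default_ping))
    return timeout > ping

print(repl_timeout_is_safe("repl-timeout 15"))  # True: 15 > default ping period of 10
print(repl_timeout_is_safe("repl-timeout 5"))   # False: lower than the ping period
```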

Now the question is whether the restriction related to repl-ping-replica-period is relevant for all cases where repl-timeout is used in the code. If not, it sounds like the repl-timeout value could stop being used in a few situations in the Redis code and be replaced by something like 1s?

Comment From: oranagra

There could be other situations where the server hangs for a moment and doesn't send anything to the replica. Two examples:
1. A very slow command or script blocks the main thread for a long time, so it doesn't send a ping or a newline.
2. During diskless serialization, something (compression or a module) takes a long time to process some data before sending any bytes.

Comment From: eduardobr

Closing this issue as it can be fixed outside redis with proper configuration