Redis Version: 7.2.3
OS: Rocky Linux release 8.8
Redis Mode: Standalone
Configuration: 5 nodes with Sentinel running on each
The last changes we made were upgrading from 6.2.6 to 7.2.3 and increasing the number of keys from 2 million to 4.5 million. Since then we are continuously seeing the master node go down and become a slave, with the following in the sentinel logs:
1261695:X 20 Nov 2023 14:54:01.897 # +odown master mymaster x.x.x.156 6379 #quorum 5/2
1261695:X 20 Nov 2023 14:54:29.191 # -sdown master mymaster x.x.x.156 6379
1261695:X 20 Nov 2023 14:54:29.191 # -odown master mymaster x.x.x.156 6379
1261695:X 20 Nov 2023 15:15:19.744 # +sdown master mymaster x.x.x.156 6379
1261695:X 20 Nov 2023 15:15:19.816 # +odown master mymaster x.x.x.156 6379 #quorum 5/2
1261695:X 20 Nov 2023 16:39:01.819 # +sdown master mymaster x.x.x.158 6379
1261695:X 20 Nov 2023 16:39:01.911 # +odown master mymaster x.x.x.158 6379 #quorum 2/2
1261695:X 20 Nov 2023 16:39:03.979 # -sdown master mymaster x.x.x.158 6379
1261695:X 20 Nov 2023 16:39:03.980 # -odown master mymaster x.x.x.158 6379
1261695:X 20 Nov 2023 16:39:35.747 # +sdown master mymaster x.x.x.158 6379
1261695:X 20 Nov 2023 17:45:07.285 # +sdown master mymaster x.x.x.156 6379
1261695:X 20 Nov 2023 17:45:07.343 # +odown master mymaster x.x.x.156 6379 #quorum 4/2
1261695:X 20 Nov 2023 17:45:07.892 # -sdown master mymaster x.x.x.156 6379
1261695:X 20 Nov 2023 17:45:07.893 # -odown master mymaster x.x.x.156 6379
1261695:X 20 Nov 2023 17:45:52.979 # +sdown master mymaster x.x.x.156 6379
1261695:X 20 Nov 2023 19:41:02.280 # +sdown master mymaster x.x.x.158 6379
1261695:X 20 Nov 2023 19:41:02.578 # -sdown master mymaster x.x.x.158 6379
1261695:X 20 Nov 2023 19:41:08.690 # +sdown master mymaster x.x.x.158 6379
1261695:X 20 Nov 2023 19:41:08.763 # +odown master mymaster x.x.x.158 6379 #quorum 5/2
1261695:X 20 Nov 2023 20:19:11.979 # +sdown master mymaster x.x.x.156 6379
1261695:X 20 Nov 2023 20:19:12.909 # -sdown master mymaster x.x.x.156 6379
1261695:X 20 Nov 2023 20:21:30.820 # +sdown master mymaster x.x.x.158 6379
1261695:X 20 Nov 2023 20:21:31.367 # -sdown master mymaster x.x.x.158 6379
1261695:X 20 Nov 2023 20:43:03.314 # +sdown master mymaster x.x.x.158 6379
1261695:X 20 Nov 2023 20:43:03.383 # +odown master mymaster x.x.x.158 6379 #quorum 2/2
1261695:X 20 Nov 2023 20:43:05.506 # -odown master mymaster x.x.x.158 6379
1261695:X 20 Nov 2023 20:43:06.055 # -sdown master mymaster x.x.x.158 6379
1261695:X 20 Nov 2023 20:43:41.098 # +sdown master mymaster x.x.x.158 6379
The Redis logs do not show much at the time the master was marked down:
745984:M 20 Nov 2023 20:38:43.212 * Background saving started by pid 956797
956797:C 20 Nov 2023 20:38:55.180 * DB saved on disk
956797:C 20 Nov 2023 20:38:55.323 * Fork CoW for RDB: current 119 MB, peak 119 MB, average 63 MB
745984:M 20 Nov 2023 20:38:55.501 * Background saving terminated with success
745984:M 20 Nov 2023 20:40:56.035 * 10000 changes in 120 seconds. Saving...
745984:M 20 Nov 2023 20:40:56.190 * Background saving started by pid 957960
957960:C 20 Nov 2023 20:41:07.184 * DB saved on disk
957960:C 20 Nov 2023 20:41:07.311 * Fork CoW for RDB: current 67 MB, peak 67 MB, average 39 MB
745984:M 20 Nov 2023 20:41:07.511 * Background saving terminated with success
745984:M 20 Nov 2023 20:43:05.959 * Connection with replica client id #127367 lost.
745984:M 20 Nov 2023 20:43:05.959 * Connection with replica client id #125482 lost.
745984:M 20 Nov 2023 20:43:05.959 * Connection with replica client id #123807 lost.
745984:M 20 Nov 2023 20:43:08.025 * 10000 changes in 120 seconds. Saving...
745984:M 20 Nov 2023 20:43:08.403 * Background saving started by pid 959624
745984:M 20 Nov 2023 20:43:09.237 * Connection with replica x.x.x.155:6379 lost.
745984:S 20 Nov 2023 20:43:16.027 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
The issue is observed only during the day, when load is higher; the setup is mostly stable at night. The servers hosting Redis have 90+ CPU cores and 600 GB RAM, and when we checked resource utilisation at the time of the issue we found plenty available.
The average number of clients connected during the day is 1200-1300, and the slow log does not show queries longer than 5 seconds.
We have checked the OS logs and did not find any process killing the Redis process. We also tried tuning redis.conf but see no difference.
redis.conf:
bind 0.0.0.0
port 6379
daemonize yes
loadmodule /home/redis/redis-work/redis_sentinel_prod/data/modules/rejson.so
loadmodule /home/redis/redis-work/redis_sentinel_prod/data/modules/redisearch.so TIMEOUT 30000 MAXSEARCHRESULTS -1 MINPREFIX 2
requirepass ""
pidfile "/home/redis/redis-work/redis_sentinel_prod/data/run/redis.pid"
logfile "/home/redis/redis-work/redis_sentinel_prod/data/log/redis.log"
dir "/data/clickhouse/redis-work/redis_sentinel_prod/data"
masterauth ""
# Generated by CONFIG REWRITE
save 3600 1
save 600 100
save 120 10000
latency-tracking-info-percentiles 50 99 99.9
user default on sanitize-payload #ecd71870d1963316a97e3ac3408c9835ad8cf0f3c1bc703527c30265534f75ae ~* &* +@all
rdbcompression no
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit replica 1gb 256mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
timeout 0
disable-thp yes
repl-backlog-size 512mb
We also got these warnings in the logs, but are not sure if they are relevant:
Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. Being disabled, it can also cause failures without low memory condition, see https://github.com/jemalloc/jemalloc/issues/1328. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
[THP]...WARNING:
You have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled (set to 'madvise' or 'never').
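For reference, both warnings spell out their own fixes; run as root, they look like this (the sysctl.conf line persists the setting across reboots, as the warning notes):
sysctl vm.overcommit_memory=1                            # enable overcommit so fork-based BGSAVE does not fail
echo 'vm.overcommit_memory = 1' >> /etc/sysctl.conf      # persist across reboots
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled   # let Redis opt out of THP per-process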
Redis fails over to a slave node and recovers, but it takes a while to load the DB into memory, and this is creating application issues. Can someone please help with this?
Comment From: hpatro
@rishinair19 Do you see any logs like
... for overcoming of output buffer limits
If there is a lot of write traffic during a BGSAVE, the client output buffer can grow beyond the soft/hard thresholds and lead to disconnection of the replica. You could try increasing client-output-buffer-limit for the replica class and observe whether the behavior changes.
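A minimal sketch of raising the replica limits at runtime (values illustrative; CONFIG REWRITE persists them if the server was started from a config file):
redis-cli CONFIG SET client-output-buffer-limit "replica 4gb 1gb 60"
redis-cli CONFIG REWRITE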
Comment From: rishinair19
I had increased client-output-buffer-limit from 256 MB to 1 GB, but am still getting these in the logs:
846301:M 21 Nov 2023 20:58:20.087 # Client id=709662 addr=x.x.x.154:55194 laddr=x.x.x.156:6379 fd=1227 name= age=248 idle=1 flags=S db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=20474 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=17323 omem=355190792 tot-mem=355213192 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= scheduled to be closed ASAP for overcoming of output buffer limits.
846301:M 21 Nov 2023 20:58:20.087 # Client id=709663 addr=x.x.x.155:54728 laddr=x.x.x.156:6379 fd=1230 name= age=248 idle=1 flags=S db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=20474 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=17323 omem=355190792 tot-mem=355213192 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= scheduled to be closed ASAP for overcoming of output buffer limits.
846301:M 22 Nov 2023 03:47:29.001 # Client id=932447 addr=x.x.x.155:16686 laddr=x.x.x.156:6379 fd=171 name= age=82 idle=1 flags=S db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=20474 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=19244 omem=394578976 tot-mem=394601376 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= scheduled to be closed ASAP for overcoming of output buffer limits.
I am setting it to 4 GB and will check whether we get the same messages.
Comment From: rishinair19
I have increased the hard limit to 4 GB and the soft limit to 1 GB, and am not seeing these messages in the logs any more:
scheduled to be closed ASAP for overcoming of output buffer limits.
But the original issue remains: the master node keeps switching.
Logs from master:
1075815:M 22 Nov 2023 10:02:05.320 * Background saving started by pid 2744154
2744154:C 22 Nov 2023 10:02:25.998 * DB saved on disk
2744154:C 22 Nov 2023 10:02:26.260 * Fork CoW for RDB: current 74 MB, peak 74 MB, average 41 MB
1075815:M 22 Nov 2023 10:02:26.582 * Background saving terminated with success
1075815:M 22 Nov 2023 10:03:13.129 * Connection with replica client id #158815 lost.
1075815:M 22 Nov 2023 10:03:16.655 * Connection with replica client id #195633 lost.
1075815:S 22 Nov 2023 10:03:26.707 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
Logs from slave:
745984:M 22 Nov 2023 10:03:12.303 * Connection with master lost.
Is there any other tuning that can be done to fix this?
Comment From: rishinair19
@sundb @oranagra
Apologies for tagging you here, but we really need assistance as this is affecting production.
Comment From: oranagra
i'm not familiar with sentinel and its triggers for failover; maybe @moticless can help here.
one random tip is to try SLOWLOG GET and LATENCY LATEST (each has a config that needs to be enabled first).
i do think the THP warning could be relevant. i'm not sure why disable-thp yes isn't effective (we try to disable it specifically for our process using prctl(PR_SET_THP_DISABLE)); maybe strace can help reveal why.
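for reference, enabling both monitors could look like this (thresholds illustrative):
redis-cli CONFIG SET slowlog-log-slower-than 10000    # microseconds
redis-cli CONFIG SET latency-monitor-threshold 100    # milliseconds
redis-cli SLOWLOG GET 10
redis-cli LATENCY LATEST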
Comment From: benimohit
What are the values of the down-after-milliseconds and failover-timeout options in your sentinel.conf?
Comment From: hpatro
@rishinair19 Would you be able to share the logs for both the primary and the replica covering the period in which the failover happens?
Comment From: rishinair19
@oranagra SLOWLOG does report some queries, but all of them are under 5 ms, so I am not sure what to check there. I had enabled the latency config and ran the LATENCY LATEST and LATENCY DOCTOR commands, but they came back with empty results.
I saw the THP warning once in the logs but not again, so maybe that is resolved now.
@benimohit down-after-milliseconds is 5 seconds; we have not set failover-timeout in sentinel.conf.
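For context, the equivalent sentinel.conf lines would be (the failover-timeout shown is the documented default of 180000 ms, since we have not overridden it):
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 180000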
Comment From: rishinair19
@hpatro Here are the logs:
Sentinel:
1002059:X 23 Nov 2023 01:17:02.299 # +sdown master mymaster x.x.x.158 6379
1002059:X 23 Nov 2023 01:17:02.355 # +odown master mymaster x.x.x.158 6379 #quorum 2/2
1002059:X 23 Nov 2023 01:17:02.564 # -sdown master mymaster x.x.x.158 6379
1002059:X 23 Nov 2023 01:17:02.564 # -odown master mymaster x.x.x.158 6379
1002059:X 23 Nov 2023 01:17:47.627 # +sdown master mymaster x.x.x.158 6379
Master:
745984:M 23 Nov 2023 01:11:44.260 * Background saving started by pid 2987243
2987243:C 23 Nov 2023 01:12:03.834 * DB saved on disk
2987243:C 23 Nov 2023 01:12:04.067 * Fork CoW for RDB: current 125 MB, peak 125 MB, average 70 MB
745984:M 23 Nov 2023 01:12:04.401 * Background saving terminated with success
745984:M 23 Nov 2023 01:14:05.062 * 10000 changes in 120 seconds. Saving...
745984:M 23 Nov 2023 01:14:05.319 * Background saving started by pid 2988540
2988540:C 23 Nov 2023 01:14:24.844 * DB saved on disk
2988540:C 23 Nov 2023 01:14:25.097 * Fork CoW for RDB: current 108 MB, peak 108 MB, average 62 MB
745984:M 23 Nov 2023 01:14:25.456 * Background saving terminated with success
745984:M 23 Nov 2023 01:16:26.011 * 10000 changes in 120 seconds. Saving...
745984:M 23 Nov 2023 01:16:26.379 * Background saving started by pid 2989807
2989807:C 23 Nov 2023 01:16:50.752 * DB saved on disk
2989807:C 23 Nov 2023 01:16:51.081 * Fork CoW for RDB: current 98 MB, peak 98 MB, average 57 MB
745984:M 23 Nov 2023 01:16:51.451 * Background saving terminated with success
745984:M 23 Nov 2023 01:17:02.575 * Connection with replica client id #749489 lost.
745984:M 23 Nov 2023 01:17:03.494 * Connection with replica client id #750466 lost.
745984:S 23 Nov 2023 01:17:13.565 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
Slave1:
4137974:C 23 Nov 2023 01:15:52.982 * DB saved on disk
4137974:C 23 Nov 2023 01:15:53.166 * Fork CoW for RDB: current 81 MB, peak 81 MB, average 56 MB
3466831:S 23 Nov 2023 01:15:53.423 * Background saving terminated with success
3466831:M 23 Nov 2023 01:17:02.564 * Connection with master lost.
3466831:M 23 Nov 2023 01:17:02.564 * Caching the disconnected master state.
3466831:M 23 Nov 2023 01:17:02.564 * Discarding previously cached master state.
3466831:M 23 Nov 2023 01:17:02.564 * Setting secondary replication ID to eed038f37b227b4a47fb5c39cc39fb32b97f0190, valid up to offset: 30705186096829. New replication ID is 94e7744981d8fa9eac36da82b74b28818e0aaca7
3466831:M 23 Nov 2023 01:17:02.564 * MASTER MODE enabled (user request from 'id=198536 addr=x.x.x.158:60716 laddr=x.x.x.155:6379 fd=52 name=sentinel-6aa66685-cmd age=673 idle=0 flags=x db=0 sub=0 psub=0 ssub=0 multi=4 qbuf=188 qbuf-free=20286 argv-mem=4 multi-mem=169 rbs=1024 rbp=1024 obl=45 oll=0 omem=0 tot-mem=22693 events=r cmd=exec user=default redir=-1 resp=2 lib-name= lib-ver=')
3466831:M 23 Nov 2023 01:17:02.566 * CONFIG REWRITE executed with success.
Comment From: rishinair19
Slave2 with debug logs:
3014181:S 23 Nov 2023 01:17:25.184 - DB 0: 7735818 keys (162552 volatile) in 8388608 slots HT.
3014181:S 23 Nov 2023 01:17:25.184 - DB 1: 2 keys (0 volatile) in 4 slots HT.
3014181:S 23 Nov 2023 01:17:25.184 . 11 clients connected (0 replicas), 12496014952 bytes in use
3014181:S 23 Nov 2023 01:17:26.997 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:29.031 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:29.125 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:29.402 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:29.921 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:30.204 - DB 0: 7735818 keys (162552 volatile) in 8388608 slots HT.
3014181:S 23 Nov 2023 01:17:30.204 - DB 1: 2 keys (0 volatile) in 4 slots HT.
3014181:S 23 Nov 2023 01:17:30.204 . 11 clients connected (0 replicas), 12496014968 bytes in use
3014181:S 23 Nov 2023 01:17:30.409 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:31.009 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:31.023 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:31.059 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:31.665 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:31.914 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:31.919 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:35.065 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:35.224 - DB 0: 7735818 keys (162552 volatile) in 8388608 slots HT.
3014181:S 23 Nov 2023 01:17:35.224 - DB 1: 2 keys (0 volatile) in 4 slots HT.
3014181:S 23 Nov 2023 01:17:35.224 . 11 clients connected (0 replicas), 12496014944 bytes in use
3014181:S 23 Nov 2023 01:17:35.657 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:36.404 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:37.919 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:39.379 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:39.663 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:39.985 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:40.244 - DB 0: 7735818 keys (162552 volatile) in 8388608 slots HT.
3014181:S 23 Nov 2023 01:17:40.244 - DB 1: 2 keys (0 volatile) in 4 slots HT.
3014181:S 23 Nov 2023 01:17:40.244 . 11 clients connected (0 replicas), 12496014952 bytes in use
3014181:S 23 Nov 2023 01:17:41.397 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:41.921 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:42.906 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:43.482 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:43.808 - Client closed connection id=83103 addr=x.x.x.158:6379 laddr=x.x.x.154:49122 fd=7 name= age=10476 idle=30 flags=M db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=40954 argv-mem=0 multi-mem=0 rbs=1024 rbp=48 obl=0 oll=0 omem=0 tot-mem=42880 events=r cmd=del user=(superuser) redir=-1 resp=2 lib-name= lib-ver=
3014181:S 23 Nov 2023 01:17:43.808 * Connection with master lost.
3014181:S 23 Nov 2023 01:17:43.808 * Caching the disconnected master state.
3014181:S 23 Nov 2023 01:17:43.808 * Reconnecting to MASTER x.x.x.158:6379
3014181:S 23 Nov 2023 01:17:43.808 * MASTER <-> REPLICA sync started
3014181:S 23 Nov 2023 01:17:43.808 * Non blocking connect for SYNC fired the event.
3014181:S 23 Nov 2023 01:17:43.921 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:44.403 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:45.261 - DB 0: 7735818 keys (162552 volatile) in 8388608 slots HT.
3014181:S 23 Nov 2023 01:17:45.261 - DB 1: 2 keys (0 volatile) in 4 slots HT.
3014181:S 23 Nov 2023 01:17:45.261 . 10 clients connected (0 replicas), 12496054800 bytes in use
3014181:S 23 Nov 2023 01:17:46.124 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:46.917 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:48.409 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:50.006 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:50.278 - DB 0: 7735818 keys (162552 volatile) in 8388608 slots HT.
3014181:S 23 Nov 2023 01:17:50.278 - DB 1: 2 keys (0 volatile) in 4 slots HT.
3014181:S 23 Nov 2023 01:17:50.278 . 10 clients connected (0 replicas), 12496054808 bytes in use
3014181:S 23 Nov 2023 01:17:51.303 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:52.009 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:52.666 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:53.908 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:54.915 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:55.299 - DB 0: 7735818 keys (162552 volatile) in 8388608 slots HT.
3014181:S 23 Nov 2023 01:17:55.299 - DB 1: 2 keys (0 volatile) in 4 slots HT.
3014181:S 23 Nov 2023 01:17:55.299 . 10 clients connected (0 replicas), 12496058912 bytes in use
3014181:S 23 Nov 2023 01:17:56.060 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:56.401 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:56.830 * Master replied to PING, replication can continue...
3014181:S 23 Nov 2023 01:17:57.130 * Trying a partial resynchronization (request eed038f37b227b4a47fb5c39cc39fb32b97f0190:30705197220097).
3014181:S 23 Nov 2023 01:17:57.130 * Master is currently unable to PSYNC but should be in the future: -LOADING Redis is loading the dataset in memory
3014181:S 23 Nov 2023 01:17:57.309 * Connecting to MASTER x.x.x.158:6379
3014181:S 23 Nov 2023 01:17:57.310 * MASTER <-> REPLICA sync started
3014181:S 23 Nov 2023 01:17:57.310 * Non blocking connect for SYNC fired the event.
3014181:S 23 Nov 2023 01:17:57.416 * Master replied to PING, replication can continue...
3014181:S 23 Nov 2023 01:17:57.699 * Trying a partial resynchronization (request eed038f37b227b4a47fb5c39cc39fb32b97f0190:30705197220097).
3014181:S 23 Nov 2023 01:17:57.699 * Master is currently unable to PSYNC but should be in the future: -LOADING Redis is loading the dataset in memory
3014181:S 23 Nov 2023 01:17:58.314 * Connecting to MASTER x.x.x.158:6379
3014181:S 23 Nov 2023 01:17:58.314 * MASTER <-> REPLICA sync started
3014181:S 23 Nov 2023 01:17:58.314 * Non blocking connect for SYNC fired the event.
3014181:S 23 Nov 2023 01:17:58.586 * Master replied to PING, replication can continue...
3014181:S 23 Nov 2023 01:17:58.856 * Trying a partial resynchronization (request eed038f37b227b4a47fb5c39cc39fb32b97f0190:30705197220097).
3014181:S 23 Nov 2023 01:17:59.023 . <search> thread pool contains no more jobs
3014181:S 23 Nov 2023 01:17:59.181 * Master is currently unable to PSYNC but should be in the future: -LOADING Redis is loading the dataset in memory
Comment From: oranagra
can you print the output of INFO STATS DEBUG COMMANDSTATS?
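(assuming the usual CLI invocation, something like:
redis-cli INFO stats commandstats debug
)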
Comment From: rishinair19
# Stats
total_connections_received:42525
total_commands_processed:932046546
instantaneous_ops_per_sec:12861
total_net_input_bytes:180508541172
total_net_output_bytes:59867424379
total_net_repl_input_bytes:174877793118
total_net_repl_output_bytes:57688084627
instantaneous_input_kbps:1170.22
instantaneous_output_kbps:4727.00
instantaneous_input_repl_kbps:0.00
instantaneous_output_repl_kbps:4589.36
rejected_connections:0
sync_full:14
sync_partial_ok:6
sync_partial_err:14
expired_keys:138713
expired_stale_perc:8.98
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:3201
evicted_keys:0
evicted_clients:0
total_eviction_exceeded_time:0
current_eviction_exceeded_time:0
keyspace_hits:436433902
keyspace_misses:7756
pubsub_channels:1
pubsub_patterns:59
pubsubshard_channels:0
latest_fork_usec:255960
total_forks:666
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
total_active_defrag_time:0
current_active_defrag_time:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
total_error_replies:85388
dump_payload_sanitizations:0
total_reads_processed:68706532
total_writes_processed:16133344
io_threaded_reads_processed:0
io_threaded_writes_processed:0
reply_buffer_shrinks:81089
reply_buffer_expands:50808
eventloop_cycles:79125082
eventloop_duration_sum:12999767984
eventloop_duration_cmd_sum:1546657965
instantaneous_eventloop_cycles_per_sec:995
instantaneous_eventloop_duration_usec:66
acl_access_denied_auth:0
acl_access_denied_cmd:0
acl_access_denied_key:0
acl_access_denied_channel:0
# Commandstats
cmdstat_sadd:calls=29,usec=310,usec_per_call=10.69,rejected_calls=0,failed_calls=0
cmdstat_expireat:calls=13427480,usec=10359013,usec_per_call=0.77,rejected_calls=8340,failed_calls=0
cmdstat_dbsize:calls=1,usec=0,usec_per_call=0.00,rejected_calls=0,failed_calls=0
cmdstat_json.set:calls=50458,usec=5654495,usec_per_call=112.06,rejected_calls=14,failed_calls=0
cmdstat_hset:calls=470731301,usec=344883014,usec_per_call=0.73,rejected_calls=6573,failed_calls=0
cmdstat_hmget:calls=38515,usec=155815,usec_per_call=4.05,rejected_calls=0,failed_calls=0
cmdstat_keys:calls=296,usec=610899028,usec_per_call=2063848.00,rejected_calls=0,failed_calls=0
cmdstat_pexpireat:calls=234005085,usec=130585112,usec_per_call=0.56,rejected_calls=0,failed_calls=0
cmdstat_hincrbyfloat:calls=15441379,usec=41105262,usec_per_call=2.66,rejected_calls=8839,failed_calls=0
cmdstat_quit:calls=32618,usec=32646,usec_per_call=1.00,rejected_calls=0,failed_calls=0
cmdstat_module|list:calls=6,usec=24,usec_per_call=4.00,rejected_calls=0,failed_calls=0
cmdstat_bgsave:calls=3,usec=922144,usec_per_call=307381.34,rejected_calls=1,failed_calls=0
cmdstat_select:calls=69,usec=46,usec_per_call=0.67,rejected_calls=0,failed_calls=0
cmdstat_exec:calls=4982,usec=3141638,usec_per_call=630.60,rejected_calls=0,failed_calls=4935
cmdstat_psubscribe:calls=1291,usec=5169,usec_per_call=4.00,rejected_calls=0,failed_calls=0
cmdstat_FT.SEARCH:calls=1452,usec=2736558,usec_per_call=1884.68,rejected_calls=1,failed_calls=0
cmdstat_zrevrange:calls=67,usec=245,usec_per_call=3.66,rejected_calls=0,failed_calls=0
cmdstat_expire:calls=101,usec=240,usec_per_call=2.38,rejected_calls=0,failed_calls=0
cmdstat_command:calls=0,usec=0,usec_per_call=0.00,rejected_calls=1,failed_calls=0
cmdstat_command|docs:calls=0,usec=0,usec_per_call=0.00,rejected_calls=1,failed_calls=0
cmdstat_client|setname:calls=1206,usec=887,usec_per_call=0.74,rejected_calls=0,failed_calls=0
cmdstat_client|kill:calls=84,usec=68535,usec_per_call=815.89,rejected_calls=0,failed_calls=0
cmdstat_ping:calls=515374,usec=370091,usec_per_call=0.72,rejected_calls=44547,failed_calls=0
cmdstat_FT._CREATEIFNX:calls=5,usec=5918,usec_per_call=1183.60,rejected_calls=0,failed_calls=0
cmdstat_config|get:calls=3,usec=286,usec_per_call=95.33,rejected_calls=0,failed_calls=0
cmdstat_config|rewrite:calls=42,usec=96819,usec_per_call=2305.21,rejected_calls=0,failed_calls=0
cmdstat_config|set:calls=4,usec=91,usec_per_call=22.75,rejected_calls=0,failed_calls=0
cmdstat_multi:calls=4893888,usec=260754,usec_per_call=0.05,rejected_calls=0,failed_calls=0
cmdstat_type:calls=5003,usec=2972,usec_per_call=0.59,rejected_calls=0,failed_calls=0
cmdstat_slaveof:calls=42,usec=8521,usec_per_call=202.88,rejected_calls=4935,failed_calls=0
cmdstat_FT._DROPINDEXIFX:calls=5,usec=381,usec_per_call=76.20,rejected_calls=0,failed_calls=0
cmdstat_replconf:calls=9853,usec=6532,usec_per_call=0.66,rejected_calls=0,failed_calls=0
cmdstat_discard:calls=4888906,usec=372641,usec_per_call=0.08,rejected_calls=0,failed_calls=0
cmdstat_publish:calls=629951,usec=2014126,usec_per_call=3.20,rejected_calls=0,failed_calls=0
cmdstat_exists:calls=5555,usec=7958,usec_per_call=1.43,rejected_calls=2,failed_calls=0
cmdstat_subscribe:calls=545,usec=972,usec_per_call=1.78,rejected_calls=0,failed_calls=0
cmdstat_get:calls=15,usec=120,usec_per_call=8.00,rejected_calls=0,failed_calls=0
cmdstat_del:calls=2610381,usec=28356000,usec_per_call=10.86,rejected_calls=0,failed_calls=0
cmdstat_zadd:calls=2,usec=25,usec_per_call=12.50,rejected_calls=0,failed_calls=0
cmdstat_FT._LIST:calls=3,usec=160,usec_per_call=53.33,rejected_calls=0,failed_calls=0
cmdstat_info:calls=161758,usec=8861303,usec_per_call=54.78,rejected_calls=0,failed_calls=0
cmdstat_scan:calls=10,usec=6185,usec_per_call=618.50,rejected_calls=0,failed_calls=0
cmdstat_hmset:calls=241844,usec=19002361,usec_per_call=78.57,rejected_calls=0,failed_calls=0
cmdstat_auth:calls=41210,usec=129030,usec_per_call=3.13,rejected_calls=0,failed_calls=0
cmdstat_json.get:calls=41585,usec=697359,usec_per_call=16.77,rejected_calls=216,failed_calls=465
cmdstat_hincrby:calls=184263700,usec=338943914,usec_per_call=1.84,rejected_calls=6112,failed_calls=0
cmdstat_set:calls=272,usec=2849,usec_per_call=10.47,rejected_calls=0,failed_calls=0
cmdstat_FT.INFO:calls=147,usec=10109,usec_per_call=68.77,rejected_calls=0,failed_calls=0
cmdstat_psync:calls=20,usec=1988,usec_per_call=99.40,rejected_calls=404,failed_calls=0
# Debug
eventloop_duration_aof_sum:2417646
eventloop_duration_cron_sum:273496723
eventloop_duration_max:450214872
eventloop_cmd_per_cycle_max:79590
Comment From: oranagra
eventloop_duration_max:450214872 - that's a half-second-long event loop, so something inside it was really slow, though maybe not slow enough to cause that issue.
are you sure you enabled the latency monitor correctly?
i was expecting to see module-acquire-GIL (a module thread causing delays to the redis main thread).
Comment From: rishinair19
I reduced latency-monitor-threshold to 100 an hour ago.
# Stats
total_connections_received:25190
total_commands_processed:71889064
instantaneous_ops_per_sec:15156
total_net_input_bytes:18862142465
total_net_output_bytes:16416033624
total_net_repl_input_bytes:16390681447
total_net_repl_output_bytes:15885801099
instantaneous_input_kbps:2295.90
instantaneous_output_kbps:9268.53
instantaneous_input_repl_kbps:0.00
instantaneous_output_repl_kbps:9103.22
rejected_connections:0
sync_full:3
sync_partial_ok:5
sync_partial_err:3
expired_keys:71282
expired_stale_perc:7.83
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:1757
evicted_keys:0
evicted_clients:0
total_eviction_exceeded_time:0
current_eviction_exceeded_time:0
keyspace_hits:76381197
keyspace_misses:1359
pubsub_channels:1
pubsub_patterns:54
pubsubshard_channels:0
latest_fork_usec:180626
total_forks:16
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
total_active_defrag_time:0
current_active_defrag_time:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
total_error_replies:15932
dump_payload_sanitizations:0
total_reads_processed:4708312
total_writes_processed:7860937
io_threaded_reads_processed:0
io_threaded_writes_processed:0
reply_buffer_shrinks:24738
reply_buffer_expands:3825
eventloop_cycles:5860020
eventloop_duration_sum:2402662527
eventloop_duration_cmd_sum:382524183
instantaneous_eventloop_cycles_per_sec:1134
instantaneous_eventloop_duration_usec:216
acl_access_denied_auth:0
acl_access_denied_cmd:0
acl_access_denied_key:0
acl_access_denied_channel:0
# Commandstats
cmdstat_FT.INFO:calls=295,usec=28753,usec_per_call=97.47,rejected_calls=1,failed_calls=0
cmdstat_FT.SEARCH:calls=311,usec=19481,usec_per_call=62.64,rejected_calls=0,failed_calls=0
cmdstat_psubscribe:calls=447,usec=1333,usec_per_call=2.98,rejected_calls=0,failed_calls=0
cmdstat_FT.CREATE:calls=2,usec=820,usec_per_call=410.00,rejected_calls=0,failed_calls=0
cmdstat_keys:calls=138,usec=187348455,usec_per_call=1357597.50,rejected_calls=0,failed_calls=0
cmdstat_replconf:calls=8150,usec=9528,usec_per_call=1.17,rejected_calls=0,failed_calls=0
cmdstat_expireat:calls=5642296,usec=4223250,usec_per_call=0.75,rejected_calls=0,failed_calls=0
cmdstat_hincrby:calls=12281345,usec=22821594,usec_per_call=1.86,rejected_calls=0,failed_calls=0
cmdstat_set:calls=90,usec=406,usec_per_call=4.51,rejected_calls=0,failed_calls=0
cmdstat_exists:calls=1115,usec=2412,usec_per_call=2.16,rejected_calls=0,failed_calls=0
cmdstat_json.del:calls=2,usec=177,usec_per_call=88.50,rejected_calls=0,failed_calls=0
cmdstat_slaveof:calls=5,usec=2279,usec_per_call=455.80,rejected_calls=1392,failed_calls=0
cmdstat_del:calls=4559585,usec=29648784,usec_per_call=6.50,rejected_calls=0,failed_calls=0
cmdstat_ping:calls=126841,usec=86867,usec_per_call=0.68,rejected_calls=12997,failed_calls=0
cmdstat_json.get:calls=7756,usec=154141,usec_per_call=19.87,rejected_calls=2,failed_calls=5
cmdstat_json.set:calls=342,usec=37402,usec_per_call=109.36,rejected_calls=0,failed_calls=0
cmdstat_expire:calls=89,usec=74,usec_per_call=0.83,rejected_calls=0,failed_calls=0
cmdstat_module|list:calls=5,usec=20,usec_per_call=4.00,rejected_calls=0,failed_calls=0
cmdstat_get:calls=2,usec=151,usec_per_call=75.50,rejected_calls=0,failed_calls=0
cmdstat_subscribe:calls=95,usec=230,usec_per_call=2.42,rejected_calls=0,failed_calls=0
cmdstat_FT._CREATEIFNX:calls=3,usec=1162,usec_per_call=387.33,rejected_calls=0,failed_calls=0
cmdstat_multi:calls=2028382,usec=107700,usec_per_call=0.05,rejected_calls=0,failed_calls=0
cmdstat_auth:calls=25055,usec=65818,usec_per_call=2.63,rejected_calls=0,failed_calls=0
cmdstat_info:calls=34339,usec=1446522,usec_per_call=42.12,rejected_calls=0,failed_calls=0
cmdstat_exec:calls=1400,usec=27184727,usec_per_call=19417.66,rejected_calls=0,failed_calls=1392
cmdstat_hset:calls=26097794,usec=18814076,usec_per_call=0.72,rejected_calls=0,failed_calls=0
cmdstat_FT._DROPINDEXIFX:calls=3,usec=352,usec_per_call=117.33,rejected_calls=0,failed_calls=0
cmdstat_quit:calls=19649,usec=15324,usec_per_call=0.78,rejected_calls=0,failed_calls=0
cmdstat_FT.DROP:calls=2,usec=2821840,usec_per_call=1410920.00,rejected_calls=0,failed_calls=0
cmdstat_hincrbyfloat:calls=6477251,usec=16857358,usec_per_call=2.60,rejected_calls=0,failed_calls=0
cmdstat_command:calls=0,usec=0,usec_per_call=0.00,rejected_calls=3,failed_calls=0
cmdstat_command|docs:calls=1,usec=1009,usec_per_call=1009.00,rejected_calls=3,failed_calls=0
cmdstat_psync:calls=8,usec=810,usec_per_call=101.25,rejected_calls=135,failed_calls=0
cmdstat_latency|latest:calls=1,usec=26,usec_per_call=26.00,rejected_calls=0,failed_calls=0
cmdstat_latency|history:calls=2,usec=227,usec_per_call=113.50,rejected_calls=1,failed_calls=0
cmdstat_zrevrange:calls=19,usec=51,usec_per_call=2.68,rejected_calls=0,failed_calls=0
cmdstat_hmset:calls=1510021,usec=92818035,usec_per_call=61.47,rejected_calls=0,failed_calls=0
cmdstat_config|rewrite:calls=5,usec=6012,usec_per_call=1202.40,rejected_calls=0,failed_calls=0
cmdstat_config|get:calls=3,usec=186,usec_per_call=62.00,rejected_calls=0,failed_calls=0
cmdstat_config|set:calls=2,usec=11,usec_per_call=5.50,rejected_calls=0,failed_calls=0
cmdstat_discard:calls=2026982,usec=147407,usec_per_call=0.07,rejected_calls=0,failed_calls=0
cmdstat_select:calls=3261,usec=4121,usec_per_call=1.26,rejected_calls=0,failed_calls=0
cmdstat_FT._LIST:calls=6,usec=519,usec_per_call=86.50,rejected_calls=0,failed_calls=0
cmdstat_hmget:calls=59392,usec=206270,usec_per_call=3.47,rejected_calls=0,failed_calls=0
cmdstat_pexpireat:calls=10921518,usec=6014171,usec_per_call=0.55,rejected_calls=0,failed_calls=0
cmdstat_sadd:calls=2,usec=16,usec_per_call=8.00,rejected_calls=0,failed_calls=0
cmdstat_client|setname:calls=192,usec=187,usec_per_call=0.97,rejected_calls=0,failed_calls=0
cmdstat_client|kill:calls=10,usec=5139,usec_per_call=513.90,rejected_calls=0,failed_calls=0
cmdstat_client|setinfo:calls=1298,usec=1577,usec_per_call=1.21,rejected_calls=0,failed_calls=0
cmdstat_publish:calls=53552,usec=197557,usec_per_call=3.69,rejected_calls=0,failed_calls=0
# Debug
eventloop_duration_aof_sum:177253
eventloop_duration_cron_sum:17850248
eventloop_duration_max:543110229
eventloop_cmd_per_cycle_max:3021786
Output of LATENCY LATEST command:
1) 1) "command"
2) (integer) 1700730307
3) (integer) 1100
4) (integer) 5133
2) 1) "module-acquire-GIL"
2) (integer) 1700730073
3) (integer) 191
4) (integer) 191
3) 1) "fork"
2) (integer) 1700730369
3) (integer) 180
4) (integer) 291
Comment From: rishinair19
SLOWLOG GET output:
1) 1) (integer) 169
2) (integer) 1700730545
3) (integer) 1035128
4) 1) "KEYS"
2) "Omitted*"
5) "x.x.x.95:6837"
6) ""
2) 1) (integer) 168
2) (integer) 1700730544
3) (integer) 1015822
4) 1) "KEYS"
2) "Omitted*"
5) "x.x.x.95:6837"
6) ""
3) 1) (integer) 167
2) (integer) 1700730540
3) (integer) 1013108
4) 1) "KEYS"
2) "Omitted*"
5) "x.x.x.95:14600"
6) ""
4) 1) (integer) 166
2) (integer) 1700730539
3) (integer) 1084553
4) 1) "KEYS"
2) "Omitted*"
5) "x.x.x.95:14600"
6) ""
5) 1) (integer) 165
2) (integer) 1700730538
3) (integer) 1101298
4) 1) "KEYS"
2) "Omitted*"
5) "x.x.x.95:24068"
6) ""
6) 1) (integer) 164
2) (integer) 1700730537
3) (integer) 1304204
4) 1) "KEYS"
2) "Omitted*"
5) "x.x.x.95:53897"
6) ""
7) 1) (integer) 163
2) (integer) 1700730488
3) (integer) 143563
4) 1) "latency"
2) "doctor"
5) "x.x.x.21:65040"
6) ""
8) 1) (integer) 162
2) (integer) 1700730429
3) (integer) 1087706
4) 1) "KEYS"
2) "Omitted*"
5) "x.x.x.95:4909"
6) ""
9) 1) (integer) 161
2) (integer) 1700730428
3) (integer) 1088669
4) 1) "KEYS"
2) "Omitted*"
5) "x.x.x.95:4909"
6) ""
10) 1) (integer) 160
2) (integer) 1700730422
3) (integer) 1011626
4) 1) "KEYS"
2) "Omitted*"
5) "x.x.x.95:30936"
6) ""
Comment From: rishinair19
I have now increased client-output-buffer-limit to 10 GB; repl-backlog-size is 512 MB. This is happening almost every 30 minutes.
Comment From: rishinair19
Another important observation is that the issue occurs when load is high. While we were loading keys, we found the master changing every 2-3 minutes:
2023-11-23T12:12:58.678Z | [MasterListener-mymaster-[redis:26379]] | INFO | r.clients.jedis.JedisSentinelPool | Created JedisPool to master at x.x.x.155:6379
2023-11-23T12:15:12.655Z | [MasterListener-mymaster-[redis:26379]] | INFO | r.clients.jedis.JedisSentinelPool | Created JedisPool to master at x.x.x.158:6379
2023-11-23T12:16:39.609Z | [MasterListener-mymaster-[redis:26379]] | INFO | r.clients.jedis.JedisSentinelPool | Created JedisPool to master at x.x.x.157:6379
Comment From: rishinair19
We disabled the KEYS command and it did reduce load on Redis, but after a while Redis went down again.
We did some load testing and found no issues when running heavy read queries; CPU spiked to 100% for a few seconds but came down immediately.
In write tests, CPU was continuously at 100% and the master was marked down after a few seconds. The child process forked during BGSAVE was also taking 100% CPU. Disabling BGSAVE did not make much difference. We had tried increasing io-threads, but it causes Redis to crash:
https://github.com/redis/redis/issues/12785
We can't optimize these queries, so we will try Redis Cluster with multiple masters.
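For reference, a sketch of how KEYS can be disabled and what clients can migrate to (the pattern is hypothetical; rename-command takes effect only after a restart):
rename-command KEYS ""                 # in redis.conf: removes the command entirely
redis-cli --scan --pattern 'user:*'    # non-blocking SCAN-based alternative for ad-hoc lookups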
Comment From: moticless
Regarding the sentinel part of this issue, I doubt it is the root cause of the problem.
Sentinel only samples the master and replica instances, and steps in only when the master has been unavailable for down-after-milliseconds and there is quorum that the master is not reachable.
From the logs it is rather obvious that the master is truly unreachable by all entities. You can increase the value of down-after-milliseconds to avoid the failover attempts, as in the sketch below.
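A sketch, applied on each sentinel separately since SENTINEL SET is not propagated (the value is illustrative):
redis-cli -p 26379 SENTINEL SET mymaster down-after-milliseconds 30000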
Comment From: oranagra
@moticless so you're saying that sentinel triggered that failover because redis didn't respond to PING for more than 5 seconds (that part is clear from the logs); no other issues, and the connections weren't dropped.
and looking at the latency metrics we don't see such long delays, so we don't have an explanation for why PING got delayed so much.
Comment From: moticless
Ping should be sent once a second. It might be that:
1. The ping got delayed longer than we think.
2. The ping response was received within the timeframe, but the message was not the expected PONG, LOADING or MASTERDOWN.
3. The timeout to switchover is shorter than we think.
4. Something else...
@rishinair19, to verify sentinel is working as expected in your environment, maybe you can simulate a temporary unavailability of the master in a similar non-production environment, verify the 5-second timeout behaves as expected, and watch the switchover. If possible, also verify ping packets are being sent during that time, perhaps with kill -STOP PID (see the sketch below). Thanks.
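A sketch of such a simulation in a test environment (the PID placeholder is the master redis-server's; with down-after-milliseconds at 5000, the sentinels should report +sdown roughly 5 seconds in):
kill -STOP <redis-master-pid>   # freeze the process without closing its connections
sleep 10
kill -CONT <redis-master-pid>   # resume it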
Comment From: rishinair19
Sentinel is working as expected. We have manually stopped services, and sentinel does fail Redis over to a different node time and time again. The issue is that Redis does not respond to PING because the CPU core is exhausted on the master. We do have multiple cores, but since Redis command execution is single-threaded, it does not benefit from them.
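A quick way to confirm it is the single main thread that is pegged (assuming the standard redis-server process name):
top -H -p "$(pgrep -x redis-server | head -1)"   # per-thread CPU; io_thd_* and bio_* threads show separately from the main thread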
Comment From: moticless
Have you observed that the timeout to switchover is indeed 5 seconds and not less? I want to eliminate the possibility that Sentinel itself is the root cause.