Hello,
This is possibly a follow up of #12787 which didn't offer a solution. And it's also possibly a sentinel issue.
I'm running a 3 nodes redis + sentinel HA setup, which is used as a cachestore for a quite large Moodle installation (~30k users). The redis master node is reach by each webserver through a local HA Proxy instance, which probe for the master node every seconds.
Since november 2023 (and possibly the 7.2.2 update), we discovered that our master was switching to another node every hour. (at the 50minutes mark to be precise). This match a hourly cron job in moodle called "cache_cron_task" which basically clean all stale session in redis.
As soon as this task starts, sentinel marks the master as down, and switchover to another slave. This trigger a small downtime on moodle, because the whole process take ~3-4 seconds to HA Proxy to detect the change of master, and during that time, users get an error about using a Readonly redis instance.
Steps taken to try to solve the issue so far : - removed the snapshot (save statments, thinking it was the isue) - added client-output-buffer-limit to replica to 4gb - added the latency monitor and slow log config to get more information - moved down-after-milliseconds from 2500 to 7500 in sentinel
Redis configuration :
# BEGIN ANSIBLE MANAGED BLOCK
bind *
protected-mode yes
port 6379
tcp-backlog 511
timeout 0
tcp-keepalive 300
daemonize no
supervised systemd
loglevel notice
logfile "/var/log/redis/redis-server.log"
latency-monitor-threshold 100
slowlog-log-slower-than 10000
slowlog-max-len 128
shutdown-on-sigint save
shutdown-on-sigterm save
databases 16
always-show-logo yes
save ""
rdbcompression yes
rdbchecksum yes
client-output-buffer-limit normal 0 0 0
repl-backlog-size 256mb
dbfilename "dump.rdb"
dir "/var/lib/redis"
requirepass "<redacted>"
masterauth "<redacted>"
user default on sanitize-payload <redacted> ~* &* +@all
# END ANSIBLE MANAGED BLOCK
# Generated by CONFIG REWRITE
latency-tracking-info-percentiles 50 99 99.9
user exporter on sanitize-payload <redacted> resetchannels -@all +info +slowlog +latency +config|get +client|setname
client-output-buffer-limit replica 4gb 1gb 60
replicaof 10.98.241.48 6379
Sentinel configuration
```# BEGIN ANSIBLE MANAGED BLOCK
bind 10.98.241.47
port 16379
supervised systemd
sentinel monitor cybprod 10.98.241.48 6379 2
sentinel down-after-milliseconds cybprod 7500
sentinel failover-timeout cybprod 15000
sentinel auth-pass cybprod
END ANSIBLE MANAGED BLOCK
Generated by CONFIG REWRITE
dir "/" latency-tracking-info-percentiles 50 99 99.9 protected-mode no user default on nopass sanitize-payload ~ & +@all sentinel myid e647245612f446e17b475764baaf0414c7569f95 sentinel config-epoch cybprod 1566 sentinel leader-epoch cybprod 1566 sentinel current-epoch 1566
sentinel known-replica cybprod 10.98.241.49 6379
sentinel known-replica cybprod 10.98.241.47 6379
sentinel known-sentinel cybprod 10.98.241.49 16379 616e23ceee4e79333ae24bda72e13f54c03f1b6f
sentinel known-sentinel cybprod 10.98.241.48 16379 e0fc0d9e0d5e15e220c0f5bc289ccb61f529db02
**Interlaced Redis/Sentinel logs of the event on the Master**
689469:X 19 Jan 2024 10:50:10.143 * Sentinel new configuration saved on disk 689469:X 19 Jan 2024 10:50:10.143 # +new-epoch 1566 689469:X 19 Jan 2024 10:50:10.147 * Sentinel new configuration saved on disk 689469:X 19 Jan 2024 10:50:10.147 # +vote-for-leader e0fc0d9e0d5e15e220c0f5bc289ccb61f529db02 1566 689469:X 19 Jan 2024 10:50:10.147 # +sdown master cybprod 10.98.241.49 6379 689469:X 19 Jan 2024 10:50:10.219 # +odown master cybprod 10.98.241.49 6379 #quorum 3/2 689469:X 19 Jan 2024 10:50:10.219 * Next failover delay: I will not start a failover before Fri Jan 19 10:50:40 2024 689469:X 19 Jan 2024 10:50:11.246 - -role-change slave 10.98.241.48:6379 10.98.241.48 6379 @ cybprod 10.98.241.49 6379 new reported role is master 689469:X 19 Jan 2024 10:50:11.274 # +config-update-from sentinel e0fc0d9e0d5e15e220c0f5bc289ccb61f529db02 10.98.241.48 16379 @ cybprod 10.98.241.49 6379 689469:X 19 Jan 2024 10:50:11.274 # +switch-master cybprod 10.98.241.49 6379 10.98.241.48 6379 689469:X 19 Jan 2024 10:50:11.275 * +slave slave 10.98.241.47:6379 10.98.241.47 6379 @ cybprod 10.98.241.48 6379 689469:X 19 Jan 2024 10:50:11.275 * +slave slave 10.98.241.49:6379 10.98.241.49 6379 @ cybprod 10.98.241.48 6379 689469:X 19 Jan 2024 10:50:11.280 * Sentinel new configuration saved on disk
==> /var/log/redis/redis-server.log <== 696770:M 19 Jan 2024 10:50:12.606 * Connection with replica client id #2794945 lost. 696770:M 19 Jan 2024 10:50:12.606 * Connection with replica client id #2794659 lost.
==> /var/log/redis/redis-sentinel.log <== 689469:X 19 Jan 2024 10:50:12.613 - -role-change slave 10.98.241.49:6379 10.98.241.49 6379 @ cybprod 10.98.241.48 6379 new reported role is master 689469:X 19 Jan 2024 10:50:22.629 * +convert-to-slave slave 10.98.241.49:6379 10.98.241.49 6379 @ cybprod 10.98.241.48 6379
==> /var/log/redis/redis-server.log <== 696770:S 19 Jan 2024 10:50:22.629 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer. 696770:S 19 Jan 2024 10:50:22.629 * Connecting to MASTER 10.98.241.48:6379 696770:S 19 Jan 2024 10:50:22.629 * MASTER <-> REPLICA sync started 696770:S 19 Jan 2024 10:50:22.629 * REPLICAOF 10.98.241.48:6379 enabled (user request from 'id=2991088 addr=10.98.241.49:44492 laddr=10.98.241.49:6379 fd=130 name=sentinel-616e23ce-cmd age=10 idle=0 flags=x db=0 sub=0 psub=0 ssub=0 multi=4 qbuf=200 qbuf-free=20274 argv-mem=4 multi-mem=180 rbs=1024 rbp=1024 obl=45 oll=0 omem=0 tot-mem=22704 events=r cmd=exec user=default redir=-1 resp=2 lib-name= lib-ver=') 696770:S 19 Jan 2024 10:50:22.634 * CONFIG REWRITE executed with success. 696770:S 19 Jan 2024 10:50:22.635 * Non blocking connect for SYNC fired the event. 696770:S 19 Jan 2024 10:50:22.635 * Master replied to PING, replication can continue... 696770:S 19 Jan 2024 10:50:22.636 * Trying a partial resynchronization (request 07573dd536f58253100a7568561b3039e6841c45:1716935308098). 696770:S 19 Jan 2024 10:50:27.293 * Full resync from master: b4a757831be0a10f95b688af2514f8a87b0d0fcd:1716933874248 696770:S 19 Jan 2024 10:50:27.491 * MASTER <-> REPLICA sync: receiving streamed RDB from master with EOF to disk
==> /var/log/redis/redis-sentinel.log <== 689469:X 19 Jan 2024 10:50:32.714 - +role-change slave 10.98.241.49:6379 10.98.241.49 6379 @ cybprod 10.98.241.48 6379 new reported role is slave
==> /var/log/redis/redis-server.log <== 696770:S 19 Jan 2024 10:51:11.256 * Discarding previously cached master state. 696770:S 19 Jan 2024 10:51:11.257 * MASTER <-> REPLICA sync: Flushing old data
==> /var/log/redis/redis-sentinel.log <== 689469:X 19 Jan 2024 10:51:19.570 # +sdown slave 10.98.241.49:6379 10.98.241.49 6379 @ cybprod 10.98.241.48 6379
==> /var/log/redis/redis-server.log <== 696770:S 19 Jan 2024 10:51:29.696 * MASTER <-> REPLICA sync: Loading DB in memory 696770:S 19 Jan 2024 10:51:29.702 * Loading RDB produced by version 7.2.4 696770:S 19 Jan 2024 10:51:29.702 * RDB age 62 seconds 696770:S 19 Jan 2024 10:51:29.702 * RDB memory usage when created 3601.71 Mb
==> /var/log/redis/redis-sentinel.log <== 689469:X 19 Jan 2024 10:51:29.779 # -sdown slave 10.98.241.49:6379 10.98.241.49 6379 @ cybprod 10.98.241.48 6379
==> /var/log/redis/redis-server.log <== 696770:S 19 Jan 2024 10:51:48.201 * Done loading RDB, keys loaded: 42220, keys expired: 0. 696770:S 19 Jan 2024 10:51:48.205 * MASTER <-> REPLICA sync: Finished with success
**Interlaced Redis/Sentinel logs of the event on a Slave becoming master**
==> /var/log/redis/redis-sentinel.log <== 729906:X 19 Jan 2024 10:50:10.031 # +sdown master cybprod 10.98.241.49 6379 729906:X 19 Jan 2024 10:50:10.132 # +odown master cybprod 10.98.241.49 6379 #quorum 2/2 729906:X 19 Jan 2024 10:50:10.132 # +new-epoch 1566 729906:X 19 Jan 2024 10:50:10.132 # +try-failover master cybprod 10.98.241.49 6379 729906:X 19 Jan 2024 10:50:10.137 * Sentinel new configuration saved on disk 729906:X 19 Jan 2024 10:50:10.137 # +vote-for-leader e0fc0d9e0d5e15e220c0f5bc289ccb61f529db02 1566 729906:X 19 Jan 2024 10:50:10.147 * 616e23ceee4e79333ae24bda72e13f54c03f1b6f voted for e0fc0d9e0d5e15e220c0f5bc289ccb61f529db02 1566 729906:X 19 Jan 2024 10:50:10.147 * e647245612f446e17b475764baaf0414c7569f95 voted for e0fc0d9e0d5e15e220c0f5bc289ccb61f529db02 1566 729906:X 19 Jan 2024 10:50:10.196 # +elected-leader master cybprod 10.98.241.49 6379 729906:X 19 Jan 2024 10:50:10.196 # +failover-state-select-slave master cybprod 10.98.241.49 6379 729906:X 19 Jan 2024 10:50:10.279 # +selected-slave slave 10.98.241.48:6379 10.98.241.48 6379 @ cybprod 10.98.241.49 6379 729906:X 19 Jan 2024 10:50:10.279 * +failover-state-send-slaveof-noone slave 10.98.241.48:6379 10.98.241.48 6379 @ cybprod 10.98.241.49 6379 729906:X 19 Jan 2024 10:50:10.337 * +failover-state-wait-promotion slave 10.98.241.48:6379 10.98.241.48 6379 @ cybprod 10.98.241.49 6379
==> /var/log/redis/redis-server.log <== 735521:M 19 Jan 2024 10:50:10.338 * Connection with master lost. 735521:M 19 Jan 2024 10:50:10.338 * Caching the disconnected master state. 735521:M 19 Jan 2024 10:50:10.338 * Discarding previously cached master state. 735521:M 19 Jan 2024 10:50:10.338 * Setting secondary replication ID to 07573dd536f58253100a7568561b3039e6841c45, valid up to offset: 1716932440795. New replication ID is b4a757831be0a10f95b688af2514f8a87b0d0fcd 735521:M 19 Jan 2024 10:50:10.338 * MASTER MODE enabled (user request from 'id=3206483 addr=10.98.241.48:54442 laddr=10.98.241.48:6379 fd=25 name=sentinel-e0fc0d9e-cmd age=3487 idle=0 flags=x db=0 sub=0 psub=0 ssub=0 multi=4 qbuf=188 qbuf-free=20286 argv-mem=4 multi-mem=169 rbs=2048 rbp=1024 obl=45 oll=0 omem=0 tot-mem=23717 events=r cmd=exec user=default redir=-1 resp=2 lib-name= lib-ver=') 735521:M 19 Jan 2024 10:50:10.343 * CONFIG REWRITE executed with success.
==> /var/log/redis/redis-sentinel.log <== 729906:X 19 Jan 2024 10:50:11.183 - -role-change slave 10.98.241.48:6379 10.98.241.48 6379 @ cybprod 10.98.241.49 6379 new reported role is master 729906:X 19 Jan 2024 10:50:11.188 * Sentinel new configuration saved on disk 729906:X 19 Jan 2024 10:50:11.188 # +promoted-slave slave 10.98.241.48:6379 10.98.241.48 6379 @ cybprod 10.98.241.49 6379 729906:X 19 Jan 2024 10:50:11.188 # +failover-state-reconf-slaves master cybprod 10.98.241.49 6379 729906:X 19 Jan 2024 10:50:11.273 * +slave-reconf-sent slave 10.98.241.47:6379 10.98.241.47 6379 @ cybprod 10.98.241.49 6379
==> /var/log/redis/redis-server.log <== 735521:M 19 Jan 2024 10:50:11.282 * Replica 10.98.241.47:6379 asks for synchronization 735521:M 19 Jan 2024 10:50:11.282 * Partial resynchronization request from 10.98.241.47:6379 accepted. Sending 455 bytes of backlog starting from offset 1716932440795.
==> /var/log/redis/redis-sentinel.log <== 729906:X 19 Jan 2024 10:50:12.247 * +slave-reconf-inprog slave 10.98.241.47:6379 10.98.241.47 6379 @ cybprod 10.98.241.49 6379 729906:X 19 Jan 2024 10:50:12.247 * +slave-reconf-done slave 10.98.241.47:6379 10.98.241.47 6379 @ cybprod 10.98.241.49 6379 729906:X 19 Jan 2024 10:50:12.310 # -odown master cybprod 10.98.241.49 6379 729906:X 19 Jan 2024 10:50:12.310 # +failover-end master cybprod 10.98.241.49 6379 729906:X 19 Jan 2024 10:50:12.310 # +switch-master cybprod 10.98.241.49 6379 10.98.241.48 6379 729906:X 19 Jan 2024 10:50:12.310 * +slave slave 10.98.241.47:6379 10.98.241.47 6379 @ cybprod 10.98.241.48 6379 729906:X 19 Jan 2024 10:50:12.310 * +slave slave 10.98.241.49:6379 10.98.241.49 6379 @ cybprod 10.98.241.48 6379 729906:X 19 Jan 2024 10:50:12.314 * Sentinel new configuration saved on disk 729906:X 19 Jan 2024 10:50:12.612 - -role-change slave 10.98.241.49:6379 10.98.241.49 6379 @ cybprod 10.98.241.48 6379 new reported role is master
==> /var/log/redis/redis-server.log <== 735521:M 19 Jan 2024 10:50:22.636 * Replica 10.98.241.49:6379 asks for synchronization 735521:M 19 Jan 2024 10:50:22.636 * Partial resynchronization not accepted: Requested offset for second ID was 1716935308098, but I can reply up to 1716932440795 735521:M 19 Jan 2024 10:50:22.636 * Delay next BGSAVE for diskless SYNC
==> /var/log/redis/redis-sentinel.log <== 729906:X 19 Jan 2024 10:50:22.707 - +role-change slave 10.98.241.49:6379 10.98.241.49 6379 @ cybprod 10.98.241.48 6379 new reported role is slave
==> /var/log/redis/redis-server.log <== 735521:M 19 Jan 2024 10:50:27.293 * Starting BGSAVE for SYNC with target: replicas sockets 735521:M 19 Jan 2024 10:50:27.381 * Background RDB transfer started by pid 1028807 1028807:C 19 Jan 2024 10:51:11.312 * Fork CoW for RDB: current 19 MB, peak 19 MB, average 13 MB 735521:M 19 Jan 2024 10:51:11.313 * Diskless rdb transfer, done reading from pipe, 1 replicas still up. 735521:M 19 Jan 2024 10:51:11.391 * Background RDB transfer terminated with success 735521:M 19 Jan 2024 10:51:11.391 * Streamed RDB transfer with replica 10.98.241.49:6379 succeeded (socket). Waiting for REPLCONF ACK from replica to enable streaming 735521:M 19 Jan 2024 10:51:11.391 * Synchronization with replica 10.98.241.49:6379 succeeded
==> /var/log/redis/redis-sentinel.log <== 729906:X 19 Jan 2024 10:51:18.934 # +sdown slave 10.98.241.49:6379 10.98.241.49 6379 @ cybprod 10.98.241.48 6379 729906:X 19 Jan 2024 10:51:29.754 # -sdown slave 10.98.241.49:6379 10.98.241.49 6379 @ cybprod 10.98.241.48 6379
**Interlaced Redis/Sentinel logs of the event on a Slave staying a Slave**
691689:X 19 Jan 2024 10:50:09.386 # +sdown master cybprod 10.98.241.49 6379 691689:X 19 Jan 2024 10:50:10.143 * Sentinel new configuration saved on disk 691689:X 19 Jan 2024 10:50:10.143 # +new-epoch 1566 691689:X 19 Jan 2024 10:50:10.147 * Sentinel new configuration saved on disk 691689:X 19 Jan 2024 10:50:10.148 # +vote-for-leader e0fc0d9e0d5e15e220c0f5bc289ccb61f529db02 1566 691689:X 19 Jan 2024 10:50:10.479 # +odown master cybprod 10.98.241.49 6379 #quorum 3/2 691689:X 19 Jan 2024 10:50:10.479 * Next failover delay: I will not start a failover before Fri Jan 19 10:50:41 2024 691689:X 19 Jan 2024 10:50:10.480 - -role-change slave 10.98.241.48:6379 10.98.241.48 6379 @ cybprod 10.98.241.49 6379 new reported role is master
==> /var/log/redis/redis-server.log <== 699343:M 19 Jan 2024 10:50:11.274 * Connection with master lost. 699343:M 19 Jan 2024 10:50:11.274 * Caching the disconnected master state. 699343:S 19 Jan 2024 10:50:11.274 * Connecting to MASTER 10.98.241.48:6379
==> /var/log/redis/redis-sentinel.log <== 691689:X 19 Jan 2024 10:50:11.274 # +config-update-from sentinel e0fc0d9e0d5e15e220c0f5bc289ccb61f529db02 10.98.241.48 16379 @ cybprod 10.98.241.49 6379 691689:X 19 Jan 2024 10:50:11.274 # +switch-master cybprod 10.98.241.49 6379 10.98.241.48 6379
==> /var/log/redis/redis-server.log <== 699343:S 19 Jan 2024 10:50:11.274 * MASTER <-> REPLICA sync started 699343:S 19 Jan 2024 10:50:11.274 * REPLICAOF 10.98.241.48:6379 enabled (user request from 'id=300243 addr=10.98.241.48:52966 laddr=10.98.241.47:6379 fd=50 name=sentinel-e0fc0d9e-cmd age=3529 idle=0 flags=x db=0 sub=0 psub=0 ssub=0 multi=4 qbuf=344 qbuf-free=20130 argv-mem=4 multi-mem=180 rbs=2048 rbp=2048 obl=45 oll=0 omem=0 tot-mem=23728 events=r cmd=exec user=default redir=-1 resp=2 lib-name= lib-ver=')
==> /var/log/redis/redis-sentinel.log <== 691689:X 19 Jan 2024 10:50:11.274 * +slave slave 10.98.241.47:6379 10.98.241.47 6379 @ cybprod 10.98.241.48 6379 691689:X 19 Jan 2024 10:50:11.274 * +slave slave 10.98.241.49:6379 10.98.241.49 6379 @ cybprod 10.98.241.48 6379 691689:X 19 Jan 2024 10:50:11.279 * Sentinel new configuration saved on disk
==> /var/log/redis/redis-server.log <== 699343:S 19 Jan 2024 10:50:11.281 * CONFIG REWRITE executed with success. 699343:S 19 Jan 2024 10:50:11.281 * Non blocking connect for SYNC fired the event. 699343:S 19 Jan 2024 10:50:11.281 * Master replied to PING, replication can continue... 699343:S 19 Jan 2024 10:50:11.282 * Trying a partial resynchronization (request 07573dd536f58253100a7568561b3039e6841c45:1716932440795). 699343:S 19 Jan 2024 10:50:11.282 * Successful partial resynchronization with master. 699343:S 19 Jan 2024 10:50:11.282 * Master replication ID changed to b4a757831be0a10f95b688af2514f8a87b0d0fcd 699343:S 19 Jan 2024 10:50:11.282 * MASTER <-> REPLICA sync: Master accepted a Partial Resynchronization.
==> /var/log/redis/redis-sentinel.log <== 691689:X 19 Jan 2024 10:50:12.613 - -role-change slave 10.98.241.49:6379 10.98.241.49 6379 @ cybprod 10.98.241.48 6379 new reported role is master 691689:X 19 Jan 2024 10:50:22.669 - +role-change slave 10.98.241.49:6379 10.98.241.49 6379 @ cybprod 10.98.241.48 6379 new reported role is slave 691689:X 19 Jan 2024 10:51:19.154 # +sdown slave 10.98.241.49:6379 10.98.241.49 6379 @ cybprod 10.98.241.48 6379 691689:X 19 Jan 2024 10:51:29.797 # -sdown slave 10.98.241.49:6379 10.98.241.49 6379 @ cybprod 10.98.241.48 6379
**INFO STATS DEBUG COMMANDSTATS on Master**
Stats
total_connections_received:3267544 total_commands_processed:327900574 instantaneous_ops_per_sec:3652 total_net_input_bytes:150770855053 total_net_output_bytes:529880085501 total_net_repl_input_bytes:88810132201 total_net_repl_output_bytes:129822712413 instantaneous_input_kbps:587.00 instantaneous_output_kbps:7565.21 instantaneous_input_repl_kbps:0.00 instantaneous_output_repl_kbps:478.53 rejected_connections:0 sync_full:30 sync_partial_ok:26 sync_partial_err:30 expired_keys:270675 expired_stale_perc:0.98 expired_time_cap_reached_count:0 expire_cycle_cpu_milliseconds:17602 evicted_keys:0 evicted_clients:0 total_eviction_exceeded_time:0 current_eviction_exceeded_time:0 keyspace_hits:291749482 keyspace_misses:464580 pubsub_channels:1 pubsub_patterns:0 pubsubshard_channels:0 latest_fork_usec:88796 total_forks:30 migrate_cached_sockets:0 slave_expires_tracked_keys:0 active_defrag_hits:0 active_defrag_misses:0 active_defrag_key_hits:0 active_defrag_key_misses:0 total_active_defrag_time:0 current_active_defrag_time:0 tracking_total_keys:0 tracking_total_items:0 tracking_total_prefixes:0 unexpected_error_replies:0 total_error_replies:5060 dump_payload_sanitizations:0 total_reads_processed:334812878 total_writes_processed:343661951 io_threaded_reads_processed:0 io_threaded_writes_processed:0 reply_buffer_shrinks:276533 reply_buffer_expands:97355 eventloop_cycles:303143416 eventloop_duration_sum:7181668348 eventloop_duration_cmd_sum:869364914 instantaneous_eventloop_cycles_per_sec:3410 instantaneous_eventloop_duration_usec:26 acl_access_denied_auth:0 acl_access_denied_cmd:0 acl_access_denied_key:0 acl_access_denied_channel:0
Commandstats
cmdstat_setex:calls=830120,usec=6202707,usec_per_call=7.47,rejected_calls=16,failed_calls=0 cmdstat_zrem:calls=19763,usec=144802,usec_per_call=7.33,rejected_calls=0,failed_calls=0 cmdstat_set:calls=755657,usec=5323839,usec_per_call=7.05,rejected_calls=0,failed_calls=0 cmdstat_setnx:calls=3180900,usec=12571596,usec_per_call=3.95,rejected_calls=771,failed_calls=0 cmdstat_replconf:calls=195629,usec=217924,usec_per_call=1.11,rejected_calls=0,failed_calls=0 cmdstat_hmget:calls=1915596,usec=8181110,usec_per_call=4.27,rejected_calls=0,failed_calls=0 cmdstat_zadd:calls=6083066,usec=40126538,usec_per_call=6.60,rejected_calls=1,failed_calls=0 cmdstat_command|docs:calls=2,usec=2338,usec_per_call=1169.00,rejected_calls=0,failed_calls=0 cmdstat_hset:calls=11458624,usec=52228325,usec_per_call=4.56,rejected_calls=464,failed_calls=0 cmdstat_get:calls=1367290,usec=5094801,usec_per_call=3.73,rejected_calls=0,failed_calls=0 cmdstat_info:calls=505270,usec=15588088,usec_per_call=30.85,rejected_calls=0,failed_calls=0 cmdstat_expire:calls=2396157,usec=5957034,usec_per_call=2.49,rejected_calls=0,failed_calls=0 cmdstat_zrangebyscore:calls=378,usec=16973,usec_per_call=44.90,rejected_calls=0,failed_calls=0 cmdstat_pexpireat:calls=2195927,usec=5326262,usec_per_call=2.43,rejected_calls=0,failed_calls=0 cmdstat_psync:calls=56,usec=5506,usec_per_call=98.32,rejected_calls=0,failed_calls=0 cmdstat_hget:calls=288930367,usec=400189671,usec_per_call=1.39,rejected_calls=698,failed_calls=0 cmdstat_client|kill:calls=106,usec=13440,usec_per_call=126.79,rejected_calls=0,failed_calls=0 cmdstat_client|setname:calls=7064,usec=11129,usec_per_call=1.58,rejected_calls=0,failed_calls=0 cmdstat_subscribe:calls=479,usec=1437,usec_per_call=3.00,rejected_calls=0,failed_calls=0 cmdstat_config|get:calls=6159,usec=1207224,usec_per_call=196.01,rejected_calls=0,failed_calls=0 cmdstat_config|rewrite:calls=53,usec=328473,usec_per_call=6197.60,rejected_calls=0,failed_calls=0 cmdstat_exec:calls=53,usec=354013,usec_per_call=6679.49,rejected_calls=0,failed_calls=0 cmdstat_hdel:calls=45600,usec=270022,usec_per_call=5.92,rejected_calls=9,failed_calls=0 cmdstat_publish:calls=408466,usec=2801095,usec_per_call=6.86,rejected_calls=0,failed_calls=0 cmdstat_ping:calls=964943,usec=974662,usec_per_call=1.01,rejected_calls=3084,failed_calls=0 cmdstat_slaveof:calls=53,usec=11239,usec_per_call=212.06,rejected_calls=0,failed_calls=0 cmdstat_del:calls=2986128,usec=54428830,usec_per_call=18.23,rejected_calls=15,failed_calls=0 cmdstat_multi:calls=53,usec=86,usec_per_call=1.62,rejected_calls=0,failed_calls=0 cmdstat_select:calls=27,usec=28,usec_per_call=1.04,rejected_calls=0,failed_calls=0 cmdstat_hmset:calls=54660,usec=357192,usec_per_call=6.53,rejected_calls=0,failed_calls=0 cmdstat_quit:calls=299313,usec=101885,usec_per_call=0.34,rejected_calls=0,failed_calls=0 cmdstat_auth:calls=3267544,usec=11984532,usec_per_call=3.67,rejected_calls=0,failed_calls=0 cmdstat_hkeys:calls=431,usec=235724088,usec_per_call=546923.69,rejected_calls=0,failed_calls=0 cmdstat_latency|histogram:calls=6159,usec=3902199,usec_per_call=633.58,rejected_calls=0,failed_calls=0 cmdstat_latency|latest:calls=6161,usec=29069,usec_per_call=4.72,rejected_calls=0,failed_calls=0 cmdstat_slowlog:calls=0,usec=0,usec_per_call=0.00,rejected_calls=1,failed_calls=0 cmdstat_slowlog|len:calls=6159,usec=10245,usec_per_call=1.66,rejected_calls=0,failed_calls=0 cmdstat_slowlog|get:calls=6161,usec=29664,usec_per_call=4.81,rejected_calls=0,failed_calls=0
Debug
eventloop_duration_aof_sum:8361624 eventloop_duration_cron_sum:382410222 eventloop_duration_max:37163573 eventloop_cmd_per_cycle_max:901
**LATENCY LATEST on Master**
127.0.0.1:6379> LATENCY LATEST 1) 1) "fork" 2) (integer) 1705629026 3) (integer) 137 4) (integer) 148 2) 1) "command" 2) (integer) 1705654209 3) (integer) 8548 4) (integer) 13088
**The slowlog show the request made by the cron at the 50minutes mark**
127.0.0.1:6379> SLOWLOG GET 5 1) 1) (integer) 31 2) (integer) 1705654209 3) (integer) 8548895 4) 1) "HKEYS" 2) "60cc7ee13996d44df8cfd013352532f7" 5) "10.98.241.46:53394" 6) "" 2) 1) (integer) 30 2) (integer) 1705647009 3) (integer) 8588949 4) 1) "HKEYS" 2) "60cc7ee13996d44df8cfd013352532f7" 5) "10.98.241.44:37048" 6) "" 3) 1) (integer) 29 2) (integer) 1705639809 3) (integer) 8491836 4) 1) "HKEYS" 2) "60cc7ee13996d44df8cfd013352532f7" 5) "10.98.241.45:32950" 6) "" 4) 1) (integer) 28 2) (integer) 1705632610 3) (integer) 9249923 4) 1) "HKEYS" 2) "60cc7ee13996d44df8cfd013352532f7" 5) "10.98.241.45:39032" 6) "" 5) 1) (integer) 27 2) (integer) 1705625410 3) (integer) 9683489 4) 1) "HKEYS" 2) "60cc7ee13996d44df8cfd013352532f7" 5) "10.98.241.45:48074" 6) ""
Please advise if you need more information
Thanks in advance for your help
**Comment From: Choogster1**
Have you had any progress with solving this as we're having exactly the same problem on our system
**Comment From: sundb**
@olivierbeytrison you can use module api `RM_Yeild` time to time in your cache_cron_task to avoid being reconized as down.
**Comment From: Choogster1**
@sundb Can you explain exactly how to do that, as I can't find anything about it online? do you have example code?
**Comment From: sundb**
@Choogster1 call RedisModule_Yield() in the mid of heavy job in your module.
please also ref to the module test: https://github.com/redis/redis/blob/fa46aa4d85444c32de909dacfdf47c153537bd70/tests/modules/datatype.c#L38-L45
**Comment From: Choogster1**
@sundb there is no way of modifying the cron job in moodle in that way
**Comment From: sundb**
@Choogster1 please elaborate your demand, or give more specific information.
**Comment From: Choogster1**
@sundb the cache_cron_task is built into moodle and runs as a scheduled task within the system, there is no way of changing any of the code for the cache_cron_task and moodle supports redis as 'officially supported' which means it can't be updated until a later moodle build
**Comment From: sundb**
@Choogster1 which module are you using?
**Comment From: olivierbeytrison**
I traced the php code in moodle from this cron task.
It goes through this helper : https://github.com/moodle/moodle/blob/main/cache/classes/helper.php#L737
which, for a redis cache store, will result in using those functions : https://github.com/moodle/moodle/blob/main/cache/stores/redis/lib.php
and this seems sub-optimal because it will search all keys in the redis server then see if it match the session prefix then see if lastaccess is older than the timeout then delete the keys.
but still, this shouldn't trigger à failover by sentinel.
**Comment From: sundb**
@Choogster1 @olivierbeytrison ohh, sorry, i saw `moodle` as `module`.
`HKEYS` command consume 142 seconds (8548895 microseconds), which will make the master down.
1) 1) (integer) 31 2) (integer) 1705654209 3) (integer) 8548895 4) 1) "HKEYS" 2) "60cc7ee13996d44df8cfd013352532f7" 5) "10.98.241.46:53394"
seems related to this php code:
// Get all of the keys. $keys = $store->find_by_prefix(cache_session::KEY_PREFIX); ```
Comment From: olivierbeytrison
yeah I guess I need to split the session redis cache store from the global cache store. both user DB 0, and that DB contains 30k keys for over 5GB of data.
If the sessions are stored in their dedicated DB then the function would perform a lot faster.
I'll still open an issue on Moodle's repo and link it to here because there's something to do on their side too I guess.
Comment From: sundb
@olivierbeytrison i suggest you to make a issue in moodle which should use HSCAN instead of HKEYS.
Comment From: olivierbeytrison
Thanks @sundb for your help.
Reported to the moodle team : https://tracker.moodle.org/browse/MDL-82511