When one of the master instances reaches maxmemory, it is marked as failed in the logs/metrics. Replicas and clients also lose their connections, and once the timeout is reached a failover starts. All keys have an expiry set.
maxmemory 5gb
maxmemory-policy volatile-lru
Redis 6.0.12
Environment: Ubuntu 18.04.5 & RHEL 8.3
There is nothing much in the logs; it looks the same as if the node had been killed or its connections blocked (iptables/nftables):
Cluster state changed: fail
FAIL message received from 339c746e9d1fc2f0e06899d2bc7578566a085ed8 about d7e84063daa67c5de95aa8b089bbb94a76455d1d
Logs from failed master:
Mar 25 10:24:04 s2 redis-6479[6880]: Failover auth denied to 5daf636eeac4ff923e96641d677aed45f2dfef7a: its master is up
Mar 25 10:24:04 s2 redis-6479[6880]: Connection with replica 10.1.1.14:6379 lost.
Mar 25 10:24:04 s2 redis-6479[6880]: Configuration change detected. Reconfiguring myself as a replica of 5daf636eeac4ff923e96641d677aed45f2dfef7a
Mar 25 10:24:04 s2 redis-6479[6880]: Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
Mar 25 10:24:04 s2 redis-6479[6880]: Connecting to MASTER 10.1.1.14:6379
Mar 25 10:24:04 s2 redis-6479[6880]: MASTER <-> REPLICA sync started
Mar 25 10:24:04 s2 redis-6479[6880]: Non blocking connect for SYNC fired the event.
Mar 25 10:24:04 s2 redis-6479[6880]: Master replied to PING, replication can continue...
Mar 25 10:24:04 s2 redis-6479[6880]: Trying a partial resynchronization (request 13b9dd91588a3a1b81036bc60bd83326ae378285:2540939890827).
Mar 25 10:24:05 s2 redis-6479[6880]: Full resync from master: 07694bc69ca1ee58fccff82fa2a797e295e85775:2540940155658
Mar 25 10:24:05 s2 redis-6479[6880]: Discarding previously cached master state.
Mar 25 10:24:05 s2 redis-6479[6880]: MASTER <-> REPLICA sync: receiving streamed RDB from master with EOF to disk
Mar 25 10:25:12 s2 redis-6479[6880]: MASTER <-> REPLICA sync: Flushing old data
Mar 25 10:25:29 s2 redis-6479[6880]: MASTER <-> REPLICA sync: Loading DB in memory
....
All uncommented lines from the conf:
################################## NETWORK #####################################
bind 10.1.1.23
protected-mode no
port 6379
tcp-backlog 4096
timeout 0
tcp-keepalive 300
################################# GENERAL #####################################
daemonize yes
supervised no
pidfile /var/run/redis-6379/redis-server.pid
loglevel notice
syslog-enabled yes
syslog-ident redis-6379
syslog-facility local0
always-show-logo yes
################################ SNAPSHOTTING ################################
save ""
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
rdb-del-sync-files no
dir /var/lib/redis/6379
################################# REPLICATION #################################
replica-serve-stale-data yes
replica-read-only yes
repl-diskless-sync yes
repl-diskless-sync-delay 0
repl-diskless-load disabled
repl-disable-tcp-nodelay no
repl-backlog-size 32mb
replica-priority 100
############################## MEMORY MANAGEMENT ################################
maxmemory 5gb
maxmemory-policy volatile-lru
############################# LAZY FREEING ####################################
lazyfree-lazy-eviction no
lazyfree-lazy-expire no
lazyfree-lazy-server-del no
replica-lazy-flush no
lazyfree-lazy-user-del no
############################## APPEND ONLY MODE ###############################
appendonly no
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
aof-use-rdb-preamble yes
################################ LUA SCRIPTING ###############################
lua-time-limit 5000
################################ REDIS CLUSTER ###############################
cluster-enabled yes
cluster-config-file nodes-6379.conf
cluster-node-timeout 4000
cluster-replica-validity-factor 0
cluster-migration-barrier 1
################################## SLOW LOG ###################################
slowlog-log-slower-than 10000
slowlog-max-len 128
################################ LATENCY MONITOR ##############################
latency-monitor-threshold 0
############################# EVENT NOTIFICATION ##############################
notify-keyspace-events ""
############################### ADVANCED CONFIG ###############################
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-size -2
list-compress-depth 0
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
stream-node-max-bytes 4096
stream-node-max-entries 100
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit replica 1024mb 256mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
dynamic-hz yes
aof-rewrite-incremental-fsync yes
rdb-save-incremental-fsync yes
########################### ACTIVE DEFRAGMENTATION #######################
jemalloc-bg-thread yes
Comment From: madolson
Is this a consistent failure or did this just happen once? Nothing looks particularly out of place here outside of a failure. Was there anything interesting in the slowlog that might have indicated why the failover took place?
It would be nice to have the logs from the other nodes, the ones that are unable to talk with this master when they believe it has failed.
Comment From: Lathanderjk
It looks like it is caused by the eviction (DEL) itself. With lazyfree-lazy-eviction yes, the failure did not happen.
Comment From: madolson
Do you have large collections in your Redis instance? One possible answer is that it spent more than 4000 ms (your cluster-node-timeout) evicting a key. Using the lazy-free options would be a possible solution there. You could try running LATENCY DOCTOR to see if it has some useful data.
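A minimal sketch of how those two suggestions might look in the redis.conf posted above. Note that the posted config has latency-monitor-threshold 0, which disables the latency monitor, so LATENCY DOCTOR would have no data until a threshold is set. The 100 ms value is only an illustrative assumption, not a value from this thread.

```
############################# LAZY FREEING ####################################
# Free the memory of evicted keys in a background thread instead of
# blocking the main thread (the posted config has this set to "no").
lazyfree-lazy-eviction yes

################################ LATENCY MONITOR ##############################
# Record events that take longer than this many milliseconds so that
# LATENCY DOCTOR has samples to report on. 0 disables the monitor.
# 100 is an assumed example value; tune it to your workload.
latency-monitor-threshold 100
```

Both settings can also be applied to a running node with CONFIG SET, which avoids a restart while testing whether the failovers stop.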