Crash report
system info:
Docker deployment+playbook
docker version: 5.0.6
node conf
Redis configuration file templates.
bind 10.240.60.125 127.0.0.1
protected-mode no
port 7100
tcp-backlog 511
timeout 0
tcp-keepalive 60
daemonize yes
supervised no
pidfile /data/br/base/redis/redis-logs/7100.pid
loglevel notice
logfile "/data/br/base/redis/logs/7100.log"
databases 16
always-show-logo yes
save 900 1 save 300 10 save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump7100.rdb
dir /data/br/base/redis/redis-logs/
masterauth bonree_ten_years
replica-serve-stale-data yes
replica-read-only yes
repl-diskless-sync no
repl-diskless-sync-delay 5
repl-disable-tcp-nodelay no
replica-priority 100
requirepass bonree_ten_years
maxclients 65535
maxmemory 4000mb
maxmemory-policy volatile-ttl
maxmemory-samples 100000
lazyfree-lazy-eviction no lazyfree-lazy-expire no lazyfree-lazy-server-del no replica-lazy-flush no
appendonly no
appendfilename "appendonly7100.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 80-100 auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
aof-use-rdb-preamble no
lua-time-limit 5000
cluster-enabled yes
cluster-config-file /data/br/base/redis/redis-logs/nodes-7100.conf
cluster-node-timeout 15000
cluster-migration-barrier 3
cluster-require-full-coverage no
slowlog-log-slower-than 500000
slowlog-max-len 128
latency-monitor-threshold 0
notify-keyspace-events ""
hash-max-ziplist-entries 512 hash-max-ziplist-value 64
list-max-ziplist-entries 512 list-max-ziplist-value 64
list-compress-depth 0
set-max-intset-entries 512
zset-max-ziplist-entries 128 zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
stream-node-max-bytes 4096 stream-node-max-entries 100
activerehashing yes
client-output-buffer-limit normal 0 0 0 client-output-buffer-limit replica 0 512mb 0 client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
dynamic-hz yes
aof-rewrite-incremental-fsync yes
rdb-save-incremental-fsync yes
rename-command KEYS "BR-KEYS"
rename-command FLUSHALL "BR-FLUSHALL"
rename-command CONFIG "BR-CONFIG"
rename-command FLUSHDB "BR-FLUSHDB"
loadmodule /data/br/base/redis/rebloom.so
masterauth Removed requirepass Removed
slave-serve-stale-data yes slave-read-only yes repl-timeout 1800 repl-backlog-size 300mb slave-priority 100
log 1:
18:C 06 Nov 2020 06:43:30.069 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo 18:C 06 Nov 2020 06:43:30.069 # Redis version=5.0.6, bits=64, commit=00000000, modified=0, pid=18, just started 18:C 06 Nov 2020 06:43:30.069 # Configuration loaded 19:M 06 Nov 2020 06:43:30.073 # You requested maxclients of 65535 requiring at least 65567 max file descriptors. 19:M 06 Nov 2020 06:43:30.074 # Server can't set maximum open files to 65567 because of OS error: Operation not permitted. 19:M 06 Nov 2020 06:43:30.074 # Current maximum open files is 65536. maxclients has been reduced to 65504 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
log2
19:M 23 Jul 2021 12:45:41.411 # Cluster state changed: fail 19:M 23 Jul 2021 12:45:41.598 # Failover auth denied to 2eca42122ffa83474b053fc3dfd7512b5c77f10c: its master is up 19:M 23 Jul 2021 12:45:41.598 # Configuration change detected. Reconfiguring myself as a replica of 2eca42122ffa83474b053fc3dfd7512b5c77f10c 19:S 23 Jul 2021 12:45:41.617 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer. 19:S 23 Jul 2021 12:45:41.617 # Address updated for node 2eca42122ffa83474b053fc3dfd7512b5c77f10c, now ?:7200 19:S 23 Jul 2021 12:45:41.620 # Address updated for node 780349291fb0862e78496aaaa29fa086bd5bb2d9, now ?:7100 19:S 23 Jul 2021 12:45:41.622 # Address updated for node 67c2aaf1e7a3b90dc0860dde521a8de338689615, now ?:7200 19:S 23 Jul 2021 12:45:47.290 * Connecting to MASTER ?:7200
A large number of log
19:S 23 Jul 2021 12:45:47.290 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:45:47.397 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:45:49.144 # Cluster state changed: ok 19:S 23 Jul 2021 12:45:49.807 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:45:49.808 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:45:50.813 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:45:50.813 # Unable to connect to MASTER: Invalid argument
Other log
19:S 23 Jul 2021 12:59:51.518 * FAIL message received from 67c2aaf1e7a3b90dc0860dde521a8de338689615 about 2eca42122ffa83474b053fc3dfd7512b5c77f10c
19:S 24 Jul 2021 02:07:38.540 * FAIL message received from 67c2aaf1e7a3b90dc0860dde521a8de338689615 about 2eca42122ffa83474b053fc3dfd7512b5c77f10c
19:S 24 Jul 2021 02:16:50.034 * FAIL message received from 67c2aaf1e7a3b90dc0860dde521a8de338689615 about 2eca42122ffa83474b053fc3dfd7512b5c77f10c
19:S 27 Jul 2021 13:41:07.311 * FAIL message received from 67c2aaf1e7a3b90dc0860dde521a8de338689615 about 2eca42122ffa83474b053fc3dfd7512b5c77f10c
19:S 24 Jul 2021 02:27:58.057 # Currently unable to failover: Disconnected from master for longer than allowed. Please check the 'cluster-replica-validity-factor' configuration option.
19:S 24 Jul 2021 02:29:04.314 * Clear FAIL state for node 2eca42122ffa83474b053fc3dfd7512b5c77f10c: is reachable again and nobody is serving its slots after some time.
19:M 25 Jun 2021 08:34:33.683 * Replica 10.240.60.121:7100 asks for synchronization 19:M 25 Jun 2021 08:34:33.684 * Partial resynchronization request from 10.240.60.121:7100 accepted. Sending 85298 bytes of backlog starting from offset 526836730405. 19:M 25 Jun 2021 08:34:34.169 # Connection with replica client id #4242186 lost. 19:M 25 Jun 2021 08:34:34.267 * Replica 10.240.60.121:7100 asks for synchronization 19:M 25 Jun 2021 08:34:34.267 * Unable to partial resync with replica 10.240.60.121:7100 for lack of backlog (Replica request was: 493082710443). 19:M 25 Jun 2021 08:34:34.267 * Can't attach the replica to the current BGSAVE. Waiting for next BGSAVE for SYNC 19:M 25 Jun 2021 08:34:34.687 * Replica 10.240.60.121:7100 asks for synchronization 19:M 25 Jun 2021 08:34:34.687 * Partial resynchronization request from 10.240.60.121:7100 accepted. Sending 36586 bytes of backlog starting from offset 526836849054. 19:M 25 Jun 2021 08:34:35.093 # Connection with replica client id #4242187 lost. 19:M 25 Jun 2021 08:34:35.172 # Connection with replica client id #4242188 lost. 19:M 25 Jun 2021 08:34:36.694 * Replica 10.240.60.121:7100 asks for synchronization 19:M 25 Jun 2021 08:34:36.695 * Partial resynchronization request from 10.240.60.121:7100 accepted. Sending 112810 bytes of backlog starting from offset 526836912307. 19:M 25 Jun 2021 08:34:37.183 # Connection with replica client id #4242189 lost. 19:M 25 Jun 2021 08:34:38.703 * Replica 10.240.60.121:7100 asks for synchronization
19:M 25 Jun 2021 08:34:30.221 # Connection with replica client id #2670910 lost. 19:M 25 Jun 2021 08:34:31.381 * Replica 10.240.60.121:7200 asks for synchronization 19:M 25 Jun 2021 08:34:31.383 * Partial resynchronization request from 10.240.60.121:7200 accepted. Sending 264412 bytes of backlog starting from offset 542537658706. 19:M 25 Jun 2021 08:34:31.460 # Connection with replica client id #2670911 lost. 19:M 25 Jun 2021 08:34:32.474 * Replica 10.240.60.121:7200 asks for synchronization 19:M 25 Jun 2021 08:34:32.474 * Unable to partial resync with replica 10.240.60.121:7200 for lack of backlog (Replica request was: 509196026344). 19:M 25 Jun 2021 08:34:32.474 * Can't attach the replica to the current BGSAVE. Waiting for next BGSAVE for SYNC 19:M 25 Jun 2021 08:34:33.386 * Replica 10.240.60.121:7200 asks for synchronization 19:M 25 Jun 2021 08:34:33.386 * Partial resynchronization request from 10.240.60.121:7200 accepted. Sending 159738 bytes of backlog starting from offset 542537925502. 19:M 25 Jun 2021 08:34:33.471 # Connection with replica client id #2670913 lost. 19:M 25 Jun 2021 08:34:34.535 # Connection with replica client id #2670912 lost. 19:M 25 Jun 2021 08:34:35.486 * Replica 10.240.60.121:7200 asks for synchronization 19:M 25 Jun 2021 08:34:35.487 * Partial resynchronization request from 10.240.60.121:7200 accepted. Sending 162141 bytes of backlog starting from offset 542538090099. 19:M 25 Jun 2021 08:34:36.180 # Connection with replica client id #2670914 lost.
Frequency of downtime: sometimes during the day, sometimes at night, interval time range of 8-12 hours, but not fixed time
Trouble help check, thanks
Comment From: madolson
Hello. Cluster fail is caused when part of the cluster is no longer up and serving traffic. It seems like in your logs indicate a failover occurred, so that was probably when failure occurred. I'm not sure if there is something else that is specifically broken here. If you can add more detail that would be great!
Comment From: Fanzaijun
1.How to solve IP identification as a question mark?
19:S 23 Jul 2021 12:45:47.290 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:45:47.397 # Unable to connect to MASTER: Invalid argument
- All of these logs, one node printed for days on end。 Does it affect performance?
3.This is the first error
19:M 23 Jul 2021 12:33:22.775 * Background saving started by pid 31417 31417:C 23 Jul 2021 12:34:08.480 * DB saved on disk 31417:C 23 Jul 2021 12:34:08.571 * RDB: 255 MB of memory used by copy-on-write 19:M 23 Jul 2021 12:45:41.410 * Background saving terminated with success 19:M 23 Jul 2021 12:45:41.411 # Cluster state changed: fail 19:M 23 Jul 2021 12:45:41.598 # Failover auth denied to 2eca42122ffa83474b053fc3dfd7512b5c77f10c: its master is up 19:M 23 Jul 2021 12:45:41.598 # Configuration change detected. Reconfiguring myself as a replica of 2eca42122ffa83474b053fc3dfd7512b5c77f10c 19:S 23 Jul 2021 12:45:41.617 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer. 19:S 23 Jul 2021 12:45:41.617 # Address updated for node 2eca42122ffa83474b053fc3dfd7512b5c77f10c, now ?:7200 19:S 23 Jul 2021 12:45:41.620 # Address updated for node 780349291fb0862e78496aaaa29fa086bd5bb2d9, now ?:7100 19:S 23 Jul 2021 12:45:41.622 # Address updated for node 67c2aaf1e7a3b90dc0860dde521a8de338689615, now ?:7200 19:S 23 Jul 2021 12:45:47.290 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:45:47.397 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:45:49.144 # Cluster state changed: ok 19:S 23 Jul 2021 12:45:49.807 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:45:49.808 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:45:50.813 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:45:50.813 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:45:51.816 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:45:51.817 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:45:54.338 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:45:54.338 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:45:59.813 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:45:59.813 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:00.818 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:00.818 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:01.824 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:01.824 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:02.828 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:02.829 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:03.833 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:03.834 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:04.838 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:04.839 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:08.934 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:08.934 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:09.938 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:09.938 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:10.943 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:10.944 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:11.948 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:11.949 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:12.952 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:12.952 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:13.956 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:13.957 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:16.967 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:16.967 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:17.973 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:17.973 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:18.977 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:18.978 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:19.982 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:19.983 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:20.988 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:20.988 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:24.179 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:24.179 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:25.183 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:25.183 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:26.186 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:26.186 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:27.189 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:27.189 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:28.193 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:28.193 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:29.198 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:29.198 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:30.202 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:30.202 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:31.209 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:31.209 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:32.214 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:32.214 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:33.219 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:33.219 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:36.393 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:36.393 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:37.397 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:37.398 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:38.399 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:38.399 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:39.404 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:39.404 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:40.407 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:40.407 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:41.409 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:41.409 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:42.413 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:42.413 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:43.417 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:43.417 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:44.419 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:44.420 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:45.424 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:45.424 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:46.428 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:46.428 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:47.432 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:47.433 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:48.436 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:48.436 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:49.441 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:49.441 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:50.446 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:50.446 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:51.449 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:51.449 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:52.452 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:52.452 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:53.455 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:53.456 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:54.459 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:54.460 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:55.465 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:55.465 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:56.466 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:56.466 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:57.470 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:57.470 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:58.473 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:58.473 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:46:59.476 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:46:59.477 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:00.478 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:00.478 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:01.482 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:01.482 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:02.487 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:02.487 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:03.491 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:03.491 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:04.494 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:04.494 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:05.499 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:05.499 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:06.502 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:06.502 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:07.505 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:07.506 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:08.508 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:08.508 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:09.513 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:09.513 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:10.517 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:10.518 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:11.523 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:11.524 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:12.530 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:12.530 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:13.535 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:13.536 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:14.539 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:14.540 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:15.544 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:15.544 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:16.549 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:16.549 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:17.553 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:17.553 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:18.558 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:18.558 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:19.563 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:19.563 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:20.568 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:20.568 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:21.574 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:21.574 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:22.577 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:22.577 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:23.581 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:23.582 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:24.587 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:24.587 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:25.591 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:25.591 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:26.595 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:26.595 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:27.601 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:27.602 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:28.607 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:28.607 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:29.612 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:29.612 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:30.616 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:30.616 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:31.621 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:31.621 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:32.625 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:32.625 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:33.629 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:33.629 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:34.634 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:34.635 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:35.640 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:35.640 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:36.644 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:36.644 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:37.647 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:37.648 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:38.652 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:38.653 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:39.655 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:39.655 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:40.658 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:40.658 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:41.663 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:41.663 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:42.668 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:42.668 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:43.670 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:43.671 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:44.673 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:44.673 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:45.675 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:45.675 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:46.679 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:46.679 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:47.684 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:47.684 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:48.688 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:48.689 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:49.692 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:49.692 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:50.696 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:50.696 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:51.701 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:51.702 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:52.706 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:52.706 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:53.711 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:53.711 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:54.714 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:54.714 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:55.717 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:55.718 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:56.722 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:56.723 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:57.726 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:57.727 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:58.731 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:58.731 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:47:59.735 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:47:59.736 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:00.739 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:00.739 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:01.743 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:01.744 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:02.751 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:02.751 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:03.755 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:03.756 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:04.760 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:04.760 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:05.763 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:05.763 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:06.766 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:06.766 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:07.770 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:07.770 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:08.777 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:08.777 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:09.782 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:09.782 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:10.789 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:10.789 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:11.794 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:11.795 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:12.798 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:12.799 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:13.801 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:13.802 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:14.804 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:14.804 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:15.806 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:15.806 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:16.808 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:16.808 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:17.810 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:17.810 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:18.813 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:18.813 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:19.816 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:19.816 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:20.821 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:20.821 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:21.823 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:21.823 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:22.824 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:22.825 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:23.828 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:23.828 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:24.831 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:24.832 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:25.835 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:25.835 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:26.841 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:26.841 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:27.844 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:27.844 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:28.852 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:28.852 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:29.856 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:29.856 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:30.860 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:30.860 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:31.865 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:31.865 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:32.870 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:32.870 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:33.873 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:33.873 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:34.878 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:34.878 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:35.882 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:35.882 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:36.887 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:36.888 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:37.892 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:37.892 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:38.896 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:38.896 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:39.900 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:39.900 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:40.903 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:40.903 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:41.906 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:41.906 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:42.911 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:42.911 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:43.914 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:43.915 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:44.919 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:44.919 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:45.923 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:45.923 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:46.927 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:46.927 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:47.931 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:47.931 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:48.936 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:48:48.936 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:48:49.313 * FAIL message received from a8af1b3a1ecc81302f8566d98b1fe41fb85026f7 about 2eca42122ffa83474b053fc3dfd7512b5c77f10c 19:S 23 Jul 2021 12:49:10.028 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:49:10.029 # Unable to connect to MASTER: Invalid argument 19:S 23 Jul 2021 12:49:11.032 * Connecting to MASTER ?:7200 19:S 23 Jul 2021 12:49:11.032 # Unable to connect to MASTER: Invalid argument
Comment From: Fanzaijun
This is a failover
9:S 23 Jul 2021 12:31:48.116 * Background saving started by pid 4843 19:S 23 Jul 2021 12:32:28.502 * FAIL message received from 314eb29f08b44f86fc7c7f820f783f1504421570 about 827e0ffad73bf0a41f7ee7513a02e01b951230e9 19:S 23 Jul 2021 12:32:28.571 # Start of election delayed for 916 milliseconds (rank #0, offset 709267512123). 19:S 23 Jul 2021 12:32:29.580 # Starting a failover election for epoch 14. 19:S 23 Jul 2021 12:32:29.652 # Failover election won: I'm the new master. 19:S 23 Jul 2021 12:32:29.653 # configEpoch set to 14 after successful failover 19:M 23 Jul 2021 12:32:29.653 # Setting secondary replication ID to 4fe2e9d7d5b614cfc5a7b5fc70ef3e99696a730e, valid up to offset: 709267516065. New replication ID is 334092581116a0d94d9e63f33e77540377eb0243 19:M 23 Jul 2021 12:32:29.653 # Connection with master lost. 19:M 23 Jul 2021 12:32:29.653 * Caching the disconnected master state. 19:M 23 Jul 2021 12:32:29.653 * Discarding previously cached master state. 4843:C 23 Jul 2021 12:32:31.309 * DB saved on disk 4843:C 23 Jul 2021 12:32:31.382 * RDB: 331 MB of memory used by copy-on-write 19:M 23 Jul 2021 12:32:31.580 * Background saving terminated with success
Comment From: Fanzaijun
@madolson Can you help me analyze it carefully? What information do you need? Trouble help check, thanks !
Comment From: madolson
Hello, so this is interesting:
19:S 23 Jul 2021 12:45:41.617 # Address updated for node 2eca42122ffa83474b053fc3dfd7512b5c77f10c, now ?:7200
19:S 23 Jul 2021 12:45:41.620 # Address updated for node 780349291fb0862e78496aaaa29fa086bd5bb2d9, now ?:7100
19:S 23 Jul 2021 12:45:41.622 # Address updated for node 67c2aaf1e7a3b90dc0860dde521a8de338689615, now ?:7200
It seems like everyone decided that they now have a new port. Do you guys use cluster announce IPs by any chance (I didn't see it in the configs, but thought I would double check)?
Other than that, it seems like there might be some type of routing issue where everyone was no longer to discover the IP addresses to talk to each other. Would you mind briefly describing your networking setup?
Also, I removed the passwords from your post, just in case.
Comment From: Fanzaijun
@madolson Thanks!
Announce IPs is not used.
No other configuration.
Do I need to set it up? However, it has not been set before the failure, and the cluster is not a problem?
Another question:
log:
7622:C 28 Jul 2021 13:45:55.213 * RDB: 2013 MB of memory used by copy-on-write
is it normal that this COW is so big?
Now the cluster is failing every day,
The IP address is updated after the crash
Trouble help check, thanks !
Comment From: madolson
The CoW size seems okay to me. You don't need to set the announce IP, it just might have caused the issue.
Still want to follow up about how your cluster is set up. Clusters learn about their own IP based on what others connect to, so it seems like somehow that process is getting messed up. So it would be helpful to understand if there are are load balancers or some other types of routers in your setup.
Comment From: madolson
@ny0312 You resolved an issue where the ? was being set, did we open source that or was that just an AWS issue?
Comment From: bjosv
@madolson We have seen similar issues where we have some kind of network problems and get logs like:
15:S 27 Apr 2022 00:38:49.703 # Cluster state changed: fail
15:S 27 Apr 2022 00:38:49.999 # Address updated for node f55db6b9314afec8816cbb258cf8a8358ad1718c, now ?:6380
15:S 27 Apr 2022 00:38:49.999 # Address updated for node c16a99900ec94d8bfd4dc4e1b0c76340f7e853a6, now ?:6380
15:S 27 Apr 2022 00:38:49.999 # Address updated for node 530b194579251e73473b6c9f94b2ce79ab1ad7eb, now ?:6380
We are running Redis 6.2, but not on AWS.
After looking a bit at the code to see why we suddenly would get ? I have seen that
https://github.com/redis/redis/blob/ff3a3577f29a07e965b3b38d07894410d28da499/src/cluster.c#L1780
is used to get the new IP, either from the header or via the fd.
Since we are not using cluster-announce-ip it would attempt to get the name via getpeername()
That is: nodeIp2String() calls connPeerToString() which calls anetFdToString() and if getpeername() fails to get the name for some reason, the ip will be set in the error case:
https://github.com/redis/redis/blob/ff3a3577f29a07e965b3b38d07894410d28da499/src/anet.c#L607-L611
Maybe there is a need to forward the return code from connPeerToString() and check for failure before updating the node->ip?
https://github.com/redis/redis/blob/ff3a3577f29a07e965b3b38d07894410d28da499/src/cluster.c#L1743-L1750
I'm not sure about the root cause why getpeername() would fail here, perhaps the fd is invalid somehow or maybe the link was quickly shut down again (i.e. link->conn==NULL, using fd=-1),
We also see logs that there are attempts to connect using the faulty ip, like: Connecting to MASTER ?:6380
Any ideas?
Comment From: uvletter
I think we could log the error in nodeIp2String(), and forward the return value outside. In nodeUpdateAddressIfNeeded() it's ok to skip the update when nodeIp2String() fails, as in the next round of ping/pong this routine will retry
Comment From: pieturin
We did get the same crash in some of our instances in AWS.
@ny0312 was able to identify the root cause:
Symptom
This issue manifests with 4 symptoms.
1. This the most direct symptom.
Replicas in cluster mode enabled clusters and repeatedly failing to sync with their masters with the same error message.
5758:S 27 Sep 2021 08:16:03.511 * Connecting to MASTER ?:6379
5758:S 27 Sep 2021 08:16:03.512 # Unable to connect to MASTER: No such file or directory
2. Error messages in 1. all started after a cluster bus event similar to this:
5758:S 27 Sep 2021 08:12:01.356 # Address updated for node bc5d7d5a9f231236e1670523fd783fa9a32e2e74, now ?:6379
where bc5d7d5a9f231236e1670523fd783fa9a32e2e74 is Redis' master.
3. Redis main thread should be completely stuck before the symptom in 2. appears. In the example ticket, Redis was completely stuck when trying to save full sync RDB on AWS EBS volume when EBS was having availability issues (we use EBS to store the RDB file). But Redis could also be stuck for many other reasons.
Take those logs as an example:
5758:M 27 Sep 2021 02:52:52.819 # Configuration change detected. Reconfiguring myself as a replica of bc5d7d5a9f231
236e1670523fd783fa9a32e2e74
5758:S 27 Sep 2021 02:52:52.819 # Connection with replica 10.37.0.123:6379 lost.
5758:S 27 Sep 2021 02:52:52.819 * Before turning into a replica, using my master parameters to synthesize a cached
master: I may be able to synchronize with the new master with just a partial transfer.
5758:S 27 Sep 2021 02:52:53.439 * Connecting to MASTER 10.31.1.219:6379
5758:S 27 Sep 2021 02:52:53.440 * MASTER <-> REPLICA sync started
5758:S 27 Sep 2021 02:52:53.440 * Non blocking connect for SYNC fired the event.
5758:S 27 Sep 2021 02:52:53.441 * Master replied to PING, replication can continue...
...
5758:S 27 Sep 2021 02:53:23.286 * Full resync from master: 5db565a0ddf4b3c8d6eedcbc9aead76fb7400fe2:25270832208423
5758:S 27 Sep 2021 02:53:23.286 * Discarding previously cached master state.
5758:S 27 Sep 2021 02:53:23.342 * MASTER <-> REPLICA sync: receiving streamed RDB from master
5758:S 27 Sep 2021 08:11:59.905 # Address updated for node e0b411af4266e8d07bad7f7ee723da5576ab90af, now ?:6379
5758:S 27 Sep 2021 08:12:00.394 # Address updated for node edcab711cff131c22a1513607ebb0b5c38f061e8, now ?:6379
5758:S 27 Sep 2021 08:12:00.859 # Address updated for node b6b852659c22dae7a488a16f8c82473860b55db0, now ?:6379
5758:S 27 Sep 2021 08:12:01.356 # Address updated for node bc5d7d5a9f231236e1670523fd783fa9a32e2e74, now ?:6379
Couple things to note about the above log messages:
- Redis is replica of
bc5d7d5a9f231236e1670523fd783fa9a32e2e74and had a valid IP of its master around27 Sep 2021 02:52 - Redis went completely radio silent from
27 Sep 2021 02:53:23until27 Sep 2021 08:11:59. - Right after revival, Redis updated its master's IP to ? while processing cluster bus message. (at
27 Sep 2021 08:12:01.356)
On its peers, log messages can be seen:
5816:M 27 Sep 2021 02:53:44.338 * FAIL message received from 6fed0bf1e7e1ad372f41dc426817dfd5755c6d98 about d51e20f752db35be3f00a4253d87173fee33683a
5816:M 27 Sep 2021 08:16:35.291 * Clear FAIL state for node d51e20f752db35be3f00a4253d87173fee33683a: replica is reachable again.
These peer logs indicate Redis wasn't responsive to its peers from 27 Sep 2021 02:53:44 to 27 Sep 2021 08:16:35.291, which exactly matches the time period where it went radio silent.
4. There should be as least one more peer in the cluster other than the problematic replica and its primary.
Root cause
I believe the following sequence of events led to Redis set its master IP to ?:
1. Redis receives cluster bus PING message from its master.
2. Before processing that message, Redis freezes for a long time for whatever reason (e.g. due to AWS EBS storage outage in our case).
3. Redis unfreezes, resume processing that cluster bus PING message from its master from couple hours ago.
4. As part of normal cluster bus PING message processing, Redis tries to see if it needs to update IP addresses of the sender here. Call trace:
* nodeIp2String
* connPeerToString
* anetFdToString
* And eventually some socket APIs such as getpeername, inet_ntop inside anetFdToString
5. Because that cluster bus message was sent couple hours ago, its socket/connection has been long gone. Therefore one of those socket APIs failed, resulting in Redis setting its master's IP to ? here. Because when calling connPeerToString from cluster.c, port argument is not passed, Redis didn't reset its master port to 0 here too. This is why ?:6379 is printed in Redis log messages.
6. As aforementioned, Redis updates peers' IPs all the time based on new incoming cluster bus messages. Once the stuck Redis unfreezes, its primary would reconnect to it and start gossiping to it again. If all goes well, as soon as Redis processes the first new messages from its primary after it has unfrozen, it should be able to correct its view of its primary's IP, changing it from ? back to the valid IP following the same code path described in (4). But there is a race condition where this may not happen:
a. clusterCron is run 10 times per second. It marks a peer as PFAIL if it hasn't been responsive for more than cluster-node-timeout (default to 15s). Once Redis unfreezes, the next time clusterCron run, it will mark its primary as PFAIL because it hasn't processed any message from its primary for a much longer time (because it was stuck itself, not its primary's fault).
b. Now, if Redis receives a cluster bus message from a peer whose gossip section contains an entry about its primary, it will updates its PFAIL primary's IP from ? back to a valid IP, without correcting replication info such as server.masterhost accordingly.
c. From this point on, even if Redis receives a new cluster bus message from its primary, because it already knows its correct IP and address from (6.b), it will never correct its replication info, leaving its server.masterhost as ? for perpetuity.
7. This race condition doesn't always happen. It only happens if (6.a), (6.b), and (6.c) happens exactly in this order. This is probably why this persistent ? symptom didn't happen more during the EBS LSE.
We were able to successfully reproduce this issue locally by simulating these steps (simulate Redis stuck with gdb). In my reproduction, getpeername returned ENOTCONN, which led to master IP being set to ?. After multiple tries, the race condition hit, and Redis left server.masterhost as ? indefinitely. On my linux machine I had to leave the process stuck for 10 minutes in order for the OS to close the socket.