Redis 4.0, cluster mode, AOF enabled on the replica.

An error occurred while the replica was doing an AOF rewrite. I think the child process timed out while writing a pipe message to the parent, or while reading a pipe message from the parent, but I don't know how it happened.

Here are some slave logs:

09:17:15.539 * Starting automatic rewriting of AOF on 200% growth
09:17:16.372 * Background append only file rewriting started by pid 115621
09:22:27.490 # Write error writing append only file on disk: Connection timed out
09:23:17.070 # Background AOF rewrite terminated with error
09:23:17.070 # MASTER timeout: no data nor PING received...
09:23:17.070 # Connection with master lost.
09:23:17.071 * Caching the disconnected master state.
09:23:17.071 * Connecting to MASTER 10.58.101.1:6379
09:23:17.071 * MASTER <-> SLAVE sync started
09:23:17.074 * Non blocking connect for SYNC fired the event.
09:23:17.075 * Master replied to PING, replication can continue...

Master logs:

09:23:12.023 # Disconnecting timedout slave: 10.58.102.35:6379
09:23:12.023 # Connection with slave 10.58.102.35:6379 lost.
09:23:17.079 * Slave 10.58.102.35:6379 asks for synchronization
09:23:17.086 * Partial resynchronization request from 10.58.102.35:6379 accepted. Sending 34594726 bytes of backlog starting from offset 3571238765911.

Slave INFO:

Clients

connected_clients:3
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

Memory

used_memory:21891413280
used_memory_human:20.39G
used_memory_rss:32035143680
used_memory_rss_human:29.84G
used_memory_peak:34478016992
used_memory_peak_human:32.11G
used_memory_peak_perc:63.49%
used_memory_overhead:3969666326
used_memory_startup:6273792
used_memory_dataset:17921746954
used_memory_dataset_perc:81.89%
total_system_memory:809998716928
total_system_memory_human:754.37G
used_memory_lua:37888
used_memory_lua_human:37.00K
maxmemory:34359738368
maxmemory_human:32.00G
maxmemory_policy:volatile-lru
mem_fragmentation_ratio:1.46
mem_allocator:jemalloc-4.0.3
active_defrag_running:0
lazyfree_pending_objects:0

Persistence

loading:0
rdb_changes_since_last_save:472020780
rdb_bgsave_in_progress:0
rdb_last_save_time:1611632557
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:319
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:2862456832
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:986
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:8377651200
aof_current_size:26906727515
aof_base_size:10904762750
aof_pending_rewrite:0
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:1

Stats

total_connections_received:1003
total_commands_processed:41937880020
instantaneous_ops_per_sec:9353
total_net_input_bytes:3721183801213
total_net_output_bytes:540529173206
instantaneous_input_kbps:758.12
instantaneous_output_kbps:7.99
rejected_connections:0
sync_full:90
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
expired_stale_perc:0.00
expired_time_cap_reached_count:0
evicted_keys:46461
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:865157
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0

Replication

role:slave
master_host:10.58.101.1
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:3720071642543
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:ebd48efb316f4c6962a6b7806be55cb55aa9d423
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:3720071642543
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1073741824
repl_backlog_first_byte_offset:3718997900720
repl_backlog_histlen:1073741824
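
The timeout seems to come from the child side of rewriteAppendOnlyFile() in aof.c, where the rewrite child asks the parent to stop sending diffs and then waits for the parent's ACK over the pipe: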

/* Ask the master to stop sending diffs. */
if (write(server.aof_pipe_write_ack_to_parent,"!",1) != 1) goto werr;
if (anetNonBlock(NULL,server.aof_pipe_read_ack_from_parent) != ANET_OK)
    goto werr;
/* We read the ACK from the server using a 10 seconds timeout. Normally
 * it should reply ASAP, but just in case we lose its reply, we are sure
 * the child will eventually get terminated. */
if (syncRead(server.aof_pipe_read_ack_from_parent,&byte,1,5000) != 1 ||
    byte != '!') goto werr;
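
For what it's worth, the comment says 10 seconds but the syncRead() call uses a 5000 ms timeout. If the parent never writes the '!' ACK back within that window, syncRead() presumably returns -1 with errno set to ETIMEDOUT, and the werr error path then produces the "Write error writing append only file on disk: Connection timed out" message seen in the replica's log, even though only the local pipe between the parent and the rewrite child is involved, not a network connection.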

Comment From: bpo

@anzhiruosu do you have logs from the master around the same time period? Looks like from the replica's perspective the master just stopped responding, as you noted. Could be a network error or possibly the master was locked up in a behavior that prevented it from responding.

Comment From: anzhiruosu

@bpo The master's logs show nothing unusual around the same time.

Is the pipe broken between the parent and child process? Or is the parent process so busy that it has no time to respond?

Comment From: bpo

@anzhiruosu There should at least be a notice in the primary DB's logs indicating when the primary lost its connection to the replica – that might help.

Is the pipe broken between the parent and child process?

It's not a "broken pipe", but a connection timeout. The timeout might be caused by the primary DB not responding, by some intermediary on the network connection swallowing packets and not replying, etc. Your RSS is large so my first guess is a slow fork seizing things up on one side or the other, but others here may have more ideas for you.
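
For reference, the INFO output above shows latest_fork_usec:865157, i.e. the last fork took roughly 0.87 seconds. As a rough illustration of what that metric captures, here is a hypothetical standalone sketch (not Redis code) that times fork() from the parent's side; the larger the process RSS, the more page-table state the kernel has to copy and the longer the stall:

/* Hypothetical standalone sketch, not Redis source: times a fork() from the
 * parent's side, roughly what latest_fork_usec is believed to report. */
#include <stdio.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

/* Microsecond wall-clock timestamp. */
static long long ustime(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long long)tv.tv_sec * 1000000 + tv.tv_usec;
}

int main(void) {
    long long start = ustime();
    pid_t pid = fork();
    if (pid == 0) _exit(0);                   /* child exits immediately */
    if (pid < 0) return 1;
    long long fork_usec = ustime() - start;   /* analogous to latest_fork_usec */
    printf("fork took %lld usec\n", fork_usec);
    waitpid(pid, NULL, 0);
    return 0;
}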

The best way to get more help is if you can share even more info – logs, full results of INFO, system specs, where the primary & replica are located, etc.

Comment From: anzhiruosu

@bpo Thanks for your help. I've added the master's logs.

The error log "09:22:27.490 # Write error writing append only file on disk: Connection timed out" is printed by the function rewriteAppendOnlyFile, which runs in the child process of the slave, so I guess the problem is with the pipe.
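
To illustrate that guess, here is a minimal standalone sketch (not Redis source; timed_read_byte() is a hypothetical stand-in for Redis' syncRead()) showing how a timed read on a local pipe can end up reported as "Connection timed out" when the other end never answers, assuming the reader sets errno to ETIMEDOUT on timeout the way syncRead() is believed to:

#include <errno.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for syncRead(): read one byte from fd, waiting at
 * most timeout_ms. Returns 1 on success, -1 with errno = ETIMEDOUT if
 * nothing arrives in time. */
static ssize_t timed_read_byte(int fd, char *byte, int timeout_ms) {
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);
    if (ready == 0) {             /* nothing readable before the deadline */
        errno = ETIMEDOUT;
        return -1;
    }
    if (ready < 0) return -1;     /* poll() itself failed */
    return read(fd, byte, 1);
}

int main(void) {
    int fds[2];
    char byte;
    if (pipe(fds) == -1) return 1;

    /* The "parent" end (fds[1]) never writes the '!' ACK, simulating a
     * parent that is too busy to answer within the window. */
    if (timed_read_byte(fds[0], &byte, 5000) != 1) {
        /* On Linux this prints: reading ACK from parent: Connection timed out */
        fprintf(stderr, "reading ACK from parent: %s\n", strerror(errno));
        return 1;
    }
    return 0;
}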

Comment From: bpo

@anzhiruosu thanks for the extra details! Did you share a different instance's info earlier? Previously the RSS that you shared was over 1TB, but the new data in the issue description shows only 21GB.

I think you are probably right that the connection timeout is originating on the pipe used to communicate to the AOF rewrite process. I thought by "broken pipe" you meant sigpipe.
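
For contrast, a minimal hypothetical sketch (not Redis code) of what a genuinely broken pipe looks like: writing to a pipe whose read end has been closed raises SIGPIPE, or fails with EPIPE ("Broken pipe") if the signal is ignored, which is a different failure mode from the ETIMEDOUT in the replica's log:

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    if (pipe(fds) == -1) return 1;
    signal(SIGPIPE, SIG_IGN);       /* turn the signal into an EPIPE error */
    close(fds[0]);                  /* close the read end: the pipe is now "broken" */
    if (write(fds[1], "!", 1) == -1) {
        /* On Linux this prints: write: Broken pipe */
        fprintf(stderr, "write: %s\n", strerror(errno));
    }
    return 0;
}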

Comment From: anzhiruosu

@bpo Hi, the 1TB RSS I mentioned earlier is the overall RSS; the 21GB in the slave INFO is for the single shard that has the error log. So I still don't know how the pipe connection timed out. Do you have any ideas?

Comment From: bpo

@anzhiruosu I'm not sure I can help you further right now. Please consider sharing the rest of the output from INFO. Details like the specific version, OS, etc., can all be useful context that will help people help you.