Hi Redis community :wave:
We have a problem on a server that is running our Redis instance.
We use AOF persistence, and here is the basic information I've been able to gather:
- the file is ~90 GB
- loading the file back into the Redis server takes several hours (tested on another non-production server with a snapshot of the production Redis server)
- the redis-check-aof utility reports error 247; when I run it with a version newer than Redis 6.0.6 the check passes (however, I don't think this is the issue) - see the command sketch after this list
- after the AOF file is loaded into the Redis server and I run BGREWRITEAOF, it shrinks down to ~250 MB (tested on another non-production server)
- there seems to be a BGSAVE of the RDB file in progress (I don't know why); see the rdb_current_bgsave_time_sec value in the log below
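For reference, the integrity check mentioned above was run roughly like this (the file path is illustrative, not the actual path on our server):

redis-check-aof /var/lib/redis/appendonly.aof

Running the same check with a redis-check-aof binary from a newer release reports no error.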
We want to shrink the AOF file, but without restarting the whole Redis server and reloading the huge AOF file (which takes anywhere between 4 and 12 hours).
INFO PERSISTENCE output:
# Persistence
loading:0
rdb_changes_since_last_save:13513922
rdb_bgsave_in_progress:1
rdb_last_save_time:1601779570
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:2
rdb_current_bgsave_time_sec:51074033
rdb_last_cow_size:2740224
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:1
aof_last_rewrite_time_sec:2
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:5332992
module_fork_in_progress:0
module_fork_last_cow_size:0
aof_current_size:92554403907
aof_base_size:96046017
aof_pending_rewrite:1
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0
I've searched for a similar issue here and on Stack Overflow, but I might have overlooked something.
Thank you for any ideas on how to solve this issue :slightly_smiling_face:
Redis version: 6.0.6
Comment From: oranagra
Seems like you need to run BGREWRITEAOF, or reduce the auto-aof-rewrite-percentage config, which for some reason doesn't kick in.
i.e. that will start a background process that writes the current content of memory as a compact AOF file.
I'm not sure what you wanted to point out regarding rdb_current_bgsave_time_sec (that's how long the RDB save has been running).
However, maybe I'm completely missing your point, in which case please elaborate.
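As a minimal sketch from redis-cli (the threshold value below is illustrative, not necessarily what you should use):

127.0.0.1:6379> BGREWRITEAOF
127.0.0.1:6379> CONFIG SET auto-aof-rewrite-percentage 100

The percentage compares the current AOF size against aof_base_size; with a ~90 GB file over a ~96 MB base, any sane threshold should trigger a rewrite immediately.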
Comment From: oranagra
Please share the output of INFO MEMORY, INFO PERSISTENCE, and CONFIG GET auto-aof* so we can figure out why it didn't do an automatic rewrite sooner.
Comment From: filipholec
Hi @oranagra - thanks for your input. Here are the stats you requested:
127.0.0.1:6379> INFO MEMORY
# Memory
used_memory:806068920
used_memory_human:768.73M
used_memory_rss:1044463616
used_memory_rss_human:996.08M
used_memory_peak:1253551200
used_memory_peak_human:1.17G
used_memory_peak_perc:64.30%
used_memory_overhead:986617
used_memory_startup:847872
used_memory_dataset:805082303
used_memory_dataset_perc:99.98%
allocator_allocated:810419888
allocator_active:972541952
allocator_resident:1008521216
total_system_memory:8363700224
total_system_memory_human:7.79G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:1.20
allocator_frag_bytes:162122064
allocator_rss_ratio:1.04
allocator_rss_bytes:35979264
rss_overhead_ratio:1.04
rss_overhead_bytes:35942400
mem_fragmentation_ratio:1.30
mem_fragmentation_bytes:238435720
mem_not_counted_for_evict:1112
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:137377
mem_aof_buffer:1112
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0
127.0.0.1:6379> INFO PERSISTENCE
# Persistence
loading:0
rdb_changes_since_last_save:13519325
rdb_bgsave_in_progress:1
rdb_last_save_time:1601779570
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:2
rdb_current_bgsave_time_sec:51090122
rdb_last_cow_size:2740224
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:1
aof_last_rewrite_time_sec:2
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:5332992
module_fork_in_progress:0
module_fork_last_cow_size:0
aof_current_size:92654671472
aof_base_size:96046017
aof_pending_rewrite:1
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0
However, the CONFIG GET auto-aof* command fails in redis-cli:
127.0.0.1:6379> CONFIG GET auto-aof*
(error) ERR unknown command `CONFIG`, with args beginning with: `GET`, `auto-aof*`,
When I try to run BGREWRITEAOF, the output is:
Background append only file rewriting scheduled
But it never executes (probably because the RDB BGSAVE that has been running for over a year is still in progress).
Comment From: oranagra
Ohh, I didn't notice rdb_bgsave_in_progress:1.
So that's what's holding back the rewrites.
I don't understand why CONFIG GET failed. Which version of Redis are you using? Is it a modified version or vanilla?
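Just a guess on my side, not confirmed by anything in this thread: one way a stock redis-server can return that unknown-command error for CONFIG is if the command was renamed or disabled via rename-command in redis.conf, e.g.:

rename-command CONFIG ""

Renaming a command to the empty string removes it entirely, so any client calling it gets the unknown-command reply.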
I wonder if you still have the log file from when this bgsave was started and got stuck. Please try to search for it so that we can figure out why it happened. Please also post INFO ALL here, or at least INFO REPLICATION; maybe it'll help.
But either way, let's kill it.
Find its PID with ps aux | grep redis-rdb-bgsave
Then let's kill it with SIGSEGV so that it prints a stack trace to the log file and we can see what it was doing:
kill -11 <pid of child>. Please be sure to send it to the child PID (should be harmless) and not to the parent or any other process (which would cause data loss).
This should hopefully cause the AOF rewrites to start working again, and then we'll see whether they hang on the same issue.
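Putting those two steps together as a small shell sketch (the PID is a placeholder, take it from the ps output):

ps aux | grep redis-rdb-bgsave     # note the PID of the child save process
kill -11 <child pid>               # SIGSEGV the child only, never the main redis-server

If the signal works as intended, the child's crash handler should write a stack trace to the Redis log file, and the parent will notice the background save terminated and clear rdb_bgsave_in_progress.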
Comment From: filipholec
Great, thanks @oranagra for the comment - after killing the redis-rdb-bgsave process, I was able to run BGREWRITEAOF :slightly_smiling_face:
Oh, and the Redis version is 6.0.6. Since I already killed the process, the output of INFO ALL is probably no longer relevant. I couldn't find the problem in the logs, since they seem to be redirected to /dev/null.
We can close the issue now.
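As a side note for next time: to keep the server log (and the crash report a SIGSEGV would have produced), redis.conf can point the log to a real file instead of discarding it; the path below is just an example:

logfile /var/log/redis/redis-server.log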
Comment From: oranagra
@filipholec did you use -11 when you killed it, as I advised? In that case we should have a crash log indicating where it was hung.
Comment From: filipholec
@oranagra Yes, I killed it with -11, but as I wrote, there was no Redis-related crash log present on the filesystem.
Comment From: oranagra
Ohh, right, you said the log was redirected to /dev/null. So I guess we'll never know...