I've recently installed redis 3.0.6 in Azure and I've finding an issue where redis is not unlinking AOF files when it does a rewrite, causing a serious issue where it very quickly fills up the disk
... -rw-rw---- 1 redis redis 26807791546 Feb 4 15:17 temp-rewriteaof-34196.aof -rw-rw---- 1 redis redis 24118693105 Feb 4 15:40 temp-rewriteaof-39638.aof -rw-rw---- 1 redis redis 26998498671 Feb 4 16:04 appendonly.aof -rw-rw---- 1 redis redis 20559924126 Feb 4 16:26 temp-rewriteaof-49372.aof -rw-rw---- 1 redis redis 14787871570 Feb 4 16:50 dump.rdb drwxr-xr-x 2 redis redis 4096 Feb 4 16:57 . -rw-rw---- 1 redis redis 6569108251 Feb 4 17:00 temp-rewriteaof-58048.aof ...
... 1516:S 24 Jan 14:05:00.753 * Background saving terminated with success 1516:S 24 Jan 15:00:02.024 * Background saving started by pid 10979 10979:C 24 Jan 15:04:57.031 * DB saved on disk 10979:C 24 Jan 15:04:57.484 * RDB: 3503 MB of memory used by copy-on-write 1516:S 24 Jan 15:04:58.532 * Background saving terminated with success 1516:S 24 Jan 16:00:01.684 * Background saving started by pid 23140 23140:C 24 Jan 16:04:56.223 * DB saved on disk 23140:C 24 Jan 16:04:56.694 * RDB: 3520 MB of memory used by copy-on-write 1516:S 24 Jan 16:04:57.905 * Background saving terminated with success 1516:S 24 Jan 17:00:02.362 * Background saving started by pid 35315 35315:C 24 Jan 17:04:56.993 * DB saved on disk 35315:C 24 Jan 17:04:57.449 * RDB: 3565 MB of memory used by copy-on-write 1516:S 24 Jan 17:04:58.529 * Background saving terminated with success 1516:S 24 Jan 18:00:02.079 * Background saving started by pid 47485 47485:C 24 Jan 18:05:01.446 * DB saved on disk 47485:C 24 Jan 18:05:01.902 * RDB: 2674 MB of memory used by copy-on-write ...skipping... 1516:S 04 Feb 15:00:03.607 * Background append only file rewriting started by pid 34196 1516:S 04 Feb 15:17:20.922 # Background AOF rewrite terminated by signal 9 1516:S 04 Feb 15:17:21.023 * 1000 changes in 600 seconds. Saving... 1516:S 04 Feb 15:17:21.496 * Background saving started by pid 37719 37719:C 04 Feb 15:26:53.148 * DB saved on disk 37719:C 04 Feb 15:26:53.575 * RDB: 5504 MB of memory used by copy-on-write 1516:S 04 Feb 15:26:55.278 * Background saving terminated with success 1516:S 04 Feb 15:26:55.378 * Starting automatic rewriting of AOF on 171% growth 1516:S 04 Feb 15:26:55.857 * Background append only file rewriting started by pid 39638 1516:S 04 Feb 15:40:56.882 # Background AOF rewrite terminated by signal 9 1516:S 04 Feb 15:40:56.983 * 1000 changes in 600 seconds. Saving... 1516:S 04 Feb 15:40:57.485 * Background saving started by pid 42475 42475:C 04 Feb 15:48:01.292 * DB saved on disk 42475:C 04 Feb 15:48:01.784 * RDB: 4868 MB of memory used by copy-on-write 1516:S 04 Feb 15:48:03.372 * Background saving terminated with success 1516:S 04 Feb 15:48:03.472 * Starting automatic rewriting of AOF on 171% growth 1516:S 04 Feb 15:48:03.958 * Background append only file rewriting started by pid 43922 1516:S 04 Feb 16:04:01.313 * AOF rewrite child asks to stop sending diffs. 43922:C 04 Feb 16:04:01.314 * Parent agreed to stop sending diffs. Finalizing AOF... 43922:C 04 Feb 16:04:01.314 * Concatenating 0.00 MB of AOF diff received from parent. 43922:C 04 Feb 16:04:01.315 * SYNC append only file rewrite performed 43922:C 04 Feb 16:04:01.739 * AOF rewrite: 7701 MB of memory used by copy-on-write 1516:S 04 Feb 16:04:03.591 * Background AOF rewrite terminated with success 1516:S 04 Feb 16:04:03.591 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB) 1516:S 04 Feb 16:04:03.591 * Background AOF rewrite finished successfully 1516:S 04 Feb 16:04:03.691 * 1000 changes in 600 seconds. Saving... 1516:S 04 Feb 16:04:04.216 * Background saving started by pid 47188 47188:C 04 Feb 16:14:51.910 * DB saved on disk 47188:C 04 Feb 16:14:52.318 * RDB: 6147 MB of memory used by copy-on-write 1516:S 04 Feb 16:14:53.986 * Background saving terminated with success 1516:S 04 Feb 16:14:54.086 * Starting automatic rewriting of AOF on 171% growth 1516:S 04 Feb 16:14:54.562 * Background append only file rewriting started by pid 49372 1516:S 04 Feb 16:26:39.229 # Background AOF rewrite terminated by signal 9 1516:S 04 Feb 16:26:39.330 * 1000 changes in 600 seconds. Saving... 1516:S 04 Feb 16:26:39.823 * Background saving started by pid 51765 51765:C 04 Feb 16:33:02.874 * DB saved on disk 51765:C 04 Feb 16:33:03.318 * RDB: 4648 MB of memory used by copy-on-write 1516:S 04 Feb 16:33:04.884 * Background saving terminated with success 1516:S 04 Feb 16:33:04.984 * Starting automatic rewriting of AOF on 171% growth 1516:S 04 Feb 16:33:05.510 * Background append only file rewriting started by pid 53091 53091:C 04 Feb 16:39:22.645 # Write error writing append only file on disk: No space left on device 1516:S 04 Feb 16:39:24.620 # Background AOF rewrite terminated with error 1516:S 04 Feb 16:39:24.720 * Starting automatic rewriting of AOF on 171% growth 1516:S 04 Feb 16:39:25.229 * Background append only file rewriting started by pid 54400 54400:C 04 Feb 16:45:46.226 # Write error writing append only file on disk: No space left on device 1516:S 04 Feb 16:45:48.176 # Background AOF rewrite terminated with error 1516:S 04 Feb 16:45:48.276 * 1000 changes in 600 seconds. Saving... 1516:S 04 Feb 16:45:48.757 * Background saving started by pid 55696 55696:C 04 Feb 16:50:38.723 * DB saved on disk 55696:C 04 Feb 16:50:39.200 * RDB: 3074 MB of memory used by copy-on-write 1516:S 04 Feb 16:50:40.510 * Background saving terminated with success 1516:S 04 Feb 16:50:40.610 * Starting automatic rewriting of AOF on 171% growth 1516:S 04 Feb 16:50:41.110 * Background append only file rewriting started by pid 56682 56682:C 04 Feb 16:57:17.960 # Write error writing append only file on disk: No space left on device 1516:S 04 Feb 16:57:20.023 # Background AOF rewrite terminated with error 1516:S 04 Feb 16:57:20.124 * Starting automatic rewriting of AOF on 171% growth 1516:S 04 Feb 16:57:20.615 * Background append only file rewriting started by pid 58048 ...
Nothing is logged to say that an unlink failed.
Comment From: antirez
Sorry we don't offer support for the Windows port since the code base is different. Please report to the Microsoft open source team.
Comment From: antirez
Sorry I was not aware Azure does Linux as well...
So here the issue is that: 1. AOF gets killed by signal 9 by another process, for some reason. Do you have an idea why this happens? 2. The disk is full after some time because of "1".
So the problem appears to be we should cleanup the temp file when the AOF rewriting process fails. However this does not fix, in your specific case, the root issue that appears to be the process is killed by signal 9. The reason why the process is killed in your environment could be the Linux OOM killer.
Comment From: aijanai
Hi, I'm experiencing the same on AWS, Ubuntu 16.04, Redis 3.2.8.
In my case, it's probably due to too much memory allocated to Redis itself, leaving too few for background tasks like AOF and/or dumps. This caused the disk to fill up due to these files being left around.
I'd strong agree with purging these files upon failure. But it would be even better if Redis spared some of the memory dedicated to the cache itself for the housekeeping tasks like the AOF/dump; this way I wouldn't have to bear in mind to leave some memory to other things that are not Redis so that Redis can take care of itself :)
Comment From: jameshartig
We've been having this happen to us. In the most recent case, the temp files were taking up 15gb on the drive within minutes (see timestamps below) and the server ran out of space so all new rewrites were now failing because there was no disk space. While I agree the the "root" of the problem is that the process is getting killed by OOM but even if the memory problem gets fixed, the box will continue to have rewrites fail since it has 0 bytes left on the disk.
Additionally, we've been limiting the redis memory to less than half of the system memory (via maxmemory) and yet we still have redis running out of memory during rewrites. Here is the info from a server when it was getting killed:
# Memory
used_memory:2976142808
used_memory_human:2.77G
used_memory_rss:3131654144
used_memory_rss_human:2.92G
used_memory_peak:2993938240
used_memory_peak_human:2.79G
used_memory_peak_perc:99.41%
used_memory_overhead:223063574
used_memory_startup:1437280
used_memory_dataset:2753079234
used_memory_dataset_perc:92.55%
total_system_memory:3879686144
total_system_memory_human:3.61G
used_memory_lua:228352
used_memory_lua_human:223.00K
maxmemory:1479000000
maxmemory_human:1.38G
maxmemory_policy:volatile-ttl
mem_fragmentation_ratio:1.05
mem_allocator:jemalloc-4.0.3
active_defrag_running:0
lazyfree_pending_objects:0
# Persistence
loading:0
rdb_changes_since_last_save:1848916931
rdb_bgsave_in_progress:0
rdb_last_save_time:1509766804
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:0
aof_enabled:1
aof_rewrite_in_progress:1
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:21
aof_current_rewrite_time_sec:9
aof_last_bgrewrite_status:err
aof_last_write_status:ok
aof_last_cow_size:274317312
aof_current_size:1597469277
aof_base_size:1236816408
aof_pending_rewrite:0
aof_buffer_length:75
aof_rewrite_buffer_length:75
aof_pending_bio_fsync:0
aof_delayed_fsync:0
# Stats
total_connections_received:169124
total_commands_processed:3735043418
instantaneous_ops_per_sec:1975
total_net_input_bytes:397958815893
total_net_output_bytes:908896345
instantaneous_input_kbps:242.86
instantaneous_output_kbps:0.03
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
evicted_keys:40062032
keyspace_hits:200962867
keyspace_misses:447936904
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:68183
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
Here's the directory listing mentioned earlier:
-rw-r--r-- 1 redis redis 540791321 Dec 2 15:58 temp-rewriteaof-12007.aof
-rw-r--r-- 1 redis redis 314519788 Dec 2 15:58 temp-rewriteaof-12008.aof
-rw-r--r-- 1 redis redis 417235610 Dec 2 15:58 temp-rewriteaof-12009.aof
-rw-r--r-- 1 redis redis 340607318 Dec 2 15:59 temp-rewriteaof-12010.aof
-rw-r--r-- 1 redis redis 219148429 Dec 2 15:59 temp-rewriteaof-12011.aof
-rw-r--r-- 1 redis redis 588603833 Dec 2 15:59 temp-rewriteaof-12014.aof
-rw-r--r-- 1 redis redis 197111831 Dec 2 15:59 temp-rewriteaof-12104.aof
-rw-r--r-- 1 redis redis 527995194 Dec 2 15:59 temp-rewriteaof-12112.aof
-rw-r--r-- 1 redis redis 154562594 Dec 2 15:59 temp-rewriteaof-12114.aof
-rw-r--r-- 1 redis redis 345768587 Dec 2 16:00 temp-rewriteaof-12115.aof
-rw-r--r-- 1 redis redis 327127190 Dec 2 16:00 temp-rewriteaof-12268.aof
-rw-r--r-- 1 redis redis 366002348 Dec 2 16:00 temp-rewriteaof-12269.aof
-rw-r--r-- 1 redis redis 331579392 Dec 2 16:00 temp-rewriteaof-12270.aof
-rw-r--r-- 1 redis redis 258134207 Dec 2 16:00 temp-rewriteaof-12274.aof
-rw-r--r-- 1 redis redis 422592512 Dec 2 16:00 temp-rewriteaof-12275.aof
-rw-r--r-- 1 redis redis 207372397 Dec 2 16:01 temp-rewriteaof-12276.aof
-rw-r--r-- 1 redis redis 237015321 Dec 2 16:01 temp-rewriteaof-12545.aof
-rw-r--r-- 1 redis redis 196804762 Dec 2 16:01 temp-rewriteaof-12546.aof
-rw-r--r-- 1 redis redis 510542453 Dec 2 16:01 temp-rewriteaof-12547.aof
-rw-r--r-- 1 redis redis 108498971 Dec 2 16:01 temp-rewriteaof-12548.aof
-rw-r--r-- 1 redis redis 412852546 Dec 2 16:01 temp-rewriteaof-12549.aof
-rw-r--r-- 1 redis redis 267894784 Dec 2 16:01 temp-rewriteaof-12550.aof
-rw-r--r-- 1 redis redis 211525974 Dec 2 16:02 temp-rewriteaof-12552.aof
-rw-r--r-- 1 redis redis 330956915 Dec 2 16:02 temp-rewriteaof-12553.aof
-rw-r--r-- 1 redis redis 204767232 Dec 2 16:02 temp-rewriteaof-12896.aof
-rw-r--r-- 1 redis redis 230748322 Dec 2 16:02 temp-rewriteaof-12897.aof
-rw-r--r-- 1 redis redis 377647292 Dec 2 16:02 temp-rewriteaof-12905.aof
-rw-r--r-- 1 redis redis 289579205 Dec 2 16:02 temp-rewriteaof-12906.aof
-rw-r--r-- 1 redis redis 187977952 Dec 2 16:02 temp-rewriteaof-12907.aof
-rw-r--r-- 1 redis redis 432414843 Dec 2 16:02 temp-rewriteaof-12908.aof
-rw-r--r-- 1 redis redis 100585734 Dec 2 16:02 temp-rewriteaof-12911.aof
-rw-r--r-- 1 redis redis 213737585 Dec 2 16:03 temp-rewriteaof-12912.aof
-rw-r--r-- 1 redis redis 233083081 Dec 2 16:03 temp-rewriteaof-12913.aof
-rw-r--r-- 1 redis redis 207515697 Dec 2 16:03 temp-rewriteaof-12914.aof
-rw-r--r-- 1 redis redis 204947486 Dec 2 16:03 temp-rewriteaof-12916.aof
-rw-r--r-- 1 redis redis 300081504 Dec 2 16:03 temp-rewriteaof-12917.aof
-rw-r--r-- 1 redis redis 302199053 Dec 2 16:03 temp-rewriteaof-13427.aof
-rw-r--r-- 1 redis redis 79614075 Dec 2 16:03 temp-rewriteaof-13428.aof
-rw-r--r-- 1 redis redis 233951625 Dec 2 16:03 temp-rewriteaof-13436.aof
-rw-r--r-- 1 redis redis 413949952 Dec 2 16:03 temp-rewriteaof-13437.aof
-rw-r--r-- 1 redis redis 87371819 Dec 2 16:03 temp-rewriteaof-13438.aof
-rw-r--r-- 1 redis redis 177389568 Dec 2 16:04 temp-rewriteaof-13439.aof
-rw-r--r-- 1 redis redis 244347134 Dec 2 16:04 temp-rewriteaof-13440.aof
-rw-r--r-- 1 redis redis 80257156 Dec 2 16:04 temp-rewriteaof-13441.aof
-rw-r--r-- 1 redis redis 221667628 Dec 2 16:04 temp-rewriteaof-13442.aof
-rw-r--r-- 1 redis redis 105742351 Dec 2 16:04 temp-rewriteaof-13443.aof
-rw-r--r-- 1 redis redis 282427474 Dec 2 16:04 temp-rewriteaof-13444.aof
-rw-r--r-- 1 redis redis 262139904 Dec 2 16:04 temp-rewriteaof-14109.aof
-rw-r--r-- 1 redis redis 58089485 Dec 2 16:04 temp-rewriteaof-14111.aof
-rw-r--r-- 1 redis redis 245375123 Dec 2 16:04 temp-rewriteaof-14112.aof
-rw-r--r-- 1 redis redis 342102322 Dec 2 16:04 temp-rewriteaof-14113.aof
-rw-r--r-- 1 redis redis 98598937 Dec 2 16:04 temp-rewriteaof-14115.aof
-rw-r--r-- 1 redis redis 94441550 Dec 2 16:04 temp-rewriteaof-14116.aof
-rw-r--r-- 1 redis redis 192569551 Dec 2 16:05 temp-rewriteaof-14124.aof
-rw-r--r-- 1 redis redis 181596160 Dec 2 16:05 temp-rewriteaof-14125.aof
-rw-r--r-- 1 redis redis 75735040 Dec 2 16:05 temp-rewriteaof-14126.aof
-rw-r--r-- 1 redis redis 211935350 Dec 2 16:05 temp-rewriteaof-14129.aof
-rw-r--r-- 1 redis redis 121929973 Dec 2 16:05 temp-rewriteaof-14933.aof
-rw-r--r-- 1 redis redis 236302501 Dec 2 16:05 temp-rewriteaof-14934.aof
-rw-r--r-- 1 redis redis 265941182 Dec 2 16:05 temp-rewriteaof-14935.aof
-rw-r--r-- 1 redis redis 141680673 Dec 2 16:05 temp-rewriteaof-14936.aof
-rw-r--r-- 1 redis redis 56217665 Dec 2 16:05 temp-rewriteaof-14937.aof
-rw-r--r-- 1 redis redis 164819186 Dec 2 16:05 temp-rewriteaof-14938.aof
-rw-r--r-- 1 redis redis 101142528 Dec 2 16:05 temp-rewriteaof-14943.aof
Comment From: UncleTawnos
we're experiencing the same issue with redis-server 3:3.2.5-2~bpo8+1 debian, proxmox vms
Comment From: step76
We're experiencing the same issue with redis 5.0.5.
Could be #3504 related?
Comment From: Sopamo
We are having the same issue on Azure in a Kubernetes (AKS) environment. The issue happens right after we execute a FLUSHDB command and start writing (a lot of) data afterwards. After the disk is full, redis then failes to start at all. I'm posting the crashlog here, because maybe some of the warnings regarding server settings are relevant here?
==> ** Starting Redis ** 1:C 05 Apr 2020 07:24:00.259 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo 1:C 05 Apr 2020 07:24:00.259 # Redis version=5.0.5, bits=64, commit=00000000, modified=0, pid=1, just started 1:C 05 Apr 2020 07:24:00.259 # Configuration loaded 1:M 05 Apr 2020 07:24:00.260 * Running mode=standalone, port=6379. 1:M 05 Apr 2020 07:24:00.260 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128. 1:M 05 Apr 2020 07:24:00.260 # Server initialized 1:M 05 Apr 2020 07:24:00.260 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled. 1:M 05 Apr 2020 07:24:00.261 * Reading RDB preamble from AOF file... 1:M 05 Apr 2020 07:24:09.569 * Reading the remaining AOF tail... 1:M 05 Apr 2020 07:24:18.948 * DB loaded from append only file: 18.688 seconds 1:M 05 Apr 2020 07:24:18.949 * Ready to accept connections 1:M 05 Apr 2020 07:24:19.051 # Short write while writing to the AOF file: (nwritten=41, expected=97) 1:signal-handler (1586071484) Received SIGTERM scheduling shutdown... 1:M 05 Apr 2020 07:24:44.065 # User requested shutdown... 1:M 05 Apr 2020 07:24:44.065 * Calling fsync() on the AOF file. 1:M 05 Apr 2020 07:24:44.073 # Redis is now ready to exit, bye bye...
Comment From: yoav-steinberg
Closing as this was fixed in #5920.