Crash report

From journald (unfortunately all I can get it to collect before the system halts)

BUG: Bad page state in process bio_aof   pfn:2d79ed

Additional information

  1. System: Ubuntu 24.04 LTS (Linux 6.8.0-51-generic)
  2. Repro Info
  3. This has been happening intermittently (twice in the past week) since upgrading from Redis 5.0.7 to Redis 7.2.5.
  4. The server that exhibits this behavior is running 10 redis processes with varying levels of utilization. LOTS of memory, CPU, and disk headroom.
  5. The two system halts this past week both happened during a background AOF rewrite for a Redis process that houses ~50GB of data (maxmemory set to 64GB) and is handling ~25k commands/sec. The base AOF file is sitting at around ~7.3 GB right now but may have been larger at repro time. Logs from the time of the AOF rewrite are at the end of the issue description.
  6. This behavior follows the master instance. Server A and Server B are set up as a replica pair (using sentinel for failover). Earlier in the week this happened on Server A when it was master which caused a failover to Server B. Earlier today it happened on Server B which was now running master.

reclaimFilePageCache? Looking at the code that is involved in background AOF rewrites it looks like posix_fadvise may be getting called on a file descriptor that is no longer valid after this merge: https://github.com/redis/redis/pull/11713/commits/86813a049983d39be63e944a536a9810340d9657 (merging https://github.com/redis/redis/commit/7dae142a2ebf909a63df13e5813c073c79be521f into https://github.com/redis/redis/pull/11713)

Line 257 in bio.c has the comment:

/* The fd may be closed by main thread and reused for another
             * socket, pipe, or file. We just ignore these errno because
             * aof fsync did not really fail. */

But it still goes ahead and calls reclaimFilePageCache on that file descriptor on line 277. Seems suspect but I am not familiar enough to tell if that could have this end result intermittently.


Logs Logs from the previous successful AOF rewrite up to the AOF rewrite where the halt happened (from earlier today):

576683:M 19 Jan 2025 05:43:18.817 * Starting automatic rewriting of AOF on 100% growth
576683:M 19 Jan 2025 05:43:18.818 * Creating AOF incr file redis-6310.aof.5336.incr.aof on background rewrite
576683:M 19 Jan 2025 05:43:20.360 * Background append only file rewriting started by pid 3699957
3699957:C 19 Jan 2025 05:49:21.405 * Successfully created the temporary AOF base file temp-rewriteaof-bg-3699957.aof
3699957:C 19 Jan 2025 05:49:22.771 * Fork CoW for AOF rewrite: current 13079 MB, peak 13079 MB, average 8026 MB
576683:M 19 Jan 2025 05:49:25.533 * Background AOF rewrite terminated with success
576683:M 19 Jan 2025 05:49:25.533 * Successfully renamed the temporary AOF base file temp-rewriteaof-bg-3699957.aof into redis-6310.aof.5336.base.rdb
576683:M 19 Jan 2025 05:49:25.534 * Removing the history file redis-6310.aof.5335.incr.aof in the background
576683:M 19 Jan 2025 05:49:25.534 * Removing the history file redis-6310.aof.5335.base.rdb in the background
576683:M 19 Jan 2025 05:49:25.535 * Background AOF rewrite finished successfully
576683:M 19 Jan 2025 06:20:08.009 * 1 changes in 3600 seconds. Saving...
576683:M 19 Jan 2025 06:20:09.647 * Background saving started by pid 3701951
3701951:C 19 Jan 2025 06:26:13.753 * DB saved on disk
3701951:C 19 Jan 2025 06:26:15.131 * Fork CoW for RDB: current 13445 MB, peak 13445 MB, average 8209 MB
576683:M 19 Jan 2025 06:26:17.930 * Background saving terminated with success
576683:M 19 Jan 2025 06:35:07.558 * Starting automatic rewriting of AOF on 100% growth
576683:M 19 Jan 2025 06:35:07.559 * Creating AOF incr file redis-6310.aof.5337.incr.aof on background rewrite
576683:M 19 Jan 2025 06:35:09.110 * Background append only file rewriting started by pid 3703268
3703268:C 19 Jan 2025 06:41:13.799 * Successfully created the temporary AOF base file temp-rewriteaof-bg-3703268.aof
3703268:C 19 Jan 2025 06:41:15.179 * Fork CoW for AOF rewrite: current 13346 MB, peak 13346 MB, average 8176 MB
576683:M 19 Jan 2025 06:41:17.960 * Background AOF rewrite terminated with success
576683:M 19 Jan 2025 06:41:17.960 * Successfully renamed the temporary AOF base file temp-rewriteaof-bg-3703268.aof into redis-6310.aof.5337.base.rdb
576683:M 19 Jan 2025 06:41:17.961 * Removing the history file redis-6310.aof.5336.incr.aof in the background
576683:M 19 Jan 2025 06:41:17.961 * Removing the history file redis-6310.aof.5336.base.rdb in the background
576683:M 19 Jan 2025 06:41:17.962 * Background AOF rewrite finished successfully
576683:M 19 Jan 2025 07:26:18.093 * 1 changes in 3600 seconds. Saving...
576683:M 19 Jan 2025 07:26:19.679 * Background saving started by pid 3705631
3705631:C 19 Jan 2025 07:30:57.233 * DB saved on disk
3705631:C 19 Jan 2025 07:30:58.590 * Fork CoW for RDB: current 8937 MB, peak 8937 MB, average 4961 MB
576683:M 19 Jan 2025 07:31:01.027 * Background saving terminated with success
576683:M 19 Jan 2025 07:31:01.129 * Starting automatic rewriting of AOF on 108% growth
576683:M 19 Jan 2025 07:31:01.130 * Creating AOF incr file redis-6310.aof.5338.incr.aof on background rewrite
576683:M 19 Jan 2025 07:31:02.491 * Background append only file rewriting started by pid 3705737

Comment From: sundb

@pilor please run redis-server --test-memory 4096 to see if the RAM has a HW problem.

Comment From: pilor

@pilor please run redis-server --test-memory 4096 to see if the RAM has a HW problem.

@sundb redis-server --test-memory 4096 passed successfully three times. I was suspicious about hardware failure as well but the failure follows the master redis process when it migrates to a new box and all tests of the hardware pass.