Redis [NEW] Document that SAVE, BGSAVE can silently produce non-viable RDB dumps

Hi there!

The problem/use-case that the feature addresses

I've been able to reliably reproduce at least one case where Redis 6.x and 7.x (at least) will consistently and silently produce an RDB dump file that fails redis-check-rdb validation and cannot be used to restore the database.

While I guess this must not be a common occurrence judging by the lack of mentions online, it would be really great to have at least the remote possibility mentioned in the Redis documentation - along with a recommended best practice to ensure that generated dumps are viable.

In my particular case, this behaviour was unexpected and costly - leading to 3 months' worth of daily backups being invalid. I was able to salvage the bulk of the data after a couple of days of manually processing the .RDB files, so the present issue is only about documentation to help prevent the same happening to others.

Description of the feature

Mention the possibility (even if it's unlikely) that RDB dump files generated by SAVE and BGSAVE might be non-viable, and suggest a best-practice method for ensuring that a given dump file is viable (e.g. via a successful redis-check-rdb validation if that's sufficient, or maybe a full dump load otherwise).
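As a rough illustration of the kind of check being suggested, here is a minimal Python sketch that runs redis-check-rdb against a dump file and treats a non-zero exit status as "not viable". The function name `rdb_is_viable` and its `checker` parameter are hypothetical and only exist for this example; it assumes redis-check-rdb is on the PATH and that it is invoked with the dump path as its only argument. Whether a passing check is sufficient on its own is exactly the open question above.

```python
import subprocess
import sys


def rdb_is_viable(rdb_path, checker=("redis-check-rdb",)):
    """Return True if the checker exits with status 0 for rdb_path.

    `checker` is the argv prefix of the validation tool. It is a parameter
    (an assumption of this sketch) so another command can be substituted.
    Raises FileNotFoundError if the checker binary is not installed.
    """
    result = subprocess.run(
        [*checker, str(rdb_path)],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Show the tail of the tool's output so the failure is not silent.
        print(result.stdout[-2000:], file=sys.stderr)
        print(result.stderr[-2000:], file=sys.stderr)
    return result.returncode == 0
```

A backup job could call `rdb_is_viable("/path/to/dump.rdb")` right after the dump is written and refuse to rotate out older backups when it returns False.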

I'd recommend a mention in at least:

The "Redis persistence" documentation
The SAVE command documentation
The BGSAVE command documentation

Thanks for all the awesome work on Redis! Cheers :-)

Comment From: enjoy-binbin

I've been able to create a reproducible example (using a private dataset, unfortunately) that consistently shows both Redis 6.x and 7.x SAVE and BGSAVE commands producing an RDB dump that fails redis-check-rdb and cannot be loaded.

Hi, is there any chance you could share the data? For example, you could minimize the dataset, or send it to us privately via email.

Comment From: oranagra

We are not aware of such a case and we'd obviously prefer to fix the problem rather than document it. Please provide some details as to what was the problem in the rdb file (was it anything other than the ziplist integrity check?). How did you solve it, and can you help us reproduce it?

If there are any usage patterns that lead to heap corruption, we prefer the details to be reported privately to redis@redis.io due to security concerns.

Comment From: ptaoussanis

@enjoy-binbin @oranagra Hi Binbin, Oran-

Thanks for the quick response 👍 I was interpreting feedback from the Google Group as suggesting that it might in general be necessary to attempt to restore a dump before concluding that it is viable.

If that's not the case, then this issue (for documentation) isn't relevant.
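For reference, "attempting to restore a dump" could look roughly like the Python sketch below: copy the dump into a scratch directory, start a throwaway redis-server pointed at the copy, and see whether it comes up and answers PING or exits during loading. The function names, the port 46390, and the timeout are all hypothetical choices for this sketch. It assumes redis-server is on the PATH, that the chosen port is free (otherwise another server could answer the PING), and that the server exits when it cannot load the RDB file.

```python
import shutil
import socket
import subprocess
import tempfile
import time
from pathlib import Path


def load_test_argv(work_dir, port, server="redis-server"):
    """Build the command line for a throwaway, non-persisting server."""
    return [
        server,
        "--port", str(port),
        "--bind", "127.0.0.1",
        "--dir", str(work_dir),
        "--dbfilename", "dump.rdb",
        "--appendonly", "no",
        "--save", "",  # no save points, so the copy is never rewritten
    ]


def ping(port, timeout=1.0):
    """Return True if something on localhost:port answers PING with +PONG."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout) as s:
            s.sendall(b"PING\r\n")
            return s.recv(64).startswith(b"+PONG")
    except OSError:
        return False


def rdb_loads(rdb_path, port=46390, wait_seconds=60.0, server="redis-server"):
    """Return True if a scratch server loads a copy of rdb_path and answers PING."""
    with tempfile.TemporaryDirectory() as tmp:
        shutil.copy(rdb_path, Path(tmp) / "dump.rdb")
        proc = subprocess.Popen(
            load_test_argv(tmp, port, server),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        try:
            deadline = time.monotonic() + wait_seconds
            while time.monotonic() < deadline:
                if proc.poll() is not None:
                    return False  # server exited, so the load did not succeed
                if ping(port):
                    return True
                time.sleep(0.2)
            return False  # still not answering after wait_seconds
        finally:
            proc.terminate()
            proc.wait()
```

Working on a copy in a temporary directory keeps the original backup untouched even if the scratch server misbehaves. For large datasets `wait_seconds` would need to be raised, since loading can take a while.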

We are not aware of such a case and we'd obviously prefer to fix the problem rather than document it.

Of course. Unfortunately I wasn't able to produce a minimal reproducible example when I was initially debugging this, and I can't share the data since there's a lot of sensitive user data in there.

I realise this doesn't give you much to work with, my apologies. Since this seems to be such a rare occurrence anyway, you're welcome to just close this and I'll open a new issue focused on the corruption if I'm able to create a minimal repro later. I'm swamped for the next few weeks, so I may not get an opportunity to try soon.

Please provide some details as to what was the problem in the rdb file

Appeared to be only the Ziplist integrity check, and nothing else:

Internal error in RDB reading offset 0, function at rdb.c:2080 -> Ziplist integrity check failed.

How did you solve it

First I modified this open-source tool to try and skip over invalid Ziplists when converting an RDB to AOF file.
Then I wrote a small tool to parse the resulting AOF file entry-by-entry to look for anything out-of-place - e.g. incorrect bulk counts, etc.
That identified one location that appeared to be malformed.
I manually edited the AOF file to remove the malformed section (~27k bytes).
After that the tool was reporting no problems.
Redis 7's redis-check-aof was still complaining about the AOF manifest, but Redis 6's redis-check-aof was satisfied. Since the DB was originally <7 anyway, I just presumed this has something to do with 7's Multi-Part AOF format and I imported the AOF into a Redis 6 instance.
I then used a small util to relocate the keys to the necessary DBs using MOVE.
I then did a DB dump from Redis 6 to produce a new .RDB file.
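The entry-by-entry AOF check from the steps above is not shown in this thread, but the idea can be sketched in Python as follows. This is a hypothetical reconstruction, not the tool that was actually used: it assumes a plain RESP-format AOF (no RDB preamble and no annotation lines), walks it command by command, and reports the byte offset of the first command whose argument count or bulk lengths do not match the data that follows.

```python
import io


def parse_len(line, prefix):
    """Parse a RESP header such as b'*3\\r\\n' or b'$5\\r\\n' and return the number."""
    if not (line.startswith(prefix) and line.endswith(b"\r\n")):
        raise ValueError("bad header")
    digits = line[1:-2]
    if not digits.isdigit():
        raise ValueError("bad length")
    return int(digits)


def find_first_malformed(stream):
    """Return the byte offset of the first malformed command, or None if clean.

    `stream` must be a seekable binary file object positioned at the start.
    """
    while True:
        start = stream.tell()
        header = stream.readline()
        if not header:
            return None  # reached end of file without finding a problem
        try:
            argc = parse_len(header, b"*")
            for _ in range(argc):
                size = parse_len(stream.readline(), b"$")
                payload = stream.read(size + 2)  # bulk data plus trailing CRLF
                if len(payload) != size + 2 or payload[-2:] != b"\r\n":
                    raise ValueError("bulk length mismatch")
        except ValueError:
            return start
```

For example, `find_first_malformed(open("appendonly.aof", "rb"))` (file name assumed) would give the offset at which to start inspecting the file by hand. The sketch stops at the first problem; finding where valid data resumes after a damaged region would need a resynchronisation step that is not shown here.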

The resulting .RDB file thankfully turned out to be good, and appears to have included the vast majority of the original data.

I can't tell for sure how much was lost since I have no healthy backups for comparison, but nothing obvious was missing.

I hope at least some of that info is helpful in the meantime, until I can (hopefully) provide a minimal repro.

Please feel free to close, and thanks again for your time.

Comment From: oranagra

I was interpreting feedback from the Google Group as suggesting that it might in general be necessary to attempt to restore a dump before concluding that it is viable.

well, it is always a good idea to double check (and triple check) things, but AFAIK there's no specific reason to do that.

Then I wrote a small tool to parse the resulting AOF file entry-by-entry to look for anything out-of-place - e.g. incorrect bulk counts, etc.

you mean that some commands in that AOF were invalid? (i.e. bad protocol that will not be processed). which data types were these? which tool did you use to convert the RDB to AOF?

in any case, it seems like there was some heap corruption on the source, and without being able to reproduce it, we have no hope of finding the root cause. let's close this and re-open if more info becomes available.

Comment From: ptaoussanis

well, it is always a good idea to double check (and triple check) things, but AFAIK there's no specific reason to do that.

I'll leave it to you to decide whether it's worth making this suggestion in the docs. My vote would at least be a minor addition to the "Backing up Redis data" section of the "Redis persistence" docs.

These docs already include general best-practice suggestions like moving backups off-site, so I believe this wouldn't be out-of-scope.

you mean that some commands in that AOF were invalid? (i.e. bad protocol that will not be processed). which data types were these? which tool did you use to convert the RDB to AOF?

I mean that the AOF produced by the tool contained a malformed section, presumably following the skipped entry. This isn't a surprise given that the tool wasn't originally designed to skip bad data. I used this tool; more info is in my previous comment.

Comment From: itamarhaber

I'll leave it to you to decide whether it's worth making this suggestion in the docs. My vote would at least be a minor addition to the "Backing up Redis data" section of the "Redis persistence" docs.

Seconding that vote - please stay tuned.