Several follow-up items that could potentially improve the performance of https://github.com/redis/redis/pull/11695:
- [ ] Batch the index tree (Fenwick tree) updates for multi-key operations instead of updating per key (https://github.com/redis/redis/pull/11695#discussion_r1292811014). Is this something we believe needs to be done now? [Madelyn votes no] MSET performance has already improved, and this enhancement can be taken up in isolation.
- [x] Performance optimization for the index tree to find a fair (random) item in O(log(N)) instead of O(log^2(N)) (a follow-up seems okay here?): https://github.com/redis/redis/pull/12704 (see the sketch after this list)
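For context, here is a minimal standalone sketch of the two operations these items refer to, using a Fenwick (binary indexed) tree over the 16384 cluster slots. This is illustrative only: the function names are made up and this is not the actual Redis code (the real per-slot update is the cumulativeKeyCountAdd path discussed below).

```c
#include <stdio.h>

#define NUM_SLOTS 16384 /* Redis Cluster hash slots; conveniently a power of two */

static long long tree[NUM_SLOTS + 1]; /* 1-based Fenwick tree of per-slot key counts */

/* Add `delta` keys to `slot` (0-based): O(log(slots)) point update. */
static void slotCountAdd(int slot, long long delta) {
    for (int i = slot + 1; i <= NUM_SLOTS; i += i & -i)
        tree[i] += delta;
}

/* Return the 0-based slot containing the k-th key (1 <= k <= total keys),
 * walking the tree top-down, one level per iteration: O(log(slots)),
 * instead of the O(log^2(slots)) of binary-searching over prefix sums. */
static int slotForKeyIndex(long long k) {
    int pos = 0;
    for (int step = NUM_SLOTS; step > 0; step >>= 1) {
        if (pos + step <= NUM_SLOTS && tree[pos + step] < k) {
            pos += step;
            k -= tree[pos];
        }
    }
    return pos;
}

int main(void) {
    slotCountAdd(0, 5);   /* 5 keys in slot 0   -> keys 1..5 */
    slotCountAdd(100, 3); /* 3 keys in slot 100 -> keys 6..8 */
    printf("%d %d\n", slotForKeyIndex(5), slotForKeyIndex(6)); /* prints: 0 100 */
    return 0;
}
```

The top-down walk in slotForKeyIndex descends one tree level per iteration, which is where the O(log(N)) vs. O(log^2(N)) difference in the second item comes from.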
Comment From: hpatro
@roshkhatri When you publish the PR for the perf improvement, please link it here.
Comment From: roshkhatri
Yes, I will link this to the PR. Thanks!
Comment From: roshkhatri
> Batch the index tree (Fenwick tree) updates for multi-key operations instead of updating per key (https://github.com/redis/redis/pull/11695#discussion_r1292811014). Is this something we believe needs to be done now? [Madelyn votes no] MSET performance has already improved, and this enhancement can be taken up in isolation.

I ran perf (the Linux performance analysis tool) while loading ~80M keys from an RDB file, in two cases:
1. Keys spread across all slots (~85M keys).
2. All keys stored in slot 0 (~75M keys); this is the case where we update the most nodes in the Fenwick tree.

After profiling, I found that cumulativeKeyCountAdd does not take up any significant amount of processing time.
Just to be sure, I also compared the time taken to load the RDB file on the unstable branch and with the optimization. There was no significant difference; below are the results of a couple of runs:
- Unstable branch:
DB loaded from disk: 52.791 seconds
DB loaded from disk: 51.880 seconds
- With optimization:
DB loaded from disk: 51.326 seconds
DB loaded from disk: 50.469 seconds
I don't think we would see a significant performance improvement from batching the update operations on the Fenwick tree.
Comment From: hpatro
Thanks for benchmarking this @roshkhatri.
@madolson @oranagra Running perf while an RDB load is in progress, we don't see many samples for cumulativeKeyCountAdd. This also strongly suggests that other code paths like MULTI/EXEC and Lua would see close to negligible gains. I think we could close this issue if there are no concerns from your side.
Comment From: oranagra
Just to be sure we're discussing the same thing.
The update is O(log(slots)), and in an MSET or MULTI it'll be O(N * log(slots)).
And we're saying it's negligible. Right?
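For a rough sense of scale (assuming the standard 16384 cluster hash slots): a single point update touches at most log2(16384) = 14 tree nodes, so even an MSET of 1,000 keys adds on the order of 14,000 integer increments on top of the per-key dictionary and object work, which is consistent with it being negligible.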
Comment From: roshkhatri
@oranagra Yes, the update for an RDB load will also be O(N * log(slots)), where N is the number of keys, and it seems to be negligible. I also tried optimizing it and could not see any difference in the RDB loading time.
Comment From: oranagra
Last concern before we dismiss it: maybe an RDB load is not a good way to test it? Maybe that test is disk-bound, and optimizations won't be reflected in faster loading. Maybe it should be tested with MSET using memtier_benchmark or the like? Even then, it's all too easy to get it wrong and have the client or network be the bottleneck, but in that case, in addition to throughput, we can measure CPU usage, and maybe that will reflect the difference.
Comment From: hpatro
@oranagra Theoretically it would of course be a performance bump, but I think it would complicate the logic: we would have to skip the update within each command execution based on some flag (https://github.com/redis/redis/blob/unstable/src/db.c#L281, https://github.com/redis/redis/blob/unstable/src/db.c#L568). My suggestion would be to close this out as is.
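Purely to illustrate the kind of change being weighed (hypothetical names, not the actual db.c code, building on the slotCountAdd sketch earlier in this issue): deferral would mean every place that changes a slot's key count goes through an extra indirection, plus a flush step that deferring callers must not forget.

```c
#include <string.h>

#define NUM_SLOTS 16384

/* Assumed to exist for this sketch: the O(log(slots)) point update above. */
void slotCountAdd(int slot, long long delta);

static long long pendingDelta[NUM_SLOTS]; /* per-slot deltas accumulated while deferring */
static int deferSlotUpdates = 0;          /* hypothetical version of the flag mentioned above */

/* Every code path that adds or removes a key would have to go through this
 * instead of updating the tree directly. */
void slotKeyCountChanged(int slot, long long delta) {
    if (deferSlotUpdates)
        pendingDelta[slot] += delta;  /* O(1) bookkeeping */
    else
        slotCountAdd(slot, delta);    /* immediate O(log(slots)) update */
}

/* ...and every deferring caller (MULTI/EXEC, scripts, RDB load) would have
 * to remember to flush at the right point. */
void slotKeyCountFlush(void) {
    for (int slot = 0; slot < NUM_SLOTS; slot++)
        if (pendingDelta[slot]) slotCountAdd(slot, pendingDelta[slot]);
    memset(pendingDelta, 0, sizeof(pendingDelta));
}
```

That extra state and the flush obligation is the complication being pointed out, for what the benchmarks above suggest is a negligible win.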
Comment From: oranagra
I think the point is to discuss the complexity after we know the performance impact. @roshkhatri published a test indicating it has no benefit, so I just think we should validate that it wasn't a bad test, and then we can decide how to proceed.