Redis [NEW]Scan for hash fields only

The problem/use-case that the feature addresses

Is it possible to scan the hash field only, value is not necessary

Description of the feature

Maybe HKSCAN, scans the hash fields only, not loading the hash value, the value may be large

Alternatives you've considered

Additional information

Comment From: sundb

Why no use HKEY command?

Comment From: WeihanLi

Why no use HKEY command?

There may be too many fields in the hash, the hash fields we needed may be only about 1~10%, so we prefer a pattern-matching way to get the fields, and the key value may be very large, we generally avoid reading unnecessary field value

Comment From: sundb

seems that the existing commands aren't sufficient for your needs, perhaps we can extend HKEY command with pattern matching. like KEY command:

HKEY key [MATCH pattern]

ping @redis/redis-committers, please share your thoughts.

Comment From: WeihanLi

HKEY key [MATCH pattern]

That's also a great candidate, thanks

Comment From: zuiderkwast

It's not good to use HKEYS for very large hashes because it's O(N) just like KEYS.

I prefer HKSCAN. Another possibility is HSCAN with a new argument like NOVALUES.

Comment From: zuiderkwast

We may want the same for ZSCAN, to return the elements without their score.

Comment From: zuiderkwast

Do we want to avoid adding arguments that affect the return type of a command? I don't remember. @guybe7 do you know?

Comment From: oranagra

i think adding a [NOVALUES] argument to HSCAN makes sense.

Comment From: guybe7

@zuiderkwast yes, that is preferable IMHO

Comment From: zuiderkwast

@guybe7 so if we add [NOVALUE] to HSCAN, it changes the return structure. If you want to avoid that, I would think you prefer a separate HKSCAN?

That's my only concern. Personally, I don't mind adding this. We already have a lot of WITHSCORE and other arguments affecting the return value of various commands.

Comment From: madolson

@WeihanLi Can you document the use case you are trying to solve here? We discussed this and weren't entirely sure what the material motivation behind having this type of functionality.

Comment From: WeihanLi

@madolson thanks, in our case, we want to get the key list that matches a specific pattern in the hash, the values in hash may be very large, we want to avoid reading values to avoid unnecessary memory usage.

Comment From: madolson

@WeihanLi Thanks for the high level, some specific followup questions: 1. What are you doing with the keys after you have extracted from the hash? I'm trying to understand if there is a different data model that also solves your use case. 2. Why HSCAN as opposed to adding the filtering on HKEYS (It's discussed earlier but I want to understand if that would be a better fit for you). Are these collections very large? Why isn't consistency important for your specific use case?

Comment From: WeihanLi

I'm building a cache aside framework, which would load data into the memory cache when the app starts and keep up to date with the Redis stream updates, currently, we share the stream for all key updates, when the update received from the Redis stream, we may need to confirm it's in our watch list before accepting the update.

What are you doing with the keys after you have extracted from the hash?

Firstly, we need the keys to load the cache from Redis Hash, there may be multiple patterns so the key list may have duplicates, so we may want to get the key list in a memory-based HashSet.

Secondly, we need the key list to judge whether the update is in our watching key list, if not, we not, we just ignore the update

Why HSCAN as opposed to adding the filtering on HKEYS (It's discussed earlier but I want to understand if that would be a better fit for you). Are these collections very large? Why isn't consistency important for your specific use case?

Yeah, the HashSet is large in our cases, over 40,000 fields each hash, and about 40 * 200 hash. Personally, I may want to propose the HKSCAN one, since it's a new command it should not have compatible issues. But if others work greatly without compatible issues, I have no preference.

Comment From: soloestoy

Note: The time complexity of HKEYS is O(n). When there are a large number of fields, it can lead to slow queries and block Redis.

On our platform, we typically recommend users to replace KEYS/HKEYS with SCAN/HSCAN to avoid slow queries.

Comment From: WeihanLi

Note: The time complexity of HKEYS is O(n). When there are a large number of fields, it can lead to slow queries and block Redis.

Yep, that's why we want a HKScan instead of filtering manually the result of HKeys

Comment From: madolson

Ok, just to make sure I'm reading this right, when an update comes in on the stream, you want to do O(N) amount of work (possibly over multiple calls) to check some value from the update against every key name in the hash, to validate that it is actually being watched. It seems inefficient, but I agree there is no better way to do the partial matching since we have no semi-ordered data structure today.

Comment From: WeihanLi

when an update comes in on the stream, you want to do O(N) amount of work (possibly over multiple calls) to check some value from the update against every key name in the hash, to validate that it is actually being watched

To be more specific, we would read all watched hash field names into an in-memory HashSet when app starts, and the Redis Stream message contains the field name and hash key name to be updated, we expect the validation to be O(1)

Comment From: madolson

when an update comes in on the stream, you want to do O(N) amount of work (possibly over multiple calls) to check some value from the update against every key name in the hash, to validate that it is actually being watched

To be more specific, we would read all watched hash field names into an in-memory HashSet when app starts, and the Redis Stream message contains the field name and hash key name to be updated, we expect the validation to be O(1)

Then I guess I'm still a little confused what the scan is for, since scanning is O(N) for the new of keys in the hash. If you have O(1) validation, what is the O(N) scanning needed for?

Comment From: WeihanLi

what the scan is for, since scanning is O(N) for the new of keys in the hash. If you have O(1) validation, what is the O(N) scanning needed for?

When the app starts, We need the scan to get all matched hash fields into in memory HashSet so that We could Do O(1) matching

Comment From: madolson

When the app starts, We need the scan to get all matched hash fields into in memory HashSet so that We could Do O(1) matching

Got it. So the use case doesn't seem trivial to solve without spending memory and double storing the fields in the hash and a separate set.

One concern I have is that this hash scanning isn't stable across failovers (In that, your application might transparently handle the failover and continue scanning with an incorrect cursor. This might not be a concern for you if you don't have replicas, but I want to make sure we're considering the feature broadly). For this reason I'm not the biggest fan of the SCAN APIs, and would like to make sure we understand alternatives before extending their capabilities. So I would also still prefer HKEYS {pattern}.

We could mitigate the performance impact by implementing the background thread version of HKEYS, that blocks incoming writes but serializes the HKEYS response in a background thread, which would amortize the cost HKEYS and prevent tail latency spikes since other reads would be able to progress.

Comment From: soloestoy

I don't completely agree with your viewpoint. In my opinion, SCAN is a very clever choice, similar to the usage of "LIMIT m, n" in SQL. Moreover, SCAN not only solves the problem of time complexity but also addresses the issue of space complexity. It is important to note that the results of KEYS command need to be cached in the client buffer, and even if the time consumption is offloaded using a background thread, the memory consumption cannot be eliminated, which is also very memory-consuming.

Additionally, considering that hash is an unordered structure, I find the progressive scanning implementation in Redis on unordered structures to be very unique and elegant work. I really appreciate SCAN.

I don't oppose HKEYS <key> [pattern], we can have them both. On one hand, considering the completeness of commands (similar to the pair of KEYS and SCAN commands, it is better to keep the command options consistent). On the other hand, for small hash keys, the HKEYS command is also effective. However, this depends on user discretion, as they can choose different commands based on the size of the keys.

Comment From: WeihanLi

I think SCAN is good for us, HKEYS <key> [pattern] is also great. I may prefer the SCAN since the SCAN has better performance than HKEYS <key> [pattern] likes SCAN vs KEYS? If HKEYS <key> [pattern] added, also hope for the pattern support for KEYS

Comment From: zuiderkwast

@WeihanLi The sum of SCAN calls has the same performance as KEYS, but it is split into multiple calls so it doesn't block Redis for a long time.

@madolson The problem you're referring to is #12440. You have a good solution for this: to let replicas sync the hash function seed. Let's implement it, shall we?

Comment From: madolson

don't completely agree with your viewpoint. In my opinion, SCAN is a very clever choice, similar to the usage of "LIMIT m, n" in SQL.

I don't think I chose my words well. I have a very different opinion of SCAN compared to the 3 scan operations that iterate on a single key (SSCAN, HSCAN, and ZSCAN). My mental model is that keys are intended to be operated on mostly as atomic units, and are stored in such a way that they can be accessed and modified atomically. So, HSCAN feels like an anti-pattern to me, since you are doing a weird partial iteration of a hash that has some odd consistency implications. SCAN doesn't feel like an anti-pattern, since it's more of an upfront expectation that you will get some keys and you take action on them independently, so the ordering and consistency should matter less. So, that's why I'm biased towards extending HKEYS. However, HKSCAN doesn't seem that outrageous to me either.

@WeihanLi The sum of SCAN calls has the same performance as KEYS, but it is split into multiple calls so it doesn't block Redis for a long time.

The cumulative performance of all SCAN commands is actually worse than KEYS, by 1.5-2x typically, because you're constantly losing cache locality, sending a lot more small packets that need to be individually processed, and incurring a lot of overhead with small commands.

@zuiderkwast Yeah, Yossi had some concern which I forget, I think it was related to versioning. I will close on that so we can cross that off as an issue.

Comment From: WeihanLi

Oh, get it, thanks very much for the detailed clarify the SCAN performance @zuiderkwast @madolson

SCAN doesn't block Redis for a long time.

I think it's our preference because we may have a lot of clients about 800+, so we do not want the command block for a long time

Comment From: soloestoy

The difference between the design goals of developers and the actual usage methods of users is an interesting topic, but it will not be further discussed here.

Returning to this issue, the conclusion is to introduce the NOVALUE option for HSCAN and also add the ability to use a [match pattern] for HKEYS. Can we come to an agreement on this?

Comment From: madolson

Returning to this issue, the conclusion is to introduce the NOVALUE option for HSCAN and also add the ability to use a [match pattern] for HKEYS. Can we come to an agreement on this?

The only pending question I have is how serious we are about having a single command return multiple different types of responses. Adding [NOVALUE] wouldn't be strictly changing the return type, it would still be an array, but clients might choose to interpret it differently since the contract will change. As per @guybe7 comment, it's better not to change the return syntax. Which might lead us to introduce the dedicated HKSCAN command just for keys.

I ask because we have these two opposing tenets: 1. Have commands return a single coherent type. 2. Have fewer more flexible commands.

We still don't have a good way to decide between them. I still think 1 is more important than 2, since it's simpler from the client side to decide what to do.

Comment From: soloestoy

Adding [NOVALUE] wouldn't be strictly changing the return type, it would still be an array, but clients might choose to interpret it differently since the contract will change.

I don't understand why this is important, we already have many commands of this type, such as: ZRANGE [WITHSCORES], ZINTER [WITHSCORES], ZUNION [WITHSCORES], and so on.

Comment From: zuiderkwast

Sorry for being late, but I have yet another option.

There is partial compatibility between types: the ZUNION, ZINTER, ZDIFF family of commands accept regular sets in addition to a sorted sets. We could expand this to let set functions (SUNION, etc.) accept sorted sets (ignoring the scores) and even hashes (ignoring the values). The sorted set commands (ZUNION, etc.) could accept hashes if the values are numeric. If we go this way, it would be natural to let SSCAN be done on a hash or a sorted set, returning the keys only.

Comment From: madolson

I don't understand why this is important, we already have many commands of this type, such as: ZRANGE [WITHSCORES], ZINTER [WITHSCORES], ZUNION [WITHSCORES], and so on.

The goal is to be more consistent across Redis, so that code executed in one place can be fairly predictably understood in other contexts. Having commands that variadic responses breaks that.

Comment From: soloestoy

I can understand your point to some extent. If a command always produces consistent results regardless of the parameters used, the client implementation can be simplified. However, in my view, this seems overly idealistic. First of all, Redis already has many commands with parameters that yield different results. Moreover, this objective would lead Redis to lose a lot of flexibility. As a NoSQL database, Redis not only offers flexibility in data storage formats but also in data manipulation. I don't think we should impose too many restrictions in this regard, especially when compared to traditional relational databases.

Furthermore, in terms of objectives or rules, we currently don't have strict guidelines. Even among the same developer, there can be preferences and trade-offs, and different reactions when faced with different user requirements. They may even deviate from the rules they advocated before, which is quite common in the development process of Redis. Therefore, if we really want to establish a rule, I believe it should be "based on meeting the reasonable needs of the majority of users" rather than a rigid framework.

Returning to this issue, I believe this requirement is reasonable. Handling the "novalues" parameter is not a big problem for the client, similar to handling "WITHSCORES". If there is a requirement for the command's result to be consistent, an option could be to add a new command like "HSCANKEYS". However, this approach is clearly less elegant than adding a "novalues" parameter. Additionally, introducing a new command would increase the workload for the client and add to the understanding cost for users, rather than reducing them.

Comment From: soloestoy

We discussed it in the core-team meeting, and we believe it is a reasonable requirement. We have decided to support both HSCAN [NOVALUES] and HKEYS [PATTERN]. @CharlesChen888 please go ahead to finish these two.

As for ZSCAN, we haven't seen a clear demand for it yet, so we have decided to suspend it temporarily. We will continue with it when there is a reasonable use case.

Comment From: oranagra

@soloestoy i was under the impression that we only wanted HSCAN NOVALUES (without HKEYS [PATTERN]). @redis/core-team please comment.

Comment From: soloestoy

i was under the impression that we only wanted HSCAN NOVALUES (without HKEYS [PATTERN]).

I checked our doc, our decision is "Add HKEYS with pattern and HSCAN [NOVALUES]".

Comment From: oranagra

i don't know who wrote that note in the doc. @madolson maybe you can fill in that gap?

Comment From: oranagra

it was discussed again in a core-team meeting, and we decided to only go with HSCAN NOVALUES. we concluded that filtering on HKEYS is probably only relevant in small hashes (big ones will require HSCAN), and thus it may be a convenient feature (if you're lazy doing filtering on the client side), and if we consider doing that, we'd also need filtering for HGETALL, and ZSCAN.