Redis [BUG] redis-cli SCAN command in cluster mode picks nodes at random

Describe the bug

The SCAN command in cluster mode picks nodes at random.

To reproduce

redis-cli -c -h <redis-host-in-cluster-mode>
scan 0 match * count 500

Expected behavior

SCAN should report which node the command ran on, or it should refuse to run unless a node is specified.

Current behavior

Runs the SCAN command on a random node in the cluster


Comment From: judeng

Perhaps I don't understand your actual question. I just tested it, and I don't think this is a bug; the reply is not random either. When no key is specified, redis-cli stays connected to the current node, and the SCAN command behaves consistently with INFO and KEYS.

redis-cli -c -h 127.0.0.1 -p 6482
127.0.0.1:6482> scan 0
1) "0"
2) 1) "3"
127.0.0.1:6482> keys *
1) "3"
127.0.0.1:6482> set 66 123
-> Redirected to slot [5651] located at 127.0.0.1:6484
OK
127.0.0.1:6484> scan 0          <-- the prompt (127.0.0.1:6484) hints at which node SCAN runs against
1) "0"
2) 1) "c"
   2) "66"
127.0.0.1:6484> keys *
1) "c"
2) "66"

Comment From: KrishnaPravin

@judeng I am using AWS ElastiCache. I connect to the cluster using the configuration endpoint, not a node endpoint (all nodes have separate endpoints). All the nodes run on the same port (6379).

Configuration endpoint: clustercfg.redis-cluster.1111.use1.cache.amazonaws.com

Node endpoints (for 3 shards):

redis-cluster-0001-001.redis-cluster.1111.use1.cache.amazonaws.com:6379
redis-cluster-0001-002.redis-cluster.1111.use1.cache.amazonaws.com:6379
redis-cluster-0002-001.redis-cluster.1111.use1.cache.amazonaws.com:6379
redis-cluster-0002-002.redis-cluster.1111.use1.cache.amazonaws.com:6379
redis-cluster-0003-001.redis-cluster.1111.use1.cache.amazonaws.com:6379
redis-cluster-0003-002.redis-cluster.1111.use1.cache.amazonaws.com:6379

In your example, I think you connected to the nodes directly (running on different ports) and ran the SCAN command.

When you use the SET command while connected to the node on port 6482, you are still able to write to the node on port 6484. But when you ran the SCAN command on node 6482, it did not return "c", which was present on node 6484. It returned only "3".

In cluster mode, KEYS and SCAN behave differently from SET and GET: KEYS and SCAN only see the node you are connected to, while SET and GET are redirected to the node that owns the key.

When I connect directly to the node, I will not expect the commands to work across the cluster. But I am using the base URL for the cluster (from AWS Elasticache) -- which does not point to a node.

I did not mean a random reply; I meant random nodes.

If I connect to a Redis cluster using a base cluster URL and run the SCAN or KEYS command, I want to know that it does not consider all nodes. Any of the following would help:

1. Send the node information along with the SCAN result, as a 3rd item in the array, or
2. Make specifying a node a mandatory parameter when running SCAN in cluster mode, or
3. Clearly mention in the documentation that SCAN and KEYS work differently from GET and SET in cluster mode.

Comment From: itamarhaber

Hello @KrishnaPravin

IIUC, the terms "base cluster URL" and "configuration endpoint" are specific to AWS. When you connect using these, you apparently get a "random node", hence your experience.

That said, the OSS Redis project doesn't support this functionality at the moment, only direct node connections. Therefore I don't see a reason for changing the command, redis-cli, or the docs to accommodate this provider-specific behavior.

Comment From: KrishnaPravin

@itamarhaber I understand that in OSS redis there is nothing called "base cluster URL".

But I also don't accept that OSS redis-cli supports only direct node connections.

If that is true, why does the connection to Redis (AWS ElastiCache) using the "base cluster URL" work in redis-cli? The connection attempt should fail with an error saying the URL does not belong to a node.

Currently, it does not fail. When I use the "base cluster URL" in redis-cli it does connect to a random node in that cluster. So you do support provider-specific behavior.

Comment From: soloestoy

I think you can submit a ticket to AWS, or I can help ping @madolson : )

Comment From: KrishnaPravin

@soloestoy What should I ask AWS to do?

Comment From: madolson

@KrishnaPravin So, I'll give you some thoughts:

1. I don't know if I really agree with itamar about "direct node access only". Generally we have a notion of "seed nodes", which we use for cluster topology discovery. redis-cli also does this for management operations in certain circumstances, and clients do it extensively. I know of other Redis users who throw some percentage of nodes behind a DNS name and throw debugging stuff at it; this is pretty common in K8s workloads. I think the CLI should support this, but I don't think the current behavior is "wrong": it is doing literally what you are telling it to do. FWIW, you can pretty easily write a script that calls CLUSTER SHARDS, collects all the IPs, and then scans them individually.
2. We kind of need this, https://github.com/redis/redis/issues/2702, which will actually respect slot boundaries and allow you to do a complete incremental cluster scan. I think this is likely the better solution to your problem.
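
As a rough illustration of the script suggested in point 1, here is a minimal Python sketch that discovers the primaries from one seed node and then scans each of them individually. It is not an official tool. It swaps CLUSTER SHARDS for CLUSTER NODES, because the line-based CLUSTER NODES output is easier to parse, and it relies on redis-cli's `--scan` and `--pattern` options. The function names (`run_cli`, `parse_primaries`, `scan_cluster`) are made up for this example, and it assumes the addresses reported by the cluster are reachable from where the script runs, without TLS or authentication.

```python
import subprocess


def run_cli(argv):
    """Run redis-cli with the given arguments and return its stdout as text."""
    result = subprocess.run(["redis-cli", *argv], check=True,
                            capture_output=True, text=True)
    return result.stdout


def parse_primaries(cluster_nodes_output):
    """Extract (host, port) for every non-failed primary from CLUSTER NODES text."""
    primaries = []
    for line in cluster_nodes_output.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        flags = fields[2].split(",")
        if "master" not in flags or "fail" in flags or "noaddr" in flags:
            continue  # replicas, failed nodes, and nodes without an address
        # The address looks like ip:port@cport[,hostname]; keep only ip:port.
        address = fields[1].split("@")[0]
        host, port = address.rsplit(":", 1)
        primaries.append((host, int(port)))
    return primaries


def scan_cluster(seed_host, seed_port, pattern="*", run=run_cli):
    """Yield (node, key) for every matching key on every primary.

    `run` is injectable so the logic can be exercised without a live cluster.
    Keys that contain newlines are not handled by this sketch.
    """
    topology = run(["-h", seed_host, "-p", str(seed_port), "CLUSTER", "NODES"])
    for host, port in parse_primaries(topology):
        # --scan makes redis-cli iterate SCAN on that one node until it is done.
        output = run(["-h", host, "-p", str(port), "--scan", "--pattern", pattern])
        for key in output.splitlines():
            if key:
                yield f"{host}:{port}", key
```

Each yielded pair carries the node address, so the caller always knows which node a key came from, which is the information the original report asked for.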

Comment From: KrishnaPravin

@madolson

1. Yes, in the end I scanned all of them individually, one by one.
2. Better SCAN support would be helpful. The documentation for the SCAN command also needs to be updated.
3. The KEYS command also behaves like SCAN in cluster mode. KEYS and any other commands that behave like SCAN in cluster mode also need better cluster support.

Comment From: madolson

Ok, my suggestion here is that we do three things:

1. ~~Update the documentation around SCAN for usage in cluster mode.~~
2. Support an incremental scan operation in the cli.
3. Implement the suggestion I have in the CR, which is an iterator which is truly global and will eventually return all keys.

Comment From: soloestoy

IIUC, the scenario is as follows: the user may have written a script that starts redis-cli, executes the SCAN command, and records the returned cursor. The next time the script starts redis-cli, it uses this cursor value to continue the traversal. Each time redis-cli is started, a new connection has to be established. However, the AWS cluster URL points to a random backend node, so the new connection may land on a different node than before. As a result, the previously recorded cursor value becomes invalid.

So, the key point is that in the SCAN command, we maintain the state by cursor rather than by connection. But I don't think it is OSS Redis' problem.
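
One way a script could cope with this, sketched in Python under some assumptions: store the node address together with the cursor, and always resume against that same node endpoint instead of the cluster URL. The checkpoint layout and the names `run_cli` and `scan_step` are hypothetical. The parsing assumes redis-cli's raw output when stdout is not a terminal, where the first line is the next cursor and the remaining lines are keys.

```python
import subprocess


def run_cli(argv):
    """Run redis-cli with the given arguments and return its stdout as text."""
    result = subprocess.run(["redis-cli", *argv], check=True,
                            capture_output=True, text=True)
    return result.stdout


def scan_step(checkpoint, pattern="*", count=500, run=run_cli):
    """Run a single SCAN call on the node recorded in the checkpoint.

    `checkpoint` is a dict with "host", "port" and "cursor" (hypothetical layout).
    Returns (keys, next_checkpoint); next_checkpoint is None once the cursor
    returns to "0", meaning this node has been fully iterated.
    """
    argv = ["-h", checkpoint["host"], "-p", str(checkpoint["port"]),
            "SCAN", checkpoint["cursor"], "MATCH", pattern, "COUNT", str(count)]
    lines = run(argv).splitlines()
    cursor = lines[0]
    keys = [key for key in lines[1:] if key]  # an empty batch shows up as a blank line
    if cursor == "0":
        return keys, None
    # Keep host and port unchanged so the cursor is only ever sent to its own node.
    return keys, {**checkpoint, "cursor": cursor}
```

The checkpoint can be written to disk (for example as JSON) between script runs; because it names a specific node, a later run cannot accidentally replay the cursor against a different one.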

Comment From: madolson

> So, the key point is that in the SCAN command, we maintain the state by cursor rather than by connection. But I don't think it is OSS Redis' problem.

I don't disagree with this; it seems like the problem can be fixed by pointing to a specific node. Do you disagree with point 2, though, namely that we should support a redis-cli extension so it can scan a whole cluster? It could discover the topology and scan each node.

Comment From: zidoo

Hi there, I ran into this issue, which should be fixed. The beauty of the Redis cluster design is that it is very transparent, and it should stay like that. Any client connected in non-cluster mode should get results only from the specific node it is connected to. In my particular scenario, we want to move from one Redis cluster to another as fast as possible and with zero downtime. That is a very reasonable scenario for many users. The best way should be:

1. Connect to each shard of the source cluster.
2. Dump keys and TTLs using a pipeline.
3. Write them to the destination cluster.
4. Extras: if the destination cluster has the same number of shards, it should be possible to write using pipelines as well.

In this case, the cluster will be utilized 100%. With a SCAN command that returns keys from everywhere, I cannot use pipelines, and parallelization is also an issue.
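
To make the steps above concrete, here is a minimal Python sketch of copying one batch of keys from a single source shard. It assumes client objects shaped like redis-py's (`pipeline`, `dump`, `pttl`, `restore`), with the source connection returning raw bytes; `copy_batch` and its parameter names are made up for this example, and error handling, batching, and parallelism across shards are left out.

```python
def copy_batch(src_node, dst_cluster, keys):
    """Copy the given keys from one source node to the destination cluster.

    `src_node` is a direct connection to one shard and `dst_cluster` is a
    cluster-aware client (both assumed to expose redis-py-style methods).
    Returns the number of keys that were actually copied.
    """
    keys = list(keys)

    # Fetch the serialized value and remaining TTL of every key in one round trip.
    pipe = src_node.pipeline(transaction=False)
    for key in keys:
        pipe.dump(key)
        pipe.pttl(key)
    replies = pipe.execute()

    copied = 0
    for index, key in enumerate(keys):
        payload = replies[2 * index]
        ttl_ms = replies[2 * index + 1]
        if payload is None or ttl_ms == -2:
            continue  # the key expired or was deleted after it was listed
        # PTTL returns -1 for "no expiry"; RESTORE expresses that as a TTL of 0.
        ttl_ms = 0 if ttl_ms < 0 else max(ttl_ms, 1)
        dst_cluster.restore(key, ttl_ms, payload, replace=True)
        copied += 1
    return copied
```

Because every key in a batch comes from the same shard, the read side needs only one pipelined round trip per batch, and one such loop can run per shard in parallel.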