Describe the bug

After a period of time, the EVAL command stops functioning correctly. Examples:

redis.call('exists', key) always returns the key name, regardless of the key's value or its existence.
redis.call('get', key) always returns the key name, regardless of the key's value or its existence.

To reproduce

We've seen this issue occur twice. We currently do not have deterministic steps to reproduce it. In one case the instance had been serving EVAL commands for roughly 10 days before the issue surfaced. It can be corrected by restarting the Redis instance.

Expected behavior

Lua script calls should adhere to documented behavior and return a valid result or an error.

Additional information

We have seen this issue occur on instances in Redis 6.0.7 and 6.0.9. We have never seen this issue occur on Redis 5.0.3.

Comment From: oranagra

@Sunbir you mean that a short-running script (one without loops, that exits after issuing a few commands) will start behaving incorrectly after running many times, and a restart of Redis solves it?

When you say it returns the key regardless of its value or its existence, what do you mean? What value does it return? How do you know the key doesn't exist? Are these volatile keys?

Comment From: Sunbir

@Sunbir you mean that a short-running script (one without loops, that exits after issuing a few commands) will start behaving incorrectly after running many times, and a restart of Redis solves it?

That is correct. In one case there was a loop (a small number of iterations < 100).

When you say it returns the key regardless of its value or its existence, what do you mean? What value does it return?

E.g. If the DB has a (key, value) pair of ('foo', 'bar')....

EVAL "return redis.call('get', 'foo')"

...would return "foo" as a result.

How do you know the key doesn't exist?

I checked replicas and they returned expected results.

Are these volatile keys?

No, I don't think key expiry was at play here.

Comment From: oranagra

So that state remains for a long time, and you can tell that this key is missing on the replica. Did you check on the master too (not using Lua)? It should be quite impossible for the key to not exist and for the script to return it anyway. What could in theory happen is that a command propagated a DEL to the replica but forgot to actually delete the key from the master. Do you know how that key was deleted? Which command was it, and was it a module, a script, a MULTI/EXEC, or just a plain DEL?

Comment From: Sunbir

Hi Oran, I think there is some confusion. GET/SET work fine. It seems the Lua interpreter enters a bad state and calls to redis.* executed by EVAL start returning wrong results. However, getting or setting keys directly always works.

Comment From: oranagra

If the DB has a (key, value) pair of ('foo', 'bar').... EVAL "return redis.call('get', 'foo')" ...would return "foo" as a result.

Sorry, I didn't notice it before: you mean that it returns the key name (as if that's the value)?

Comment From: oranagra

Next time it happens, can you try SCRIPT FLUSH instead of a restart? It should reset the Lua library; maybe something got corrupted in its stack.

Comment From: Sunbir

If the DB has a (key, value) pair of ('foo', 'bar').... EVAL "return redis.call('get', 'foo')" ...would return "foo" as a result.

Sorry, I didn't notice it before: you mean that it returns the key name (as if that's the value)?

Yep, that's one of the symptoms.

Comment From: Sunbir

We encountered the issue again and have learned a few things:

  • SCRIPT FLUSH does not fix the issue

  • Write operations performed via EVAL still work (e.g. redis.call('set', ...)):

      eval "return redis.call('set', 'foo', 'bar')" 0
      "bar"
      eval "return redis.call('get', 'foo')" 0
      "foo"
      get foo
      "bar"

  • Any read operation just returns the last argument provided:

      eval "return redis.call('ping')" 0
      "ping"
      eval "return redis.call('mget', 'foo', 'bar', 'baz')" 0
      "baz"
      eval "return redis.call('randomkey')" 0
      "randomkey"

  • A replica instance does not exhibit any issues when running read commands via EVAL. However, when we fail over the problematic instance to the replica, the replica begins exhibiting the same strange behavior.
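
One way to check for this divergence from the client side is to compare a plain GET with the same GET issued through EVAL. The following is a minimal hiredis sketch along those lines (the host, port, and key name 'foo' are placeholders, and AUTH is omitted):

    /* Minimal divergence check: compare a GET issued directly with the same
     * GET issued through EVAL. A mismatch is the symptom described above. */
    #include <stdio.h>
    #include <string.h>
    #include <hiredis/hiredis.h>

    int main(void) {
        redisContext *c = redisConnect("127.0.0.1", 6379);   /* placeholder host/port */
        if (c == NULL || c->err) {
            fprintf(stderr, "connect failed: %s\n", c ? c->errstr : "out of memory");
            return 1;
        }

        redisReply *direct = redisCommand(c, "GET %s", "foo");
        redisReply *via_eval =
            redisCommand(c, "EVAL %s 1 %s", "return redis.call('get', KEYS[1])", "foo");

        const char *a = (direct && direct->type == REDIS_REPLY_STRING) ? direct->str : NULL;
        const char *b = (via_eval && via_eval->type == REDIS_REPLY_STRING) ? via_eval->str : NULL;

        if ((a == NULL) != (b == NULL) || (a && b && strcmp(a, b) != 0)) {
            printf("MISMATCH: GET -> %s, EVAL GET -> %s\n", a ? a : "(nil)", b ? b : "(nil)");
        } else {
            printf("OK: both paths agree\n");
        }

        if (direct) freeReplyObject(direct);
        if (via_eval) freeReplyObject(via_eval);
        redisFree(c);
        return 0;
    }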

Comment From: yossigo

The internal Lua client state is not reset in this case, so if it gets into a corrupt/unexpected state (e.g. not generating replies), that could lead to the same behavior.

@Sunbir Can you provide more information like output of INFO ALL on the environment where this reproduces (not necessarily after it reproduced)?

Comment From: Sunbir

Here is the output of INFO ALL. The instance is currently not in the broken state.

Server

redis_version:6.0.9 redis_git_sha1:d316f5e1 redis_git_dirty:0 redis_build_id:bd2220b5b9e0d9ef redis_mode:standalone os:Linux 5.4.77-7.el7pie x86_64 arch_bits:64 multiplexing_api:epoll atomicvar_api:atomic-builtin gcc_version:4.2.1 process_id:1 run_id:f23797edfff86eedc27203076f4f010d04c6bd91 tcp_port:6201 uptime_in_seconds:88830 uptime_in_days:1 hz:10 configured_hz:10 lru_clock:2984645 executable:/redis-server config_file:/mnt/work/redis-10000001.conf io_threads_active:0

Clients

connected_clients:372 client_recent_max_input_buffer:8 client_recent_max_output_buffer:0 blocked_clients:0 tracking_clients:0 clients_in_timeout_table:0

Memory

used_memory:10387456 used_memory_human:9.91M used_memory_rss:36241408 used_memory_rss_human:34.56M used_memory_peak:430454712 used_memory_peak_human:410.51M used_memory_peak_perc:2.41% used_memory_overhead:9528280 used_memory_startup:804296 used_memory_dataset:859176 used_memory_dataset_perc:8.97% allocator_allocated:10601416 allocator_active:12652544 allocator_resident:20078592 total_system_memory:540677386240 total_system_memory_human:503.55G used_memory_lua:48128 used_memory_lua_human:47.00K used_memory_scripts:1216 used_memory_scripts_human:1.19K number_of_cached_scripts:4 maxmemory:1000000000 maxmemory_human:953.67M maxmemory_policy:allkeys-lru allocator_frag_ratio:1.19 allocator_frag_bytes:2051128 allocator_rss_ratio:1.59 allocator_rss_bytes:7426048 rss_overhead_ratio:1.80 rss_overhead_bytes:16162816 mem_fragmentation_ratio:3.44 mem_fragmentation_bytes:25712680 mem_not_counted_for_evict:0 mem_replication_backlog:1048576 mem_clients_slaves:20512 mem_clients_normal:7605504 mem_aof_buffer:0 mem_allocator:jemalloc-5.1.0 active_defrag_running:0 lazyfree_pending_objects:0

Persistence

loading:0 rdb_changes_since_last_save:360791 rdb_bgsave_in_progress:0 rdb_last_save_time:1613547967 rdb_last_bgsave_status:ok rdb_last_bgsave_time_sec:0 rdb_current_bgsave_time_sec:-1 rdb_last_cow_size:675840 aof_enabled:0 aof_rewrite_in_progress:0 aof_rewrite_scheduled:0 aof_last_rewrite_time_sec:-1 aof_current_rewrite_time_sec:-1 aof_last_bgrewrite_status:ok aof_last_write_status:ok aof_last_cow_size:0 module_fork_in_progress:0 module_fork_last_cow_size:0

Stats

total_connections_received:473803 total_commands_processed:3972074 instantaneous_ops_per_sec:39 total_net_input_bytes:318458296 total_net_output_bytes:1348520040 instantaneous_input_kbps:0.75 instantaneous_output_kbps:1.79 rejected_connections:0 sync_full:2 sync_partial_ok:0 sync_partial_err:1 expired_keys:3858 expired_stale_perc:0.00 expired_time_cap_reached_count:0 expire_cycle_cpu_milliseconds:3202 evicted_keys:0 keyspace_hits:43699 keyspace_misses:126646 pubsub_channels:0 pubsub_patterns:0 latest_fork_usec:2212 migrate_cached_sockets:0 slave_expires_tracked_keys:0 active_defrag_hits:0 active_defrag_misses:0 active_defrag_key_hits:0 active_defrag_key_misses:0 tracking_total_keys:0 tracking_total_items:0 tracking_total_prefixes:0 unexpected_error_replies:0 total_reads_processed:4169264 total_writes_processed:4349069 io_threaded_reads_processed:0 io_threaded_writes_processed:0 password_matches:473208 alt_password_matches:0

Replication

role:master connected_slaves:1 slave0:ip=100.102.97.219,port=6201,state=online,offset=72320086626,lag=1 master_replid:803f1dcd8263dde50cb1619d7aa7fee7fce80949 master_replid2:860fcfa3b238858dd23aceda9e97c0d501a5b84a master_repl_offset:72320086626 second_repl_offset:72090959695 repl_backlog_active:1 repl_backlog_size:1048576 repl_backlog_first_byte_offset:72319038051 repl_backlog_histlen:1048576

CPU

used_cpu_sys:381.626564 used_cpu_user:371.242911 used_cpu_sys_children:0.030339 used_cpu_user_children:0.083104

Modules

Commandstats

cmdstat_exists:calls=139010,usec=228180,usec_per_call=1.64 cmdstat_get:calls=31038,usec=1133482,usec_per_call=36.52 cmdstat_multi:calls=9,usec=3,usec_per_call=0.33 cmdstat_exec:calls=8,usec=311,usec_per_call=38.88 cmdstat_set:calls=76552,usec=239396,usec_per_call=3.13 cmdstat_del:calls=171765,usec=647106,usec_per_call=3.77 cmdstat_setex:calls=3235,usec=19639,usec_per_call=6.07 cmdstat_ttl:calls=297,usec=1562,usec_per_call=5.26 cmdstat_zadd:calls=74467,usec=505466,usec_per_call=6.79 cmdstat_psync:calls=2,usec=4119,usec_per_call=2059.50 cmdstat_auth:calls=473208,usec=2709611,usec_per_call=5.73 cmdstat_script:calls=1274,usec=675862,usec_per_call=530.50 cmdstat_replconf:calls=88258,usec=149944,usec_per_call=1.70 cmdstat_client:calls=15,usec=31,usec_per_call=2.07 cmdstat_select:calls=2,usec=1,usec_per_call=0.50 cmdstat_slaveof:calls=2,usec=167,usec_per_call=83.50 cmdstat_setnx:calls=98664,usec=407751,usec_per_call=4.13 cmdstat_zpopmin:calls=165048,usec=1042641,usec_per_call=6.32 cmdstat_eval:calls=60039,usec=3930464,usec_per_call=65.47 cmdstat_info:calls=268276,usec=9650620,usec_per_call=35.97 cmdstat_ping:calls=2320905,usec=1517116,usec_per_call=0.65

Cluster

cluster_enabled:0

Keyspace

db0:keys=556,expires=398,avg_ttl=11173207

Comment From: yossigo

@Sunbir I had another look at this but came up with nothing so far. Please update if you get any additional information or find a reliable way to reproduce this.

Comment From: Sunbir

I'm working on trying to repro this consistently.

Comment From: Sunbir

Quick update: another internal Redis deployment experienced this issue. It occurred on a replica that was made primary after the original primary instance was terminated.

Comment From: Sunbir

I have a running replica instance exhibiting this behavior. Let me know if there is any add'l information I can provide. The Redis instance is running 6.0.10, was built with Clang, and is running on Oracle Linux Server 7.9.

$ redis-cli -p -a eval "return redis.call('get','foo')" 0
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
"foo"

$ redis-cli -p -a info
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.

Server

redis_version:6.0.10 redis_git_sha1:00000000 redis_git_dirty:0 redis_build_id:75a316aa4aaf378 redis_mode:standalone os:Linux 4.14.35-1902.305.4.el7uek.x86_64 x86_64 arch_bits:64 multiplexing_api:epoll atomicvar_api:atomic-builtin gcc_version:4.9.1 process_id:44176 run_id:2902240a505fa97d77118fee1e64d0e416755c48 tcp_port:6201 uptime_in_seconds:13991 uptime_in_days:0 hz:10 configured_hz:10 lru_clock:11969830 executable:/opt/ais/local/CIEDb/redis-six-zero-ten/redis-six-zero-ten1BUILD44.18a8dcad06fc/bin/redis-server config_file:/opt/ais/local/work/redis/sharedqa_pv/101000201/redis.conf io_threads_active:0

Clients

connected_clients:875 client_recent_max_input_buffer:8 client_recent_max_output_buffer:0 blocked_clients:0 tracking_clients:0 clients_in_timeout_table:0

Memory

used_memory:186881808 used_memory_human:178.22M used_memory_rss:273817600 used_memory_rss_human:261.13M used_memory_peak:193951320 used_memory_peak_human:184.97M used_memory_peak_perc:96.36% used_memory_overhead:20711296 used_memory_startup:805144 used_memory_dataset:166170512 used_memory_dataset_perc:89.30% allocator_allocated:187873112 allocator_active:203857920 allocator_resident:209801216 total_system_memory:134798123008 total_system_memory_human:125.54G used_memory_lua:61440 used_memory_lua_human:60.00K used_memory_scripts:7664 used_memory_scripts_human:7.48K number_of_cached_scripts:14 maxmemory:8000000000 maxmemory_human:7.45G maxmemory_policy:allkeys-lru allocator_frag_ratio:1.09 allocator_frag_bytes:15984808 allocator_rss_ratio:1.03 allocator_rss_bytes:5943296 rss_overhead_ratio:1.31 rss_overhead_bytes:64016384 mem_fragmentation_ratio:1.46 mem_fragmentation_bytes:86099144 mem_not_counted_for_evict:0 mem_replication_backlog:1048576 mem_clients_slaves:0 mem_clients_normal:17872728 mem_aof_buffer:0 mem_allocator:jemalloc-5.1.0 active_defrag_running:0 lazyfree_pending_objects:0

Persistence

loading:0 rdb_changes_since_last_save:835891 rdb_bgsave_in_progress:0 rdb_last_save_time:1622568575 rdb_last_bgsave_status:ok rdb_last_bgsave_time_sec:-1 rdb_current_bgsave_time_sec:-1 rdb_last_cow_size:0 aof_enabled:0 aof_rewrite_in_progress:0 aof_rewrite_scheduled:0 aof_last_rewrite_time_sec:-1 aof_current_rewrite_time_sec:-1 aof_last_bgrewrite_status:ok aof_last_write_status:ok aof_last_cow_size:0 module_fork_in_progress:0 module_fork_last_cow_size:0

Stats

total_connections_received:279016 total_commands_processed:12882283 instantaneous_ops_per_sec:966 total_net_input_bytes:1884232603 total_net_output_bytes:149108049 instantaneous_input_kbps:120.14 instantaneous_output_kbps:8.12 rejected_connections:0 sync_full:0 sync_partial_ok:0 sync_partial_err:0 expired_keys:0 expired_stale_perc:0.00 expired_time_cap_reached_count:0 expire_cycle_cpu_milliseconds:0 evicted_keys:0 keyspace_hits:318653 keyspace_misses:835196 pubsub_channels:0 pubsub_patterns:0 latest_fork_usec:0 migrate_cached_sockets:0 slave_expires_tracked_keys:0 active_defrag_hits:0 active_defrag_misses:0 active_defrag_key_hits:0 active_defrag_key_misses:0 tracking_total_keys:0 tracking_total_items:0 tracking_total_prefixes:0 unexpected_error_replies:0 total_reads_processed:12634954 total_writes_processed:11483259 io_threaded_reads_processed:0 io_threaded_writes_processed:0 password_matches:279003 alt_password_matches:0

Replication

role:slave master_host:pv44q01if-ztei07033301.pv.if1.apple.com master_port:6201 master_link_status:up master_last_io_seconds_ago:0 master_sync_in_progress:0 slave_repl_offset:124158236026 slave_priority:100 slave_read_only:1 connected_slaves:0 master_replid:d4f59a1ca4505ceda348994d61f53b89dba3e379 master_replid2:0000000000000000000000000000000000000000 master_repl_offset:124158236026 second_repl_offset:-1 repl_backlog_active:1 repl_backlog_size:1048576 repl_backlog_first_byte_offset:124157187451 repl_backlog_histlen:1048576

CPU

used_cpu_sys:285.335566 used_cpu_user:752.527966 used_cpu_sys_children:0.000000 used_cpu_user_children:0.000000

Modules

Cluster

cluster_enabled:0

Keyspace

db0:keys=16483,expires=1639,avg_ttl=0

Comment From: MeirShpilraien

Hey @Sunbir, I would like to try to extract some information while the issue is happening. Do you have debug symbols on Redis? If you do, can we run a gdb command while the issue is happening to extract some information? Or, even better, can we get a core dump while the issue is happening to analyze offline? If this is not possible, can I give you a private branch to run that contains some extra logs to help identify the issue?

Comment From: Sunbir

We use the default build configuration with no add'l flags (% make BUILD_TLS=yes) so I believe debug symbols have been stripped.

Yes, we can run the private branch in a deployment where we've noticed this issue happen. We could also build it with debug symbols. Just let us know the details. Thanks!

Comment From: MeirShpilraien

@Sunbir if you compiled normally (make BUILD_TLS=yes) you should have the symbols, so we can run the gdb command (if possible, it might be even better to compile without optimizations using make noopt BUILD_TLS=yes). The next time this happens, it would be great if you could run this command and share the output:

gdb -p `pgrep -f redis-server` -ex="p server->lua_client->flags" --batch

(This assumes you have only one redis-server process; if you have more than one, please put the relevant PID after the -p option.)

Comment From: Sunbir

Here's the output from a replica instance exhibiting the issue:

$ sudo gdb -p $pid -ex="p server->lua_client->flags" --batch
[New LWP 44303]
[New LWP 44182]
[New LWP 44181]
[New LWP 44180]
[New LWP 44179]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f1ffadd3ca7 in __memcmp_sse4_1 () from /lib64/libc.so.6
$1 = 320
[Inferior 1 (process 44176) detached]
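
For reference, the value 320 can be decoded against the client flag bits. The following is a minimal sketch assuming the bit positions used in Redis 6.0's server.h (verify them against the exact source tree in use):

    /* Decode flags=320 from the gdb output above, assuming the Redis 6.0
     * client flag bit positions (copied here for illustration only). */
    #include <stdio.h>

    #define CLIENT_CLOSE_AFTER_REPLY (1<<6)   /* 64  */
    #define CLIENT_LUA               (1<<8)   /* 256 */

    int main(void) {
        unsigned long long flags = 320;   /* value printed by gdb */
        printf("CLIENT_LUA: %s\n",
               (flags & CLIENT_LUA) ? "set" : "clear");
        printf("CLIENT_CLOSE_AFTER_REPLY: %s\n",
               (flags & CLIENT_CLOSE_AFTER_REPLY) ? "set" : "clear");
        /* 320 == 256 | 64, i.e. CLIENT_LUA | CLIENT_CLOSE_AFTER_REPLY */
        return 0;
    }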

Comment From: MeirShpilraien

@Sunbir so at least we have a clue now as to why this happens. We can see from the information you provided that the lua_client has a flag set: CLIENT_CLOSE_AFTER_REPLY. Once this flag is set, Redis will not write replies to the client buffer (because it thinks the client is going to be disconnected). No reply means nothing is pushed onto the Lua stack, so the return value ends up being whatever is left on the stack, i.e. the last argument you passed to the call function (a minimal sketch of this mechanism follows the list below). I tried to find how this flag could have been set on the Lua client but could not find any way this could happen. We can continue in 3 directions:

  1. I can give you a branch to run that asserts that, when this flag is set, we are not using the Lua client. This will catch it the next time it happens and give us a fully detailed crash report. But it will crash the server, so if this is not acceptable we cannot try this approach and should go with option 2.
  2. I can give you another branch that will not crash, but will instead print the full report (including the trace) to the log file and continue. The downside of this approach is that it will take me longer to prepare, so if option 1 is possible I prefer it over this one.
  3. If you can share your Lua script with us I can try to look at it and maybe reproduce it locally.

Let us know which one you prefer.
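
To illustrate the mechanism described above with a standalone example (plain Lua C API, not Redis source): if the C binding behind a call-like function consumes its arguments but never pushes a reply, Lua hands back whatever is left on top of the stack, i.e. the last argument that was passed in, which matches the "returns the last argument" symptom reported earlier.

    /* Standalone sketch (not Redis source): a C function that claims one
     * result but never pushes a reply. Lua then returns the value left on
     * top of the stack, which is the last argument the caller passed in.
     * Build (library name/paths vary): cc sketch.c -llua -lm */
    #include <stdio.h>
    #include <lua.h>
    #include <lauxlib.h>
    #include <lualib.h>

    /* Stand-in for a redis.call()-like binding whose reply was suppressed. */
    static int broken_call(lua_State *L) {
        (void)L;    /* the arguments stay on the stack untouched */
        /* A correct binding would push the command's reply and return 1;
         * this one "forgets" to push, so the top of the stack (the last
         * argument) becomes the result. */
        return 1;
    }

    int main(void) {
        lua_State *L = luaL_newstate();
        luaL_openlibs(L);
        lua_pushcfunction(L, broken_call);
        lua_setglobal(L, "call");
        luaL_dostring(L, "print(call('get', 'foo'))");  /* prints "foo" */
        lua_close(L);
        return 0;
    }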

Comment From: Sunbir

Hi Meir, sorry for the delayed response. We are validating some things internally here... this may not be a Redis issue. Once we confirm whether the issue is on our end, I'll post an update so you can close this.

Comment From: MeirShpilraien

@Sunbir thanks, but I do believe it's a Redis issue if you managed to get this flag turned on for the Lua client (this should not happen, and if it does then it's a bug, unless you changed the Redis source?)

Comment From: MeirShpilraien

@Sunbir any news?

Comment From: Sunbir

Yes, this was the result of a patch we had... we wanted to close connections on an instance that had failed over to become a replica. An error in that logic caused this issue. Apologies for the time spent here. Please close/resolve this as needed. Thanks for your assistance in pinpointing the exact cause!

Comment From: madolson

😱 Better a bug on your end than a bug on our end!