We have a Redis Cluster in our production environment; each node has maxmemory = 5G and runs in Docker. Recently we found that clients (using Jedis) get a lot of timeout errors during a specific time period. Here are several observations.

1: the timeouts happen instantaneously and are usually concentrated on one cluster node; the next burst of timeouts may hit a different node.

2: when the timeouts happen, the host records TCP ListenOverflows and ListenDrops (somaxconn has already been raised to 65535).

3: used_memory_rss suddenly drops (by about 300 MB) every time the timeouts happen, but used_memory does not change. (This is the most suspicious part; we guess this drop causes the timeouts, perhaps due to defragmentation?) mem_fragmentation_ratio is > 1.5 on each node, but when we replaced the original node with a new one, the problem still occurred on the new node even with mem_fragmentation_ratio under 1.1. (A simple way to correlate these numbers with the timeouts is sketched after this list.)

4: physical memory is sufficient and no swap is used. used_cpu_sys and used_cpu_user show almost no fluctuation. No AOF and no RDB.

5: we import a large number of keys into the cluster every morning, but the keys do not all expire at the same time. A check with redis-cli --bigkeys did not find any big key.
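Not an authoritative diagnosis, but here is a minimal monitoring sketch (assuming the node listens locally on port 6379; adjust host/port for your setup) to correlate the RSS drop and the listen-queue pressure with the moment the client timeouts appear:

    # Poll the relevant INFO memory fields once per second, timestamped.
    while true; do
        echo "--- $(date +%T)"
        redis-cli -p 6379 info memory | egrep 'used_memory:|used_memory_rss:|mem_fragmentation_ratio:'
        # Host-wide listen-queue counters; the overflow/drop lines should jump when the timeouts happen.
        netstat -s | egrep -i 'listen'
        sleep 1
    done

    # Effective backlog ceiling on the host (raised to 65535 in our case).
    cat /proc/sys/net/core/somaxconn

If the RSS drop and the ListenOverflows counter always move together, that suggests the server is blocked long enough for the accept queue to fill while the memory is being released.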

Here is the INFO output from one of the original nodes.

Server

redis_version:3.0.7
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:b155ac400ba794f5
redis_mode:cluster
os:Linux 3.10.0-229.el7.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.8.5
process_id:63
run_id:2b7497044359b44a920005ad90dd2d51b91605dd
tcp_port:6379
uptime_in_seconds:7078780
uptime_in_days:81
hz:10
lru_clock:13237774
config_file:/etc/redis/6379.conf

Clients

connected_clients:451
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

Memory

used_memory:2884907088
used_memory_human:2.69G
used_memory_rss:4900552704
used_memory_peak:4955368520
used_memory_peak_human:4.62G
used_memory_lua:34816
mem_fragmentation_ratio:1.70
mem_allocator:jemalloc-3.6.0

Persistence

loading:0
rdb_changes_since_last_save:19381351477
rdb_bgsave_in_progress:0
rdb_last_save_time:1483001404
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:4
rdb_current_bgsave_time_sec:-1
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:8
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok

Stats

total_connections_received:16595911
total_commands_processed:31640773934
instantaneous_ops_per_sec:8672
total_net_input_bytes:3972185951484
total_net_output_bytes:31879533477608
instantaneous_input_kbps:902.49
instantaneous_output_kbps:5054.55
rejected_connections:0
sync_full:3
sync_partial_ok:0
sync_partial_err:0
expired_keys:416395141
evicted_keys:0
keyspace_hits:17632621284
keyspace_misses:3021351932
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:56594
migrate_cached_sockets:0

Replication

role:master
connected_slaves:1
slave0:ip=xxxx,port=6379,state=online,offset=2942777488580,lag=0
master_repl_offset:2942777500916
repl_backlog_active:1
repl_backlog_size:134217728
repl_backlog_first_byte_offset:2942643283189
repl_backlog_histlen:134217728

CPU

used_cpu_sys:475533.16
used_cpu_user:263417.94
used_cpu_sys_children:498.03
used_cpu_user_children:1399.48

Cluster

cluster_enabled:1

Keyspace

db0:keys=3873048,expires=581881,avg_ttl=4266465

Maybe the way we use Redis causes the problem, but what is the root cause, and how can we avoid it?

Thanks.

Comment From: oranagra

Sounds like this might be related to the allocator purging unused pages (returning them back to the OS). I would have suggested monitoring the allocator info and seeing what happens at that time, but sadly Redis 3.0 doesn't have the capability of printing it. If you upgrade to Redis 3.2, or backport these few lines of code, you can use DEBUG JEMALLOC INFO, then look at the line that prints the [allocated, active, metadata, resident] memory and see what happens at that time. Another thing you can try is to monitor the /proc/<pid>/smaps file, but I doubt this would lead to anything useful.
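For reference, a rough sketch combining both suggestions (assuming a Redis 3.2 node, or 3.0 with the patch backported, listening locally on 6379, and a single redis-server process on the host; the exact output format of DEBUG JEMALLOC INFO may differ between builds):

    # Dump jemalloc's own counters every few seconds (Redis 3.2+, or 3.0 with the backport).
    while true; do
        date +%T
        redis-cli -p 6379 debug jemalloc info | egrep -i 'allocated|active|metadata|resident'
        # RSS as the kernel sees it, summed over all mappings of the redis-server process.
        awk '/^Rss:/ {sum += $2} END {print "smaps Rss(kB): " sum}' /proc/$(pidof redis-server)/smaps
        sleep 5
    done

A large gap between "active" and "resident" that shrinks exactly when the timeouts occur would point at the allocator purging pages back to the OS.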

Comment From: database-on-line

Have you solved this problem? I also encountered the same problem.

Comment From: oranagra

@database-on-line with which version? can you post details?

Comment From: database-on-line

> @database-on-line with which version? can you post details?

Redis 2.8.14, maxmemory 40GB.

When used_memory_rss drops by about 2.5G (30.7G -> 28.3G), mem_fragmentation_ratio goes from 1.52 to 1.39 while used_memory stays at 20.3G; then we get "connect timed out" errors and the connection count jumps (69 -> 957).

Comment From: oranagra

that piece of software is 10 years old.

Comment From: database-on-line

> that piece of software is 10 years old.

Thank you, I know, but I don't know how to reproduce this problem, so I don't know which version would resolve it.