Describe the bug
This is the same issue that Dimitry reported earlier, and I do not see any answer to it: Cluster nodes report loopback addresses #4094
Reference: https://github.com/redis/redis/issues/4094
We are running a cache in Yahoo Mail, built on Redis Cluster with a web service that talks to the Redis cluster. Each Linux box runs Redis instances (part of the cluster), the web service, or both. The web service receives heavy traffic and the boxes are heavily loaded. Both Redis and the web service are highly tuned.
We have seen an issue where the CLUSTER NODES command reports a loopback address, which is the IP of the VIP (load balancer). This config corruption throws the Lettuce client off balance and we start seeing errors. Do you think the cluster config corruption might be due to a bind issue, the heavy load, or something else? Attached is a snippet of the client connection log. Please let me know if you need additional information or logs; I can collect and send them.
We are running bind on the server side.
Comment From: yossigo
@ashutoshvsingh Can you please include the logs you mentioned? Also, redis.conf could be useful.
Comment From: ashutoshvsingh
We are adding more information. Let us know if it would help to go on hangout or zoom, we can do that too.
Comment From: ashutoshvsingh
Here is the content of nodes-6002.conf:

dd890ed155250be49c4ddc67afc05239cccfcfd1 10.249.249.218:6001@16001 master - 0 1611195926523 7 connected 3277-4914
76d10017755938823a63c0b2e4bd04ca5775c26c 10.249.250.26:6001@16001 master - 0 1611195927000 1 connected 0-1637
5ed16aeed778f991981a2c53c5471c323f5b3e13 10.250.26.80:6003@16003 slave 474eb1fbc8a6fa7d36e1c38238e4bc5203b043a6 0 1611195930000 25 connected
db5bb05bc8fded8253aaed96d7f950b5c6241075 10.249.227.133:6003@16003 slave e0c8bf525629f6921bcada858a66bdf7330ee250 0 1611195925117 19 connected
6d1ce26e33d2a7c29b619cd72088d6e6436a766f 10.249.227.133:6001@16001 master - 0 1611195928129 10 connected 4915-6553
329cc2f9519513f67c75303282ef899b20cdf149 10.249.250.28:6002@16002 slave 76d10017755938823a63c0b2e4bd04ca5775c26c 0 1611195927123 5 connected
ca135d2345477a7eb1300641349a3e1b624ee505 10.249.250.26:6002@16002 slave cf8383145f603e08a0f8e81f5d8ab193c424ea6c 0 1611195930000 4 connected
e0c8bf525629f6921bcada858a66bdf7330ee250 10.250.26.80:6001@16001 master - 0 1611195929536 19 connected 9830-11468
c2ed05c52f16de98a54416bd37e470a2001abc54 10.249.227.159:6002@16002 slave dd890ed155250be49c4ddc67afc05239cccfcfd1 0 1611195927000 14 connected
6512938a6b1b460e34512f6906935357b211592e 10.250.26.80:6002@16002 slave 6d1ce26e33d2a7c29b619cd72088d6e6436a766f 0 1611195927000 20 connected
78ba04f5310dbe05fb68fb5d5a7bc108d16e7e0d 10.250.31.144:6002@16002 slave 903454cef6821006a9ae7bcc6e2588102aa353fa 0 1611195926123 26 connected
17d02472f7bdd19e0bc1fc9fccc62e66e12e3ea6 10.194.198.183:6002@16002 myself,slave 903454cef6821006a9ae7bcc6e2588102aa353fa 0 1611195927000 29 connected
903454cef6821006a9ae7bcc6e2588102aa353fa 10.249.227.159:6001@16001 master - 0 1611195930000 13 connected 6554-8191
df36e7228c8a2a9a21320801fcd90bdc4cac3513 10.249.227.133:6002@16002 slave cf8383145f603e08a0f8e81f5d8ab193c424ea6c 0 1611195927000 11 connected
95dcb7a7b6ee50c65a304fbb5e11c6410d3eb669 10.249.227.159:6003@16003 slave e844cb43011a7beb8c01cee0144a621a66979c59 0 1611195930000 22 connected
cf8383145f603e08a0f8e81f5d8ab193c424ea6c 10.249.250.28:6001@16001 master - 0 1611195926123 4 connected 1638-3276
e9dec275f248435ab1a7a9fa163bbb042c1c17bd 10.249.250.28:6003@16003 master - 0 1611195929000 31 connected 8192-9829
c3a91feea2861b820f6040d023ab5f63b3f69236 10.249.210.197:6002@16002 slave dd890ed155250be49c4ddc67afc05239cccfcfd1 0 1611195927970 17 connected
474eb1fbc8a6fa7d36e1c38238e4bc5203b043a6 10.250.31.144:6001@16001 master - 0 1611195924000 25 connected 13107-14745
11e53d15fae961271e732967733173c449c2f618 10.250.31.144:6003@16003 slave 5f06060f61e56e30bb40aa1fb6c5e922e3bd9fa7 0 1611195926000 28 connected
f7995fa81595fc586611c540cfe9733722c4dda9 10.250.26.81:6002@16002 slave 6d1ce26e33d2a7c29b619cd72088d6e6436a766f 0 1611195927000 23 connected
48f9357da5cb5e94f945ce32402e854e49062010 10.249.210.197:6003@16003 slave e844cb43011a7beb8c01cee0144a621a66979c59 0 1611195930634 22 connected
5f06060f61e56e30bb40aa1fb6c5e922e3bd9fa7 10.250.27.16:6001@16001 master - 0 1611195929132 28 connected 14746-16383
89919a8a210db7cf8ab6ea737761c4030dce589e 10.249.249.218:6003@16003 slave e0c8bf525629f6921bcada858a66bdf7330ee250 0 1611195928000 19 connected
e844cb43011a7beb8c01cee0144a621a66979c59 10.250.26.81:6001@16001 master - 0 1611195927000 22 connected 11469-13106
037616381c463d2898a4bccf72d4eaa437a2ea5b 10.250.26.81:6003@16003 slave 474eb1fbc8a6fa7d36e1c38238e4bc5203b043a6 0 1611195930000 25 connected
d05f1f4b6e0d264a6eca950182336b7ecdb2244c 10.249.210.197:6001@16001 slave e9dec275f248435ab1a7a9fa163bbb042c1c17bd 0 1611195930134 31 connected
9c6c125563e5cb74ce6930569903bfaffa9c8987 10.250.27.16:6003@16003 slave 5f06060f61e56e30bb40aa1fb6c5e922e3bd9fa7 0 1611195927000 30 connected
43e5f9e6b76a906c1d2c0c3c0e4b4bd83ff7d57c 10.249.249.218:6002@16002 slave 76d10017755938823a63c0b2e4bd04ca5775c26c 0 1611195927000 8 connected
ec8614030139c3a232c646d48eb6866ee7a3b15a 10.249.250.26:6003@16003 slave e9dec275f248435ab1a7a9fa163bbb042c1c17bd 0 1611195929000 31 connected
vars currentEpoch 31 lastVoteEpoch 0
10.194.198.183 is the IP of the load balancer. I am not sure how this showed up in nodes.conf.
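For reference, a quick way to check which address a node is currently advertising for itself is to look at the "myself" line of CLUSTER NODES on that host. The port and password placeholders below are just examples for the 6002 instance shown above:

redis-cli -p 6002 -a <password> cluster nodes | grep myself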
Comment From: madolson
Any update on the logs? Can you also describe what the load balancer is doing in this situation?
My hunch is that for some reason the node is either processing gossip about itself or somehow trying to MEET with itself.
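Purely as an illustration of that hunch (not something confirmed to have happened here), a MEET aimed at the VIP would look like the command below; if the VIP routes the connection back to the issuing node, the node would effectively be meeting itself over the loopback alias:

redis-cli -p 6001 -a <password> cluster meet 10.194.198.183 6001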
Comment From: nsendev
Here is our redis.conf (we have multiple such clusters, all similar except for the max memory size):
############################ NETWORK
protected-mode no
port 6001
tcp-backlog 64000
# Close the connection after a client is idle for N seconds (0 to disable)
timeout 0
# A reasonable value for this option is 60 seconds.
tcp-keepalive 0
########################### GENERAL
daemonize yes
pidfile /var/run/redis-6001.pid
supervised no
loglevel notice
logfile /opt/logs/jcache/redis-6001.log
########################## SNAPSHOTTING
# Save the DB on disk:
save "" stop-writes-on-bgsave-error yes rdbcompression yes rdbchecksum yes dbfilename redis-6001.rdb dir /opt/jcache/redis/snapshot
############################ SLOW LOG
slowlog-log-slower-than 10000
slowlog-max-len 128
####################### EVENT NOTIFICATION
notify-keyspace-events ""
########################## LATENCY MONITOR
latency-monitor-threshold 0
########################## REDIS CLUSTER
cluster-enabled yes
cluster-config-file nodes-6001.conf
cluster-node-timeout 15000
########################## LUA SCRIPTING
lua-time-limit 5000
############################ SECURITY
requirepass redis_password_from_ckmsstore
############################# LIMITS
maxclients 10000
maxmemory 400mb
maxmemory-policy noeviction
########################### REPLICATION
masterauth redis_password_from_ckmsstore
slave-serve-stale-data yes
slave-read-only yes
repl-diskless-sync yes
repl-diskless-sync-delay 5
repl-disable-tcp-nodelay no
repl-timeout 1200
slave-priority 100
min-slaves-to-write 0
min-slaves-max-lag 10
######################## APPEND ONLY MODE
appendonly no
appendfilename appendonly-6001.aof
appendfsync no
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
######################### ADVANCED CONFIG
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-size -2
list-compress-depth 0
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 1024mb 1024mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
aof-rewrite-incremental-fsync yes
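Worth noting: this configuration contains neither a bind directive nor cluster-announce-ip, so Redis listens on all interfaces and can end up advertising whichever local address a peer happened to reach it on. A sketch of what pinning the addresses explicitly would look like is shown below; the IP is just one of the node addresses from this thread used as an example, not a recommendation for this environment:

bind 10.249.249.218
cluster-announce-ip 10.249.249.218
cluster-announce-port 6001
cluster-announce-bus-port 16001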
Comment From: nsendev
Regarding the load balancer: we have an L3DSR load balancer whose VIP is added on the box as a loopback alias. Somehow the Redis topology shows that address as the node's own IP (randomly, on some boxes).
Comment From: ashutoshvsingh
I would like to find the code where the node figures out its own IP. Here is typical ifconfig output on a host. The lo:1 and lo:2 loopback aliases belong to the two load balancers that clients use to hit a web service running the Lettuce client, which talks to Redis either on the local box or on a remote one, as described by the redis.conf.
ens1f0: flags=4163
lo: flags=73
lo:1: flags=73
lo:2: flags=73
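For context, a VIP used for an L3DSR direct-return setup is typically configured on the host as a /32 loopback alias, roughly like the command below. The exact mechanism in this deployment is not shown in the thread; the address is the VIP mentioned earlier, used purely as an example:

ip addr add 10.194.198.183/32 dev lo label lo:1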
Comment From: madolson
Here is the code where a node figures out its own IP: https://github.com/redis/redis/blob/unstable/src/cluster.c#L1696. I don't think we're triggering that, though. There are other cases involving cluster announce IPs, but you aren't using those. If you have server logs for this host, and others, with the relevant times when you started experiencing this issue, that would also help a lot to rule this case out.
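For readers following along, here is a minimal sketch of the idea behind that code path, assuming the standard POSIX socket API: the node asks the kernel (via getsockname()) for the local endpoint of the socket a cluster-bus packet arrived on and treats that as its own address. This is an illustration, not the actual Redis source.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Write the local IP of the connected socket `fd` into `ip` (of size `len`).
 * Returns 0 on success, -1 on error. With no `bind` directive the listening
 * socket is bound to 0.0.0.0, so the address returned here is whichever
 * local interface the peer happened to reach us on -- which can be a VIP
 * configured as a loopback alias. */
int local_ip_of_socket(int fd, char *ip, size_t len) {
    struct sockaddr_storage sa;
    socklen_t salen = sizeof(sa);

    if (getsockname(fd, (struct sockaddr *)&sa, &salen) == -1) return -1;
    if (sa.ss_family == AF_INET) {
        struct sockaddr_in *s = (struct sockaddr_in *)&sa;
        if (!inet_ntop(AF_INET, &s->sin_addr, ip, len)) return -1;
    } else {
        struct sockaddr_in6 *s = (struct sockaddr_in6 *)&sa;
        if (!inet_ntop(AF_INET6, &s->sin6_addr, ip, len)) return -1;
    }
    return 0;
}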
Comment From: ashutoshvsingh
Would you like to get on Zoom or Hangouts? It would be easier to debug that way. In the meantime I will add more logs from the system.
Comment From: ashutoshvsingh
Is there a way for me to core-dump the process and capture the packet when a given IP shows up in the gossip used for self-discovery?
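Short of instrumenting Redis itself, one option (a suggestion, not something validated on this setup) is to capture all cluster-bus traffic for the instance and search the capture afterwards for the suspect VIP. For the 6001 instance, whose bus port is 16001, that would look roughly like:

tcpdump -i any -s 0 -w clusterbus-16001.pcap port 16001

The resulting pcap can then be inspected (for example with Wireshark or tcpdump -r) for packets to, from, or mentioning 10.194.198.183.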
Comment From: madolson
I'm not sure there is any native solution for that, unfortunately. You could add some checks within Redis and build it yourself, so that a node reports whenever it updates its own IP address. If you want, we can add some of that instrumentation in Redis 6.2.
Comment From: sundarms
@madolson Here is some log output:
...
93295:M 05 Oct 2020 22:55:02.658 # Address updated for node 5826cccc3b02f4bbed5c9e2da3ee05d173b8b5d8, now 10.210.82.33:6034
93295:M 05 Oct 2020 22:55:02.720 # Address updated for node 0a05f1dbf0e59906b9b0ba745e00ae5215744dc4, now 10.210.82.33:6035
93295:M 05 Oct 2020 22:55:02.749 # Address updated for node 425f7c2a7a490caf65a7d3d090670c0abe448310, now 10.210.82.33:6036
...
From the code I see the address is updated (in the nodeUpdateAddressIfNeeded method) while a handshake is in progress or a ping message is processed. Any idea in which case this can happen?
Comment From: salujasarab
We found the solution. Redis was not started with the bind configuration directive, so it was listening for connections on all available network interfaces on the host. Since one of those interfaces carried the VIP, the Redis server could accept connections on it as well, and over time the instance's IP in nodes.conf would get updated to the VIP. Once this config propagated to the client, it threw the client off balance. The solution was to start Redis with the bind directive, followed by the IP to bind to. The issue can be closed.
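For anyone hitting the same symptom, the fix described above amounts to a one-line redis.conf change; the address below is just an example node IP from this thread, and should be the host's own service IP:

bind 10.249.249.218

Alternatively, cluster-announce-ip can be used to pin the address a node advertises to the cluster without restricting which interfaces it listens on.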
Comment From: yossigo
@salujasarab Thanks for reporting the resolution.