We have a Redis cluster (cluster mode enabled) up and running on an EC2 instance (we can't use the AWS-managed one). To make it reachable from our internal network we announce an IP and ports via cluster-announce-ip, cluster-announce-port and cluster-announce-bus-port, where the announced IP is reachable from our network but not from within the EC2 instance itself. It mostly works, but it is not stable: the nodes keep switching between the announced IP and the loopback address, see below:

internal:36379> cluster nodes
eec09ffe56b05ad12b615b1d72fb6759f9c442dd internal:36379@40002 slave,fail b48e9381bfc8870317890483f3a610195a88c726 1580294719147 1580294718345 8 connected
ca0cb878becba2270cf00ec75be806304d561b0b internal:36379@40003 slave 3a5c9bc26bb3fbc7e850199320595946f3a6569a 1580294720250 1580294719347 9 connected
3a5c9bc26bb3fbc7e850199320595946f3a6569a internal:30006@40006 myself,master - 0 1580294720000 9 connected 10923-16383
33de5143f47674dd0fc636404fe4d7752d2cf9e2 internal:36379@40004 master - 1580294720651 1580294720151 7 connected 0-5460
b48e9381bfc8870317890483f3a610195a88c726 internal:36379@40005 master,fail - 1580294721253 1580294721153 8 connected 5461-10922
74cd3e1ededd204408e2dabce022bd08ab6b03b3 internal:36379@40001 slave,fail 33de5143f47674dd0fc636404fe4d7752d2cf9e2 1580215532177 1580215531374 7 connected
internal:36379> 
internal:36379> cluster nodes
eec09ffe56b05ad12b615b1d72fb6759f9c442dd 127.0.0.1:30002@40002 slave b48e9381bfc8870317890483f3a610195a88c726 0 1580294727000 8 connected
ca0cb878becba2270cf00ec75be806304d561b0b 127.0.0.1:30003@40003 slave 3a5c9bc26bb3fbc7e850199320595946f3a6569a 0 1580294727381 9 connected
3a5c9bc26bb3fbc7e850199320595946f3a6569a internal:30006@40006 myself,master - 0 1580294726000 9 connected 10923-16383
33de5143f47674dd0fc636404fe4d7752d2cf9e2 internal:36379@40004 master,fail? - 1580294725274 1580294725000 7 connected 0-5460
b48e9381bfc8870317890483f3a610195a88c726 127.0.0.1:30005@40005 master,fail - 0 1580294727080 8 connected 5461-10922
74cd3e1ededd204408e2dabce022bd08ab6b03b3 internal:36379@40001 slave,fail 33de5143f47674dd0fc636404fe4d7752d2cf9e2 1580215532177 1580215531374 7 connected
internal:36379> 

Here internal is one of our six internal IPs. The cluster itself runs on ports 30001-30006. We are able to set/get keys for a moment before the nodes switch back to announcing the loopback address instead of our IP. Any idea why this is not stable?
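For reference, the announce settings on each node look roughly like the sketch below (one config file per node; the announced IP and ports are placeholders matching the output above, not our literal values):

port 30001
cluster-enabled yes
# address reachable from our network, but not from inside the EC2 instance itself
cluster-announce-ip <announced-ip>
cluster-announce-port 36379
cluster-announce-bus-port 40001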

Comment From: contributenow

I can see the below in the logs:

22775:S 30 Jan 2020 09:47:19.927 * Connecting to MASTER internal:36379
22775:S 30 Jan 2020 09:47:19.928 * MASTER <-> REPLICA sync started
22775:S 30 Jan 2020 09:47:21.532 # Cluster state changed: fail
22775:S 30 Jan 2020 09:47:22.134 # Cluster state changed: ok
22775:S 30 Jan 2020 09:47:24.215 # Address updated for node 4f2df6b3380426603666d105332615513e89f5e6, now internal:36379
22775:S 30 Jan 2020 09:47:24.456 # Address updated for node 38ba74aa72e032b77b8d4f8c83343bac6ac17293, now internal:36379
22775:S 30 Jan 2020 09:47:24.617 # Address updated for node 86bf61d06ef39e41214ef8545fcbf5a8d6438a3e, now internal:36379
22775:S 30 Jan 2020 09:47:24.937 * Connecting to MASTER internal:36379
22775:S 30 Jan 2020 09:47:24.937 * MASTER <-> REPLICA sync started
22775:S 30 Jan 2020 09:47:25.018 * FAIL message received from 86bf61d06ef39e41214ef8545fcbf5a8d6438a3e about 3881954a24aa786e3a8112d4bfd559cd64ca0895
22775:S 30 Jan 2020 09:47:25.839 * Clear FAIL state for node 3881954a24aa786e3a8112d4bfd559cd64ca0895: replica is reachable again.
22775:S 30 Jan 2020 09:47:26.339 # Address updated for node 3881954a24aa786e3a8112d4bfd559cd64ca0895, now internal:36379
22775:S 30 Jan 2020 09:47:26.422 # Address updated for node 4f2df6b3380426603666d105332615513e89f5e6, now internal:36379
22775:S 30 Jan 2020 09:47:26.540 # Cluster state changed: fail
22775:S 30 Jan 2020 09:47:27.142 # Cluster state changed: ok
22775:S 30 Jan 2020 09:47:28.644 # Address updated for node 3881954a24aa786e3a8112d4bfd559cd64ca0895, now internal:36379
22775:S 30 Jan 2020 09:47:28.830 # Address updated for node 4f2df6b3380426603666d105332615513e89f5e6, now internal:36379
22775:S 30 Jan 2020 09:47:29.465 # Address updated for node 38ba74aa72e032b77b8d4f8c83343bac6ac17293, now internal:36379

I think something like this is happening: the replica tries to connect to its master at the announced internal IP, which is not reachable from inside the EC2 instance, so it falls back to the loopback address; then, after some time, because we are announcing a different IP, it re-announces that address again and the cycle repeats.
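If that is what is happening, it should be easy to confirm from inside the EC2 instance whether the announced address is reachable at all (placeholder host and ports from above; assumes redis-cli and nc are available on the instance):

# from the EC2 instance hosting the nodes: can it reach the address it announces?
redis-cli -h internal -p 36379 ping
# and the announced cluster bus port (plain TCP check)
nc -zv internal 40001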

Comment From: rokcarl

I have the same problem. I have three servers, each running a master and a replica. When I run cluster nodes, I see 127.0.0.1 for the master and/or replica node on the same machine I'm connected to. There doesn't seem to be an obvious reason why some nodes are shown as 127.0.0.1 and some as their real IPs. All the nodes are bound to 0.0.0.0 and all were created in basically the same way. In some logs I do see # IP address for this node updated to 127.0.0.1, in others # IP address for this node updated to 123.456.789.001.

The actual problem comes when I try to connect to the cluster with the Lettuce Java client library from a machine outside the cluster. Lettuce is cluster-aware: I think it sees 127.0.0.1 in the cluster nodes output and then tries to connect to 127.0.0.1, but since the client runs on a machine outside the cluster, that is the wrong address.

Cluster nodes result from server 1:
e1023c2e74bc9fe040731ed17514c79aaf0f04e0 127.0.0.1:7000@17000 slave 5eb86568d6a961dbd7c2c27a2c60a20b0ef38b18 0 1589435606518 3 connected
b9bd5aba268d51052119211d35ca323b361e54e7 111.222.333.003:7000@17000 slave 417debe112bd86fa91effebb8867b4cfcc11f755 0 1589435606820 2 connected
bff62c38595b000e31e74a701cfc77dfbf7fb092 111.222.333.002:7000@17000 slave bc33ca3667a70fac71e11948b6cea7ed2346cf6a 0 1589435605516 1 connected
bc33ca3667a70fac71e11948b6cea7ed2346cf6a 111.222.333.001:6379@16379 myself,master - 0 1589435605000 1 connected 0-5460
417debe112bd86fa91effebb8867b4cfcc11f755 111.222.333.002:6379@16379 master - 0 1589435604815 2 connected 5461-10922
5eb86568d6a961dbd7c2c27a2c60a20b0ef38b18 111.222.333.003:6379@16379 master - 0 1589435605817 3 connected 10923-16383

Cluster nodes result from server 3, with problems on its master and replica:
e1023c2e74bc9fe040731ed17514c79aaf0f04e0 111.222.333.001:7000@17000 slave 5eb86568d6a961dbd7c2c27a2c60a20b0ef38b18 0 1589435618087 3 connected
5eb86568d6a961dbd7c2c27a2c60a20b0ef38b18 127.0.0.1:6379@16379 myself,master - 0 1589435619000 3 connected 10923-16383
bc33ca3667a70fac71e11948b6cea7ed2346cf6a 111.222.333.001:6379@16379 master - 0 1589435618000 1 connected 0-5460
417debe112bd86fa91effebb8867b4cfcc11f755 111.222.333.002:6379@16379 master - 0 1589435618000 2 connected 5461-10922
bff62c38595b000e31e74a701cfc77dfbf7fb092 111.222.333.002:7000@17000 slave bc33ca3667a70fac71e11948b6cea7ed2346cf6a 0 1589435619593 1 connected
b9bd5aba268d51052119211d35ca323b361e54e7 127.0.0.1:7000@17000 slave 417debe112bd86fa91effebb8867b4cfcc11f755 0 1589435619091 2 connected

The cluster was created by first starting all the server nodes, then running redis-cli --cluster create with the three master nodes, and finally redis-cli --cluster add-node [ip]:7000 127.0.0.1:6379 --cluster-slave --cluster-master-id [master-id-on-another-server] for each replica, roughly the sequence sketched below.
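With the placeholder IPs from above, the sequence looks roughly like this:

# create the cluster from the three masters (run once, from any server)
redis-cli --cluster create 111.222.333.001:6379 111.222.333.002:6379 111.222.333.003:6379

# then, on each server, attach the local replica on port 7000 to a master on another server
redis-cli --cluster add-node 111.222.333.001:7000 127.0.0.1:6379 \
    --cluster-slave --cluster-master-id <master-id-on-another-server>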

An additional question here: should the client be smarter about this, and either remap 127.0.0.1 to the actual IP or use some other way of getting the cluster state (I'm assuming it takes its topology from cluster nodes)?
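One client-side workaround along those lines (an untested sketch, assuming Lettuce 5.1+ and using the placeholder IP from above) is to have Lettuce rewrite addresses before it connects, via a MappingSocketAddressResolver. Note this is only a band-aid: a blanket remap cannot tell apart several nodes that all report 127.0.0.1, so the real fix is still for each node to announce a reachable address with cluster-announce-ip.

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;
import io.lettuce.core.internal.HostAndPort;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;
import io.lettuce.core.resource.DnsResolvers;
import io.lettuce.core.resource.MappingSocketAddressResolver;

public class ClusterLoopbackRemap {
    public static void main(String[] args) {
        // Rewrite 127.0.0.1 (as reported by CLUSTER NODES) to the node's real,
        // externally reachable IP before Lettuce opens a connection.
        // 111.222.333.001 is the placeholder IP used in this thread.
        MappingSocketAddressResolver resolver = MappingSocketAddressResolver.create(
                DnsResolvers.UNRESOLVED,
                hostAndPort -> "127.0.0.1".equals(hostAndPort.getHostText())
                        ? HostAndPort.of("111.222.333.001", hostAndPort.getPort())
                        : hostAndPort);

        ClientResources resources = DefaultClientResources.builder()
                .socketAddressResolver(resolver)
                .build();

        RedisClusterClient client = RedisClusterClient.create(
                resources, RedisURI.create("redis://111.222.333.001:6379"));

        try (StatefulRedisClusterConnection<String, String> connection = client.connect()) {
            System.out.println(connection.sync().get("some-key"));
        } finally {
            client.shutdown();
            resources.shutdown();
        }
    }
}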