Redis Full resync with slave is done twice when slaveof <masterip> is a FQDN

In case the host parameter is provided as a FQDN (SLAVEOF host port), slave full resync is performed twice.

slave log:

[13868] 16 Oct 11:28:30.119 # Server started, Redis version 2.8.17 [13868] 16 Oct 11:28:43.881 * DB loaded from disk: 13.762 seconds [13868] 16 Oct 11:28:43.882 * The server is now ready to accept connections on port 6400 [13868] 16 Oct 11:28:43.888 * Connecting to MASTER detta1-redis-db-901.data.det:6400 [13868] 16 Oct 11:28:43.889 * MASTER <-> SLAVE sync started [13868] 16 Oct 11:28:43.889 * Non blocking connect for SYNC fired the event. [13868] 16 Oct 11:28:43.889 * Master replied to PING, replication can continue... [13868] 16 Oct 11:28:43.890 * Partial resynchronization not possible (no cached master) [13868] 16 Oct 11:28:43.890 * Full resync from master: de143d0176fbae863acfb19241628ff395a150e7:1 [13868] 16 Oct 11:29:01.898 * MASTER <-> SLAVE sync: receiving 1551510931 bytes from master [13868] 16 Oct 11:29:15.473 * MASTER <-> SLAVE sync: Flushing old data [13868] 16 Oct 11:29:18.733 * MASTER <-> SLAVE sync: Loading DB in memory [13868] 16 Oct 11:29:35.499 * MASTER <-> SLAVE sync: Finished with success [13868] 16 Oct 11:29:40.326 # Connection with master lost. [13868] 16 Oct 11:29:40.326 * Caching the disconnected master state. [13868] 16 Oct 11:29:40.326 * Discarding previously cached master state. [13868] 16 Oct 11:29:40.326 * SLAVE OF 10.86.3.175:6400 enabled (user request) [13868] 16 Oct 11:29:40.345 # CONFIG REWRITE executed with success. [13868] 16 Oct 11:29:41.120 * Connecting to MASTER 10.86.3.175:6400 [13868] 16 Oct 11:29:41.120 * MASTER <-> SLAVE sync started [13868] 16 Oct 11:29:41.120 * Non blocking connect for SYNC fired the event. [13868] 16 Oct 11:29:41.121 * Master replied to PING, replication can continue... [13868] 16 Oct 11:29:41.121 * Partial resynchronization not possible (no cached master) [13868] 16 Oct 11:29:41.121 * Full resync from master: de143d0176fbae863acfb19241628ff395a150e7:17244 [13868] 16 Oct 11:29:59.130 * MASTER <-> SLAVE sync: receiving 1551510931 bytes from master [13868] 16 Oct 11:30:09.797 * MASTER <-> SLAVE sync: Flushing old data [13868] 16 Oct 11:30:12.871 * MASTER <-> SLAVE sync: Loading DB in memory [13868] 16 Oct 11:30:27.986 * MASTER <-> SLAVE sync: Finished with success

master log:

[13319] 16 Oct 11:27:05.917 # Server started, Redis version 2.8.17 [13319] 16 Oct 11:27:18.744 * DB loaded from disk: 12.827 seconds [13319] 16 Oct 11:27:18.745 * The server is now ready to accept connections on port 6400 [13319] 16 Oct 11:28:43.890 * Slave asks for synchronization [13319] 16 Oct 11:28:43.890 * Full resync requested by slave. [13319] 16 Oct 11:28:43.890 * Starting BGSAVE for SYNC [13319] 16 Oct 11:28:43.892 * Background saving started by pid 13344 [13344] 16 Oct 11:29:01.838 * DB saved on disk [13344] 16 Oct 11:29:01.849 * RDB: 24 MB of memory used by copy-on-write [13319] 16 Oct 11:29:01.897 * Background saving terminated with success [13319] 16 Oct 11:29:15.265 * Synchronization with slave succeeded [13319] 16 Oct 11:29:40.326 # Connection with slave 10.86.3.176:6400 lost. [13319] 16 Oct 11:29:41.121 * Slave asks for synchronization [13319] 16 Oct 11:29:41.121 * Full resync requested by slave. [13319] 16 Oct 11:29:41.121 * Starting BGSAVE for SYNC [13319] 16 Oct 11:29:41.124 * Background saving started by pid 13346 [13346] 16 Oct 11:29:59.027 * DB saved on disk [13346] 16 Oct 11:29:59.028 * RDB: 22 MB of memory used by copy-on-write [13319] 16 Oct 11:29:59.129 * Background saving terminated with success [13319] 16 Oct 11:30:09.577 * Synchronization with slave succeeded

Comment From: badboy

[13868] 16 Oct 11:29:40.326 # Connection with master lost.

For some reason it loses the (already established) connection.

[13868] 16 Oct 11:29:40.326 * SLAVE OF 10.86.3.175:6400 enabled (user request)

and then a SLAVEOF is issued again, resulting in the full resync.

Are you sure this is only due to the use of a FQDN? Some client sends a slaveof again. Is this done automatically by your system? Or did you send that manually?

Comment From: everesio

It happens automatically and I can reproduce it every time. When the FQDN is changed to IP, full sync is performed once.

Comment From: mattsta

It happens automatically and I can reproduce it every time.

After a few attempts, I can't reproduce the name-based auto-disconnect on my systems.

Do you have more details about your steps or setup? Are you using redis-cli directly? A different redis client? A management application? What OS? Anything going through proxies or load balancers?

Comment From: everesio

The setup was done in the "2.6.32-431.29.2.el6.centos.plus.x86_64" running in VMware. There was only one NIC on each host, the same network and no load balancer was used. FQDN of master was configured in the redis.conf and redis slave was started. The instances were provision by Foreman/Puppet, but during the sync there was no interaction with provisioning system (maybe apart from DNS Lookup). After the slave was killed and the rewritten config replaced with the original one, after a slave start-up, the double sync happened again.

Comment From: mattsta

It looks like a strange networking issue since the connection just goes away without any user interaction. Are timeouts set too low? If you can still reproduce the problem try increasing the Redis log level to get more details about what's happening.

Comment From: everesio

As said it happens only with FQDN. It seems that it happens during / after config rewrite (FQDN => IP)

Params: timeout 300 tcp-keepalive 60 repl-timeout 180 repl-disable-tcp-nodelay no

set loglevel debug (on both master and slave)

master:

[8098] 24 Oct 20:54:28.899 * Slave asks for synchronization [8098] 24 Oct 20:54:28.900 * Full resync requested by slave. [8098] 24 Oct 20:54:28.900 * Starting BGSAVE for SYNC [8098] 24 Oct 20:54:28.927 * Background saving started by pid 24219 [24219] 24 Oct 20:54:49.248 * DB saved on disk [24219] 24 Oct 20:54:49.274 * RDB: 6 MB of memory used by copy-on-write [8098] 24 Oct 20:54:49.404 * Background saving terminated with success [8098] 24 Oct 20:55:04.426 * Synchronization with slave succeeded [8098] 24 Oct 20:55:37.963 # Connection with slave 10.86.4.141:6400 lost. [8098] 24 Oct 20:55:38.513 * Slave asks for synchronization [8098] 24 Oct 20:55:38.513 * Full resync requested by slave. [8098] 24 Oct 20:55:38.513 * Starting BGSAVE for SYNC [8098] 24 Oct 20:55:38.536 * Background saving started by pid 24221 [24221] 24 Oct 20:56:00.509 * DB saved on disk [24221] 24 Oct 20:56:00.542 * RDB: 6 MB of memory used by copy-on-write [8098] 24 Oct 20:56:00.616 * Background saving terminated with success [8098] 24 Oct 20:56:17.239 * Synchronization with slave succeeded

slave:

[1100] 24 Oct 20:54:15.410 # Server started, Redis version 2.8.17 [1100] 24 Oct 20:54:28.701 * DB loaded from disk: 13.291 seconds [1100] 24 Oct 20:54:28.701 * The server is now ready to accept connections on port 6400 [1100] 24 Oct 20:54:28.894 * Connecting to MASTER detta1-redis-db-901.data.det:6400 [1100] 24 Oct 20:54:28.895 * MASTER <-> SLAVE sync started [1100] 24 Oct 20:54:28.896 * Non blocking connect for SYNC fired the event. [1100] 24 Oct 20:54:28.896 * Master replied to PING, replication can continue... [1100] 24 Oct 20:54:28.897 * Partial resynchronization not possible (no cached master) [1100] 24 Oct 20:54:28.898 * Full resync from master: 5169384220cbb44a145c047930d787e83e957477:202434404 [1100] 24 Oct 20:54:49.402 * MASTER <-> SLAVE sync: receiving 1551513934 bytes from master [1100] 24 Oct 20:55:04.768 * MASTER <-> SLAVE sync: Flushing old data [1100] 24 Oct 20:55:07.740 * MASTER <-> SLAVE sync: Loading DB in memory [1100] 24 Oct 20:55:29.928 * MASTER <-> SLAVE sync: Finished with success [1100] 24 Oct 20:55:37.960 # Connection with master lost. [1100] 24 Oct 20:55:37.960 * Caching the disconnected master state. [1100] 24 Oct 20:55:37.960 * Discarding previously cached master state. [1100] 24 Oct 20:55:37.960 * SLAVE OF 10.86.4.140:6400 enabled (user request) [1100] 24 Oct 20:55:37.974 # CONFIG REWRITE executed with success. [1100] 24 Oct 20:55:38.509 * Connecting to MASTER 10.86.4.140:6400 [1100] 24 Oct 20:55:38.509 * MASTER <-> SLAVE sync started [1100] 24 Oct 20:55:38.509 * Non blocking connect for SYNC fired the event. [1100] 24 Oct 20:55:38.510 * Master replied to PING, replication can continue... [1100] 24 Oct 20:55:38.510 * Partial resynchronization not possible (no cached master) [1100] 24 Oct 20:55:38.511 * Full resync from master: 5169384220cbb44a145c047930d787e83e957477:202454910 [1100] 24 Oct 20:56:00.614 * MASTER <-> SLAVE sync: receiving 1551513934 bytes from master [1100] 24 Oct 20:56:17.432 * MASTER <-> SLAVE sync: Flushing old data [1100] 24 Oct 20:56:20.682 * MASTER <-> SLAVE sync: Loading DB in memory [1100] 24 Oct 20:56:38.769 * MASTER <-> SLAVE sync: Finished with success

Comment From: mattsta

You mention this:

during the sync there was no interaction with provisioning system

But I'm not sure that's true.

Your new log shows the disconnect is caused by a user initiated SLAVEOF command. A SLAVEOF command to an ip address. Something in your monitoring system is running "SLAVEOF 10.86.4.140 6400" even though you are already connected to detta1-redis-db-901.data.det:6400.

Comment From: mattsta

Also, are you using Sentinel at all? There's no mention of Sentinel above, but Sentinel will force all configurations to use IP addresses.

If you are using Sentinel, you'll see an event in your sentinel log like "+fix-slave-config" — Sentinel sees your replica is connected to "detta1-redis-db-901.data.det:6400", but it doesn't know that "detta1-redis-db-901.data.det" is the correct IP address for your master.

Comment From: mattsta

We should probably be forcing Redis to only act on IP addresses after name resolution anyway. Otherwise we have things like INFO output showing hostnames instead of IP addresses:

master_host:poo1333.vcap.me
master_port:6379

Using hostnames here makes Sentinel unable to verify the configuration, so Sentinel re-attaches the replica to the IP address of the same host.

The only change we need to make Redis use the IP of hostnames during SLAVEOF is:

diff --git a/src/replication.c b/src/replication.c
index 16014c8..af68ece 100644
--- a/src/replication.c
+++ b/src/replication.c
@@ -1305,6 +1305,15 @@ void slaveofCommand(redisClient *c) {
     } else {
         long port;

+        char *ip = c->argv[1]->ptr;
+        char ipbuf[REDIS_IP_STR_LEN];
+        if (anetResolve(NULL, ip, ipbuf, sizeof(ipbuf)) == ANET_OK) {
+            ip = ipbuf;
+        } else {
+            addReplyError(c, "Requested hostname can't be resolved.");
+            return;
+        }
+
         if ((getLongFromObjectOrReply(c, c->argv[2], &port, NULL) != REDIS_OK))
             return;

@@ -1317,7 +1326,7 @@ void slaveofCommand(redisClient *c) {
         }
         /* There was no previous master or the user specified a different one,
          * we can continue. */
-        replicationSetMaster(c->argv[1]->ptr, port);
+        replicationSetMaster(ip, port);
         redisLog(REDIS_NOTICE,"SLAVE OF %s:%d enabled (user request)",
             server.masterhost, server.masterport);
     }

Which gives us this INFO even after a SLAVEOF foo9.vcap.me 6379:

master_host:127.0.0.1
master_port:6379

The only problem with that approach: it could confuse people who think the hostname will be saved in the config for future connections. If the user later changes where their hostname points, expecting Redis to connect to the same hostname again, it won't work because Redis is already pinned (and config-saved/sentinel-saved) to the original IP address of the original hostname.

So, the other way to fix it is to just throw a big "NO HOSTNAMES IN SLAVEOF" error if somebody tries to do that.

The BIG ERROR approach would be:

diff --git a/src/replication.c b/src/replication.c
index 16014c8..a49c8c4 100644
--- a/src/replication.c
+++ b/src/replication.c
@@ -1305,6 +1305,11 @@ void slaveofCommand(redisClient *c) {
     } else {
         long port;

+        if (anetResolveIP(NULL, c->argv[1]->ptr, NULL, 0) != ANET_OK) {
+            addReplyError(c, "Must use only IP address with SLAVEOF");
+            return;
+        }
+
         if ((getLongFromObjectOrReply(c, c->argv[2], &port, NULL) != REDIS_OK))
             return;

which results in:

127.0.0.1:9999> slaveof foo7.vcap.me 9999
(error) ERR Must use only IP address with SLAVEOF

But wait—there's one more option where we mix allow config hostnames, but use IP addresses! The third fix would be: store the original host name to persist in the config file, but don't use it for live connection tracking. The live "config.masterhost" will always be an IP address but maybe a "config.originalmasterhost" would be the hostname, and the hostname is what would be persisted to the config file (but Sentinel would just check against the live IP address given at config.masterhost/INFO output).

Comment From: everesio

Yes, you right I use Sentinel (actually 4 of them). I haven't mention it as it seemed obvious for me - sorry about that. I am sure that your analysis is correct I will verify Sentinels' logs on Monday.

In the initial setup, all IPs i.e. in the “slaveof” and “sentinel monitor” were provided as FQDNs. All of them were rewritten at the end.

In my opinion, during the server/sentinel start-up, DNS-Lookup and some “initial” config rewrite (even if only in memory) would be the cleanest solution. This would provide consistent state as only IPs would be used and using FQDNs would be also possible. Manual commands should perform DNS-Lookup (probably on the server side) before application.

Thanks to everyone for the analysis !

Comment From: everesio

As you mentioned, I see "+fix-slave-config" in the sentinel.log

[8049] 24 Oct 20:54:15.465 * +reboot slave 10.86.4.141:6400 10.86.4.141 6400 @ contentexchange-cache 10.86.4.140 6400 [8049] 24 Oct 20:54:15.533 # -sdown slave 10.86.4.141:6400 10.86.4.141 6400 @ contentexchange-cache 10.86.4.140 6400 [8049] 24 Oct 20:55:17.985 * +fix-slave-config slave 10.86.4.141:6400 10.86.4.141 6400 @ contentexchange-cache 10.86.4.140 6400 [8049] 24 Oct 20:55:28.062 * +fix-slave-config slave 10.86.4.141:6400 10.86.4.141 6400 @ contentexchange-cache 10.86.4.140 6400 [8049] 24 Oct 21:22:48.923 * +reboot slave 10.86.4.141:6400 10.86.4.141 6400 @ contentexchange-cache 10.86.4.140 6400

Comment From: everesio

Should I close the issue or do you need as a reference for any CR/fix ?

Comment From: yossigo

The issue of Sentinel and hostnames should be resolved by #8282.