Redis Unexpected reply to PSYNC from master: -Reading from master: Connection timed out

I am running into the following problem when restarting a Master + 2 Slaves after a first successful startup.

"Unexpected reply to PSYNC from master: -Reading from master: Connection timed out" "Bad protocol from MASTER, the first byte is not '$' (we received '+FULLRESYNC ef395cb5c2ee43444c0222a0f4f92e61b617166c 1'), are you sure the host and port are right?"

I am sure of my network settings/firewall as I can redis-cli from the slaves to the master but it seems that the Master sometime has a hard time accepting the connection and take a long time to reply to some commands (e.g. ~60s for info replication):

127.0.0.1:6379> info replication
# Replication
role:master
connected_slaves:0
master_repl_offset:15431
repl_backlog_active:1
repl_backlog_size:8388608
repl_backlog_first_byte_offset:2
repl_backlog_histlen:15430
(57.15s)

Master log

[4087] 17 Jul 17:46:21.194 * The server is now ready to accept connections on port 6379
[4087] 17 Jul 17:46:27.214 * Slave asks for synchronization
[4087] 17 Jul 17:46:33.195 * Full resync requested by slave.
[4087] 17 Jul 17:46:39.195 * Starting BGSAVE for SYNC
[4087] 17 Jul 17:46:47.196 * Background saving started by pid 4142
[4142] 17 Jul 17:46:47.202 * DB saved on disk
[4087] 17 Jul 17:46:53.198 # Connection with slave 10.45.80.20:6379 lost.
[4142] 17 Jul 17:46:55.197 * RDB: 0 MB of memory used by copy-on-write
[4087] 17 Jul 17:47:05.207 * Background saving terminated with success
[4087] 17 Jul 17:47:22.696 * Slave asks for synchronization
[4087] 17 Jul 17:47:27.199 * Full resync requested by slave.
[4087] 17 Jul 17:47:31.199 * Starting BGSAVE for SYNC
[4087] 17 Jul 17:47:37.201 * Background saving started by pid 4172
[4172] 17 Jul 17:47:37.204 * DB saved on disk
[4087] 17 Jul 17:47:41.207 * Slave asks for synchronization
[4172] 17 Jul 17:47:43.201 * RDB: 0 MB of memory used by copy-on-write
[4087] 17 Jul 17:47:49.201 * Full resync requested by slave.
[4087] 17 Jul 17:47:55.201 * Waiting for end of BGSAVE for SYNC
[4087] 17 Jul 17:48:01.202 * Background saving terminated with success
[4087] 17 Jul 17:48:09.203 # Connection with slave 10.45.80.20:6379 lost.
[4087] 17 Jul 17:48:17.203 * Synchronization with slave succeeded
[4087] 17 Jul 17:48:23.224 * Slave asks for synchronization
[4087] 17 Jul 17:48:29.204 * Full resync requested by slave.
[4087] 17 Jul 17:48:35.205 * Starting BGSAVE for SYNC
[4087] 17 Jul 17:48:43.206 * Background saving started by pid 4205
[4205] 17 Jul 17:48:43.210 * DB saved on disk
[4087] 17 Jul 17:48:51.206 * Slave asks for synchronization
[4205] 17 Jul 17:48:53.206 * RDB: 0 MB of memory used by copy-on-write
[4087] 17 Jul 17:49:01.207 * Full resync requested by slave.
[4087] 17 Jul 17:49:11.207 * Waiting for end of BGSAVE for SYNC
[4087] 17 Jul 17:49:19.209 * Background saving terminated with success
[4087] 17 Jul 17:49:27.209 * Synchronization with slave succeeded
[4087] 17 Jul 17:49:37.210 # Connection with slave 10.45.80.30:6379 lost.
[4087] 17 Jul 17:49:41.275 * Slave asks for synchronization
[4087] 17 Jul 17:49:47.211 * Full resync requested by slave.
[4087] 17 Jul 17:49:57.213 * Starting BGSAVE for SYNC
[4087] 17 Jul 17:50:07.214 * Background saving started by pid 4250
[4250] 17 Jul 17:50:07.220 * DB saved on disk
[4087] 17 Jul 17:50:23.217 # Connection with slave 10.45.80.20:6379 lost.
[4250] 17 Jul 17:50:25.215 * RDB: 0 MB of memory used by copy-on-write
[4087] 17 Jul 17:50:35.258 * Background saving terminated with success
[4087] 17 Jul 17:50:47.217 * Slave asks for synchronization
...

Slaves log

[4002] 17 Jul 17:46:10.032 * MASTER <-> SLAVE sync started
[4002] 17 Jul 17:46:12.033 * Non blocking connect for SYNC fired the event.
[4002] 17 Jul 17:46:27.201 * Master replied to PING, replication can continue...
[4002] 17 Jul 17:46:27.204 * Partial resynchronization not possible (no cached master)
[4002] 17 Jul 17:46:32.209 # Unexpected reply to PSYNC from master: -Reading from master: Connection timed out
[4002] 17 Jul 17:46:36.034 * Retrying with SYNC...
[4002] 17 Jul 17:46:39.185 # Bad protocol from MASTER, the first byte is not '$' (we received '+FULLRESYNC ef395cb5c2ee43444c0222a0f4f92e61b617166c 1'), are you sure the host and port are right?
[4002] 17 Jul 17:46:40.536 * Connecting to MASTER 10.45.80.10:6379
[4002] 17 Jul 17:46:42.035 * MASTER <-> SLAVE sync started
[4002] 17 Jul 17:46:44.035 * Non blocking connect for SYNC fired the event.
[4002] 17 Jul 17:47:03.194 * Master replied to PING, replication can continue...
[4002] 17 Jul 17:47:13.042 * (Non critical) Master does not understand REPLCONF listening-port: -Reading from master: Connection timed out
[4002] 17 Jul 17:47:18.038 * Partial resynchronization not possible (no cached master)
[4002] 17 Jul 17:47:24.038 # Unexpected reply to PSYNC from master: +OK
[4002] 17 Jul 17:47:30.039 * Retrying with SYNC...
[4002] 17 Jul 17:47:55.191 # Bad protocol from MASTER, the first byte is not '$' (we received '+FULLRESYNC ef395cb5c2ee43444c0222a0f4f92e61b617166c 8883'), are you sure the host and port are right?
[4002] 17 Jul 17:47:55.775 * Connecting to MASTER 10.45.80.10:6379
[4002] 17 Jul 17:47:55.775 * MASTER <-> SLAVE sync started
[4002] 17 Jul 17:47:55.775 * Non blocking connect for SYNC fired the event.
[4002] 17 Jul 17:48:23.195 * Master replied to PING, replication can continue...
[4002] 17 Jul 17:48:23.201 * Partial resynchronization not possible (no cached master)
[4002] 17 Jul 17:48:28.206 # Unexpected reply to PSYNC from master: -Reading from master: Connection timed out
[4002] 17 Jul 17:48:28.210 * Retrying with SYNC...
[4002] 17 Jul 17:48:35.195 # Bad protocol from MASTER, the first byte is not '$' (we received '+FULLRESYNC ef395cb5c2ee43444c0222a0f4f92e61b617166c 8883'), are you sure the host and port are right?
...

redis.conf

daemonize yes
pidfile /var/run/redis/redis.pid
port 6379
tcp-backlog 2048
timeout 300
tcp-keepalive 60
loglevel notice
logfile /var/log/redis/redis.log
syslog-enabled yes
syslog-ident redis
syslog-facility local0
databases 16
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error no
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir /var/lib/redis/
appendonly no
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
hll-sparse-max-bytes 3000
activerehashing no
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 120
client-output-buffer-limit pubsub 32mb 8mb 120
hz 10
aof-rewrite-incremental-fsync yes

Any idea what's happening ?

Comment From: mattsta

What are the versions of your servers? (INFO server output?)

Comment From: crashdump

Tested with 2.8.12 and 2.8.13. Same package used for both Master and Slaves.

127.0.0.1:6379> INFO server
# Server
redis_version:2.8.12
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:a38d2e0ef90de81b
redis_mode:standalone
os:Linux 2.6.32-431.20.3.el6.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.4.7
process_id:2418
run_id:c8dc9c54a113f335954ffd02f394c489b6b6a97b
tcp_port:6379
uptime_in_seconds:50761
uptime_in_days:0
hz:50
lru_clock:13164612
config_file:/etc/redis.conf

Don't know if that's of any value, but I run Redis on CentOS 6.5 x64 on ec2, m3.medium instances.

Comment From: mattsta

The arbitrary disconnections look strange:

[4087] 17 Jul 17:46:53.198 # Connection with slave 10.45.80.20:6379 lost.

[4002] 17 Jul 17:46:32.209 # Unexpected reply to PSYNC from master: -Reading from master: Connection timed out

Are your network settings and security groups configured properly?

After the disconnect, you get the "bad protocol" message which looks like the master didn't start with a new connection, but tired to reuse the old buffer from where it left off. I'm not entirely sure why it would do that.

It sounds like it's easily reproducible for you? Can you test your setup in not-AWS to try and isolate if the network is an issue?

Comment From: crashdump

I think my network and security groups are configured correctly as I can establish/start the replication at first. Its when I restart my nodes that it goes all wrong. OK – I'll try to reproduce that on Vagrant.

Comment From: crashdump

I wasn't able to reproduce on vagrant just yet but it seems to happen (on ec2) after a master failover by redis-sentinel.

# Bad protocol from MASTER, the first byte is not '$' (we received '+FULLRESYNC 0766b0d51b75786d75225b3d89ee7337986cac7a 1'), are you sure the host and port are right?

Port are wide open: TCP 1-65535 and UDP 1-65535.

Comment From: humblepilot

Hey @crashdump wondering how was this issue resolved and can you please share the solution .. I m facing similar issue

Comment From: carlosfu

in order to solve quickly, restart the slave. I faced this problem yesterday

Comment From: ynnt

Just had the same issue, slave restart has fixed this issue.

Comment From: life5ign

I got this error when I accidentally configured a master as a slave as well (i.e. I wrote slave configuration to its config file). When I removed that config, everything worked fine.