I'm having difficulty getting Redis to work in a Docker Swarm setup. At first it works, but after a while (probably after a service restart) I'm getting these errors:
```
03 Mar 2020 13:41:09.748 * Connecting to MASTER redis-master:6379
03 Mar 2020 13:41:09.749 * MASTER <-> REPLICA sync started
03 Mar 2020 13:41:09.749 # Error condition on socket for SYNC: Connection refused
03 Mar 2020 13:41:10.751 * Connecting to MASTER redis-master:6379
…
```
These go on forever.
I'm deploying with a docker-compose stack file: 1 master, plus a replica on each of the 3 servers. The setup doesn't use Sentinels. My thinking is that if the master fails, Docker restarts the service and it reads its config back in via the shared volume used by both the master and the replicas. The relevant parts:
```yaml
  redis-master:
    image: "${CI_REGISTRY_IMAGE}:redis_master-${CI_COMMIT_REF_SLUG}"
    networks:
      - mynetwork
    volumes:
      - redis:/opt/scripts
    ports:
      - 6379:6379
    command: sh -c 'redis-server /usr/local/etc/redis/redis.conf --bind $$(hostname -i)'
    deploy:
      replicas: 1
      update_config:
        parallelism: 1
        delay: 10s
        order: stop-first
        failure_action: rollback
      rollback_config:
        parallelism: 1
        delay: 10s
        order: stop-first
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 5
        window: 180s
    healthcheck:
      test: /usr/local/bin/healthcheck.sh
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 1m

  redis-replica:
    image: "${CI_REGISTRY_IMAGE}:redis_replica-${CI_COMMIT_REF_SLUG}"
    networks:
      - mynetwork
    volumes:
      - redis:/opt/scripts
    ports:
      - 6380:6380
    command: sh -c 'redis-server /usr/local/etc/redis/redis.conf --bind $$(hostname -i) --replica-announce-ip $$(hostname -i) --port 6380 --replicaof redis-master 6379'
    depends_on:
      - redis-master
    deploy:
      mode: global
      update_config:
        parallelism: 1
        delay: 10s
        order: stop-first
        failure_action: rollback
      rollback_config:
        parallelism: 1
        delay: 10s
        order: stop-first
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 5
        window: 180s
    healthcheck:
      test: /usr/local/bin/healthcheck.sh
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 1m
```
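For completeness, the services above imply top-level `networks` and `volumes` definitions roughly like the following. This is a sketch filling in the parts I omitted; the driver options shown are the common defaults for a Swarm stack, not necessarily exactly what my file contains:

```yaml
# Sketch of the top-level definitions referenced by the services
# (names match the snippet; driver options are illustrative).
networks:
  mynetwork:
    driver: overlay   # overlay is required for cross-node service discovery in Swarm
volumes:
  redis: {}           # a named volume; note this is per-node, not shared across servers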
I have tested this setup and it seems to work. I've also tested rebooting each of the 3 servers, and after a while Redis reconnects to all of the instances fine, so they do seem to find each other.
But after a while this breaks. I don't know exactly when; by the time I notice it, my log is full of reconnect messages.
If I bind to 0.0.0.0, everything seems to go well (at least for longer periods of time), but that leaves my database wide open, so it's not feasible. I have a feeling the problem has something to do with the binding, or that a restart of the service gets a new IP, but I don't know.
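If binding to 0.0.0.0 turns out to be the only stable option, one common mitigation is to rely on Redis's own access controls instead of the bind address. A minimal redis.conf sketch (the password is a placeholder, and whether this fits your threat model is an assumption on my part):

```
# Sketch: accept connections on any interface, but gate them with auth.
bind 0.0.0.0
protected-mode yes
requirepass CHANGE_ME   # placeholder; replicas would then also need: masterauth CHANGE_ME
```

Combined with `expose` instead of published `ports`, 0.0.0.0 is then only reachable from containers on the same overlay network rather than from the host's public interfaces.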
Any help much appreciated!
Comment From: Monokai
Update. I have now changed the bind to 0.0.0.0, and used expose: 6379 instead of ports: "6379:6379" to expose the port to other services on the overlay network without publishing the container's port on the host.
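Concretely, the change on the master service looks like this (a sketch of the diff):

```yaml
  redis-master:
    # ports:
    #   - 6379:6379   # old: published on every Swarm node's host interface
    expose:
      - 6379          # new: reachable only by services on the same overlay network
```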
Again, everything looks OK and after a while I'm now getting:
```
Mar 2020 10:52:21.924 # Unable to connect to MASTER: Resource temporarily unavailable
Mar 2020 10:52:22.927 * Connecting to MASTER redis-master:6379
Mar 2020 10:52:22.928 # Unable to connect to MASTER: Resource temporarily unavailable
Mar 2020 10:52:23.931 * Connecting to MASTER redis-master:6379
…
```
It repeats every second.
Comment From: Monokai
Update. This might have been an out-of-memory error. I've upgraded the server with more memory and haven't had any issues in the two weeks since.
I would like to hear whether this Docker Swarm setup is the right way to go, or whether I need to put Redis in cluster mode or something. It's all a bit vague to me how a Redis master/replica setup should ideally be run under Docker Swarm. I only use Redis for caching purposes, and it doesn't matter all that much if some data is lost when a Redis instance restarts.
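Since this instance is cache-only, one option worth considering (a sketch; the limit value is illustrative and should match the container's memory) is to configure Redis as a pure LRU cache with persistence disabled, which keeps memory bounded and makes lost-on-restart data a non-issue:

```
# Cache-only redis.conf sketch: bound memory and evict instead of failing.
maxmemory 512mb                # illustrative limit; size to the container
maxmemory-policy allkeys-lru   # evict least-recently-used keys when full
save ""                        # disable RDB snapshots; cached data is disposable
appendonly no                  # no AOF persistence either
```

A bounded `maxmemory` would also explain and prevent the suspected out-of-memory crashes above.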