Describe the bug

I'm having a problem with Redis Sentinel where Sentinel servers started after the master's become unresponsive, and the master's Sentinel marks them with +sdown. Sentinel 1 starts normally, but shortly after Sentinel 2 finishes starting and syncs, Sentinel 1 reports it down. When this happens, I cannot connect to Sentinel 2 via redis-cli; it hangs indefinitely. I've checked the debug logs and haven't found anything definitive to explain this behavior.

I considered that TILT mode and clock sync could be the problem, but all of the other clusters' Sentinels log the same messages. Additionally, all VMs are synchronized to the same time source.

To reproduce

I'm able to replicate this behavior on the official Redis image, as well as the Bitnami image. It is extremely consistent.

I have a three-node Redis cluster on three virtual machines. All are running Redis and Sentinel in a pod of containers via podman. All hosts are RHEL 8.6.

VM 1 - Redis Master, Sentinel
VM 2 - Redis Replica, Sentinel
VM 3 - Redis Replica, Sentinel
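For context, each pod is started along the following lines (the image name, network options, and config paths here are illustrative placeholders, not my exact commands):

```shell
# On each VM: one pod holding a Redis container and a Sentinel container.
podman pod create --name redis-pod --network host

# Redis server (replica nodes point at the master in their redis.conf)
podman run -d --pod redis-pod --name redis \
  -v /etc/redis:/etc/redis:Z \
  docker.io/library/redis:7 redis-server /etc/redis/redis.conf

# Sentinel in the same pod
podman run -d --pod redis-pod --name sentinel \
  -v /etc/sentinel:/etc/sentinel:Z \
  docker.io/library/redis:7 redis-sentinel /etc/sentinel/sentinel.conf
```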

I start the pod on VM 1 and it operates normally.

Then I start the pod on VM 2. The Sentinels recognize each other and connect, then after a short time, the Sentinel on VM 1 reports the Sentinel on VM 2 as down:

LOGS ON VM 1:

```
Feb 06 12:02:30 {VM 1} redis-sentinel[1934224]: 1:X 06 Feb 2024 17:02:30.194 * +sentinel sentinel 6d1e0791d25f14c982de3101a615851f4809da93 {VM 2 FQDN} 2>
Feb 06 12:02:31 {VM 1} redis-sentinel[1934224]: 1:X 06 Feb 2024 17:02:31.204 * +slave slave {VM 2 FQDN}:6379 {VM 2 FQDN} 6379 @ {MASTER NAME} {VM 1 FQDN} 6379
Feb 06 12:02:31 {VM 1} redis-sentinel[1934224]: 1:X 06 Feb 2024 17:02:31.205 . Rewritten config file (/etc/sentinel/sentinel.conf) successfully
Feb 06 12:02:31 {VM 1} redis-sentinel[1934224]: 1:X 06 Feb 2024 17:02:31.206 * Sentinel new configuration saved on disk
...
Feb 06 12:03:13 {VM 1} redis-sentinel[1934224]: 1:X 06 Feb 2024 17:03:13.345 # +sdown sentinel 6d1e0791d25f14c982de3101a615851f4809da93 {VM 2 FQDN} 2637>
```

LOGS ON VM 2:

```
Feb 06 12:02:30 {VM 2} redis-sentinel[1887535]: 1:X 06 Feb 2024 17:02:30.194 * +sentinel sentinel aa8ef85d4f22fb83d91a66edd78a5fdc245266f9 {VM 1 FQDN} 2>
Feb 06 12:02:30 {VM 2} redis-sentinel[1887535]: 1:X 06 Feb 2024 17:02:30.196 . Rewritten config file (/etc/sentinel/sentinel.conf) successfully
...
Feb 06 12:02:42 {VM 2} redis-sentinel[1887535]: 1:X 06 Feb 2024 17:02:42.281 # +tilt #tilt mode entered
Feb 06 12:03:02 {VM 2} redis-sentinel[1887535]: 1:X 06 Feb 2024 17:03:02.379 # +tilt #tilt mode entered
```

Expected behavior

I am running many Redis clusters in this identical fashion, but only two of them have had this problem so far. I would expect the Sentinels to remain up and in communication.

Additional information

Redis replication works fine throughout. Replicas will sync with the Master without any issues.

Here are the Redis and Sentinel configuration files I'm using:

redis-master.conf.txt sentinel-master.conf.txt redis-replica.conf.txt sentinel-replica.conf.txt

Comment From: bkienker

I have been able to obtain a stack trace from redis-sentinel running within the container. In this case, the instance on VM 2 is the one blocking; here it is stuck resolving VM 1's Sentinel while processing a hello message:

```
#0  0x00007f388162f9af in __GI___poll (fds=fds@entry=0x7ffc9317ab98, nfds=nfds@entry=1, timeout=999, timeout@entry=<error reading variable: That operation is not available on integers of more than 8 bytes.>) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f388150e260 in send_dg (statp=0x7f3881712c40 <_res>, buf=<optimized out>, buflen=<optimized out>, buf2=0x7ffc9317ab60 "'z\323e", buflen2=<optimized out>, ansp=<optimized out>, anssizp=<optimized out>, terrno=<optimized out>, ns=<optimized out>, v_circuit=<optimized out>, gotsomewhere=<optimized out>, anscp=<optimized out>, ansp2=<optimized out>, anssizp2=<optimized out>, resplen2=<optimized out>, ansp2_malloced=<optimized out>) at res_send.c:1151
#2  0x00007f388150ef89 in __res_context_send (ctx=ctx@entry=0x5637072d4720, buf=buf@entry=0x7ffc9317ade0 "\332\362\001", buflen=44, buf2=buf2@entry=0x7ffc9317ae0c "\234\361\001", buflen2=buflen2@entry=44, ans=<optimized out>, ans@entry=0x7ffc9317b5d0 "\332\362\205\200", anssiz=<optimized out>, ansp=<optimized out>, ansp2=<optimized out>, nansp2=<optimized out>, resplen2=<optimized out>, ansp2_malloced=<optimized out>) at res_send.c:530
#3  0x00007f388150bd4a in __GI___res_context_query (ctx=ctx@entry=0x5637072d4720, name=name@entry=0x7f3881171221 "<VM 1 FQDN>", class=class@entry=1, type=type@entry=439963904, answer=answer@entry=0x7ffc9317b5d0 "\332\362\205\200", anslen=anslen@entry=2048, answerp=0x7ffc9317be20, answerp2=0x7ffc9317be28, nanswerp2=0x7ffc9317be10, resplen2=0x7ffc9317be14, answerp2_malloced=0x7ffc9317be18) at res_query.c:216
#4  0x00007f388150c9bf in __res_context_querydomain (answerp2_malloced=0x7ffc9317be18, resplen2=0x7ffc9317be14, nanswerp2=0x7ffc9317be10, answerp2=0x7ffc9317be28, answerp=0x7ffc9317be20, anslen=2048, answer=0x7ffc9317b5d0 "\332\362\205\200", type=439963904, class=1, domain=0x0, name=0x7f3881171221 "<VM 1 FQDN>", ctx=0x5637072d4720) at res_query.c:601
#5  __GI___res_context_search (ctx=ctx@entry=0x5637072d4720, name=name@entry=0x7f3881171221 "<VM 1 FQDN>", class=class@entry=1, type=type@entry=439963904, answer=answer@entry=0x7ffc9317b5d0 "\332\362\205\200", anslen=anslen@entry=2048, answerp=<optimized out>, answerp2=<optimized out>, nanswerp2=<optimized out>, resplen2=<optimized out>, answerp2_malloced=<optimized out>) at res_query.c:370
#6  0x00007f3881521d0a in _nss_dns_gethostbyname4_r (name=name@entry=0x7f3881171221 "<VM 1 FQDN>", pat=pat@entry=0x7ffc9317bf78, buffer=0x7ffc9317c280 "\177", buflen=1024, errnop=errnop@entry=0x7f388153b6a0, herrnop=herrnop@entry=0x7f388153b704, ttlp=<optimized out>) at nss_dns/dns-host.c:372
#7  0x00007f3881623926 in gaih_inet (name=<optimized out>, name@entry=0x7f3881171221 "<VM 1 FQDN>", service=service@entry=0x0, req=req@entry=0x7ffc9317c6e0, pai=pai@entry=0x7ffc9317c178, naddrs=naddrs@entry=0x7ffc9317c174, tmpbuf=tmpbuf@entry=0x7ffc9317c270) at ../sysdeps/posix/getaddrinfo.c:765
#8  0x00007f38816247a5 in __GI_getaddrinfo (name=<optimized out>, service=<optimized out>, hints=0x7ffc9317c6e0, pai=0x7ffc9317c6d8) at ../sysdeps/posix/getaddrinfo.c:2256
#9  0x000056370593f4d0 in anetResolve (err=0x0, host=0x7ffc9317ab98 "\n", ipbuf=0x7ffc9317c740 "\230\307\027\223\374\177", ipbuf_len=46, flags=2) at anet.c:254
#10 0x0000563705a012ec in createSentinelAddr (hostname=0x7f3881171221 "<VM 1 FQDN>", port=26379, is_accept_unresolved=1) at sentinel.c:566
#11 0x0000563705a02fef in getSentinelRedisInstanceByAddrAndRunID (instances=0x7f38810083b8, addr=0x1 <error: Cannot access memory at address 0x1>, port=1, runid=0x7f3881026053 "f7ffde5682532c1643dbb4d8439fd121cfd7561c") at sentinel.c:1489
#12 0x0000563705a0930a in sentinelProcessHelloMessage (hello=0x7ffc9317ab98 "\n", hello_len=1) at sentinel.c:2876
#13 0x0000563705a097f8 in sentinelPublishCommand (c=0x7f3881163180) at sentinel.c:4530
#14 0x0000563705951740 in call (c=0x7f3881163180, flags=1) at server.c:3519
#15 0x00005637059529e9 in processCommand (c=0x7f3881163180) at server.c:4160
#16 0x0000563705976b37 in processCommandAndResetClient (c=0x7f3881163180) at networking.c:2466
#17 processInputBuffer (c=0x7f3881163180) at networking.c:2574
#18 0x00005637059770a0 in readQueryFromClient (conn=0x1d94) at networking.c:2713
#19 0x0000563705a69fd8 in callHandler (handler=0x1, conn=0x7f3881029280) at connhelpers.h:79
#20 connSocketEventHandler (el=0x7ffc9317ab98, fd=1, clientData=0x7f3881029280, mask=0) at socket.c:298
#21 0x0000563705947d09 in aeProcessEvents (flags=-2130534080, eventLoop=0x1a0) at ae.c:436
#22 aeMain (eventLoop=0x7f388102a140) at ae.c:496
#23 0x000056370593cecd in main (argc=-1827165288, argv=0x7ffc9317cc08) at server.c:7360
```

Here's where I focused:

```
#11 0x0000563705a02fef in getSentinelRedisInstanceByAddrAndRunID (instances=0x7f38810083b8, addr=0x1 <error: Cannot access memory at address 0x1>, port=1, runid=0x7f3881026053 "f7ffde5682532c1643dbb4d8439fd121cfd7561c") at sentinel.c:1489
```

Unfortunately, I'm not sure why this is happening. Thanks in advance for any advice or assistance.

Comment From: bkienker

After some additional trial and error, I have discovered that the Sentinels are able to communicate and operate normally when I use IP addresses everywhere and disable both resolve-hostnames and announce-hostnames.
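For reference, the IP-only workaround amounts to something like this in sentinel.conf (the addresses here are placeholders; my actual configuration files are attached above):

```
sentinel resolve-hostnames no
sentinel announce-hostnames no
sentinel announce-ip 192.0.2.11       # this node's own IP
sentinel announce-port 26379
sentinel monitor {MASTER NAME} 192.0.2.11 6379 2
```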

I have not been able to find any inconsistencies in my DNS configuration, as this replica set is configured identically to all of the others that I operate. However, if this points to a DNS issue, I will continue to chase that lead.

Comment From: bkienker

So this turned out to be a problem with our local DNS servers not responding to incoming TCP requests (UDP was working, so the DNS query failures were difficult to observe). No problem with Sentinel at all; sorry for the trouble.
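For anyone debugging something similar: `dig @<dns-server> <name>` exercises the UDP path, while `dig +tcp @<dns-server> <name>` forces TCP, which makes this kind of asymmetry easy to spot. The wire-format difference between the two transports is small, and a minimal Python sketch of it may help explain why resolvers fall back to TCP at all (the helper names here are mine, for illustration only):

```python
import struct

def build_dns_query(name: str, txid: int = 0x1234) -> bytes:
    """Build a minimal DNS A query in RFC 1035 wire format."""
    # Header: id, flags (RD=1), QDCOUNT=1, ANCOUNT/NSCOUNT/ARCOUNT=0
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero byte
    qname = b"".join(bytes([len(p)]) + p.encode() for p in name.split(".")) + b"\x00"
    return header + qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN

def frame_for_tcp(msg: bytes) -> bytes:
    """DNS over TCP prefixes each message with a 2-byte big-endian length."""
    return struct.pack(">H", len(msg)) + msg

def is_truncated(resp: bytes) -> bool:
    """TC bit set in the flags word tells the client to retry over TCP."""
    (flags,) = struct.unpack(">H", resp[2:4])
    return bool(flags & 0x0200)
```

When a UDP response comes back with the TC bit set, the stub resolver retries the same query over TCP; if the server silently drops TCP, the lookup stalls exactly the way the stack trace above shows.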