Redis Problems with reliability of Pub/Sub subscriptions in different Redis clients

Describe the bug

I'm not sure if this is the right place to report this issue, because it seems to be a problem with Redis clients. However, the same issue is present in every client I've checked (Lettuce, Redisson, Jedis, go-redis).

After a sudden connection loss, Redis clients are unable to detect the network problem and keep listening for Pub/Sub messages on a broken TCP connection for hours, making Pub/Sub unusable.

To reproduce

Start Redis on Host A.
Subscribe to a Pub/Sub channel from Host B using one of the Redis clients.
On Host A, block all traffic to the Redis server using iptables or a similar tool.
The Redis client does not detect that the connection is lost.
Now restart Redis on Host A and restore network traffic.
The Redis client keeps listening on a connection that no longer exists on the server side.

I've managed to reproduce this behavior using three different Java clients, and go-redis. Ticket for Lettuce with more details: https://github.com/lettuce-io/lettuce-core/issues/1428

Expected behavior

Redis clients subscribed to a Pub/Sub channel should be able to detect a broken network connection and reconnect when necessary.

Additional information

The undocumented workaround for this issue is to tweak the OS-level TCP keepalive parameters on the client's host: SO_KEEPALIVE, TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT. This is similar to what the redis-cli client does in the application layer: https://github.com/redis/redis/blob/1c71038540f8877adfd5eb2b6a6013a1a761bc6c/src/redis-cli.c#L908 https://github.com/redis/redis/blob/efb6495a446a92328512f8a66db701dab95fb933/src/anet.c#L95
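
For illustration, here is a minimal Python sketch of setting these options per socket from the application instead of system-wide. The function names and the default values (60 s idle, 10 s interval, 3 probes) are assumptions made for this example, not taken from any Redis client, and the availability of each option depends on the platform.

```python
import socket


def enable_tcp_keepalive(sock, idle=60, interval=10, count=3):
    """Turn on TCP keepalive for one socket (hypothetical helper)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of silence before the first probe is sent.
    # Linux calls this TCP_KEEPIDLE; macOS calls it TCP_KEEPALIVE.
    idle_opt = getattr(socket, "TCP_KEEPIDLE", None) or getattr(socket, "TCP_KEEPALIVE", None)
    if idle_opt is not None:
        sock.setsockopt(socket.IPPROTO_TCP, idle_opt, idle)
    # Seconds between unanswered probes.
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # Number of unanswered probes before the OS drops the connection.
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock


def worst_case_detection_seconds(idle=60, interval=10, count=3):
    """Rough upper bound on how long a dead peer goes unnoticed."""
    return idle + interval * count
```

With the assumed defaults, a silently dropped connection should be reported as an error on the next read after roughly 60 + 10 * 3 = 90 seconds, rather than after hours.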

Is there any other way to make Pub/Sub subscriptions reliable without changing OS parameters? Shouldn't all Redis clients set these socket parameters in the application layer, as redis-cli does?

Comment From: oranagra

Since Redis (the server side) is no longer present, I don't think anything can be done on the server side to mitigate it. It must be handled on the client side, either by the OS or by the client library. TCP keepalive seems like the right solution (that's exactly what it was designed for, AFAIK).

@yossigo do you see anything that can be done on our side other than document it? (which I'm not sure will help much)

Comment From: yossigo

@oranagra Theoretically we could come up with an application level keepalive mechanism where Redis periodically sends a heartbeat message. This would involve a lot of backwards compatibility issues and I am not sure there's a significant benefit that justifies it.

I think the best we can do is raise awareness of this issue among client maintainers, who should consider setting TCP keepalive by default on Pub/Sub connections.

Comment From: oranagra

Even if Redis sends keepalive messages, it is still the client's responsibility to notice that they have stopped and conclude that the connection is dead. Alternatively, the client could send a PING now and then and detect a write failure when the socket is dead. But I don't see any advantage of either approach over TCP keepalive.
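
To make the PING idea concrete, here is a rough Python sketch of the bookkeeping a client would need. The class and parameter names are hypothetical and no real Redis client API is used. One caveat: on a connection whose packets are silently dropped, a write may not fail right away, so the sketch declares the link dead when nothing has been received within a deadline rather than relying on a write error.

```python
import time


class HeartbeatTracker:
    """Decides when to send a PING and when to give up on a connection."""

    def __init__(self, ping_interval, dead_after, clock=time.monotonic):
        self.ping_interval = ping_interval  # seconds between PINGs
        self.dead_after = dead_after        # max seconds of silence tolerated
        self.clock = clock
        now = clock()
        self.last_received = now
        self.last_ping = now

    def on_receive(self):
        # Call for every message or PING reply read from the socket.
        self.last_received = self.clock()

    def ping_due(self):
        # True at most once per ping_interval; the caller then sends PING.
        now = self.clock()
        if now - self.last_ping >= self.ping_interval:
            self.last_ping = now
            return True
        return False

    def is_dead(self):
        # No traffic at all for longer than dead_after: reconnect.
        return self.clock() - self.last_received > self.dead_after
```

A subscriber loop would call `ping_due()` and `is_dead()` on each iteration of a read with a short timeout, and would reconnect and resubscribe once `is_dead()` returns true.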

@itamarhaber do you know where something like that can be documented? and how to bring this to the attention of existing client maintainers?

Comment From: tzickel

This is a general issue with long-lived, silent TCP connections; it is specific to neither Redis nor Pub/Sub. Consider a blocking operation with an infinite timeout such as BLPOP: there you can't even send a PING, whereas on Pub/Sub you can.

It can happen in many ways. Think of a connection pool where one of the connections has stalled as described above: the client sends a command on that connection and never receives a response (what is a good timeout for that?).

Clients should provide sensible ways to mitigate the variety of issues that can arise from this:
When taking a connection from a pool that has not been used in a while, try a PING before using it (redis-py has this, but it is disabled by default):

https://github.com/andymccurdy/redis-py/blob/master/redis/connection.py#L676

When possible (as in Pub/Sub), send software keepalive PINGs (the difficulty is how easily and portably a library can send a PING once in a while without involving its end user).
Expose the OS-level keepalive settings in an easy way (most clients expose them in a raw form that is neither easy nor portable). Compare redis-py, where you have to know the options for your OS: https://github.com/andymccurdy/redis-py/blob/master/redis/connection.py#L590 with justredis, where you just specify the keepalive interval and it tries to be smart about it: https://github.com/tzickel/justredis/blob/master/justredis/sync/environments/threaded.py#L27

I had lots of strange issues in my code where some of the Redis connections would occasionally hang for no apparent reason. It happened often enough that I ended up enabling client-side OS keepalive, which fixed the issue.
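
The first mitigation above (PING a pooled connection that has been idle for a while before handing it out) could look roughly like the following Python sketch. All names here are hypothetical and this is not how redis-py implements its check. The `ping` callable is assumed to use a socket timeout, so that a stalled connection raises an error instead of blocking forever.

```python
import time


class IdleCheckedConnection:
    """Hypothetical pooled connection that remembers when it was last used."""

    def __init__(self, ping, clock=time.monotonic):
        self.ping = ping    # callable: truthy if alive, raises OSError if not
        self.clock = clock
        self.last_used = clock()

    def idle_seconds(self):
        return self.clock() - self.last_used

    def touch(self):
        self.last_used = self.clock()


def checkout(conn, max_idle, reconnect):
    """Return a usable connection, replacing it if an idle check fails."""
    if conn.idle_seconds() >= max_idle:
        try:
            alive = bool(conn.ping())
        except OSError:  # includes socket timeouts
            alive = False
        if not alive:
            conn = reconnect()  # assumed to return a fresh connection
    conn.touch()
    return conn
```

Recently used connections skip the PING entirely, so the extra round trip is only paid after `max_idle` seconds of silence.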

Comment From: zth9

It would be better to mention this use case in the Redis client documentation and in the client libraries for different languages.