@antirez In 6.0 you changed readQueryFromClient to call freeClientAsync rather than freeClient (it was for threaded IO).

so although redis 6.0 now calls freeClientsInAsyncFreeQueue from beforeSleep rather than serverCron, neither of them is called while loading.

this means that now, if a client connects and drops during loading, it is added to clients_to_close and remains alive (listed as a client, with its socket not closed) until loading is done.
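for context, here is a rough paraphrase of the two functions involved (a simplified sketch, not the actual networking.c code; the threaded-IO locking and some flag checks are omitted):

```c
/* simplified sketch, not the real code: freeClientAsync() only queues
 * the client on server.clients_to_close, nothing is freed here. */
void freeClientAsync(client *c) {
    if (c->flags & CLIENT_CLOSE_ASAP) return;   /* already queued */
    c->flags |= CLIENT_CLOSE_ASAP;
    listAddNodeTail(server.clients_to_close, c);
}

/* ...and nothing gets freed until this runs. in 6.0 it runs from
 * beforeSleep(), which is never reached while loading, since loading
 * spins in processEventsWhileBlocked() instead of returning to aeMain(). */
void freeClientsInAsyncFreeQueue(void) {
    listNode *ln;
    while ((ln = listFirst(server.clients_to_close)) != NULL) {
        client *c = listNodeValue(ln);
        c->flags &= ~CLIENT_CLOSE_ASAP;
        freeClient(c);
        listDelNode(server.clients_to_close, ln);
    }
}
```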

if stats monitoring tools or watchdogs sample the loading progress by repeatedly opening and closing connections, then over time, on a long loading, thousands of these connections accumulate. (i have circumstantial evidence that this can also slow down the loading process itself; i'll post details in the next post)

it is interesting to note that processEventsWhileBlocked does call afterSleep (4 times), but never calls beforeSleep. this looks very odd, so one possible solution is to move the call to beforeSleep from aeMain into aeProcessEvents (right before the call to aeApiPoll, which is already followed by afterSleep), as sketched below.
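roughly what i have in mind (just a sketch of where the call would land in ae.c, not a patch; the AE_CALL_BEFORE_SLEEP flag is a name i'm making up for the sketch):

```c
int aeProcessEvents(aeEventLoop *eventLoop, int flags) {
    /* ... existing timer / tvp computation omitted ... */

    /* proposed: invoke beforesleep here instead of in aeMain(), so
     * that processEventsWhileBlocked() gets it too. */
    if (eventLoop->beforesleep != NULL && (flags & AE_CALL_BEFORE_SLEEP))
        eventLoop->beforesleep(eventLoop);

    numevents = aeApiPoll(eventLoop, tvp);

    /* existing behavior: aftersleep already runs right after the poll. */
    if (eventLoop->aftersleep != NULL && (flags & AE_CALL_AFTER_SLEEP))
        eventLoop->aftersleep(eventLoop);

    /* ... existing processing of fired events omitted ... */
}
```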

we need to carefully consider what side effects such a change could cause, and if we can't afford it, maybe just call freeClientsInAsyncFreeQueue from processEventsWhileBlocked (which is a very specifically crafted function anyway), e.g. along the lines of the sketch below.
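the smaller alternative would look something like this (an abbreviated sketch; the real 6.0 function also counts processed events and updates the cached time, and i'm not quoting its exact flags):

```c
void processEventsWhileBlocked(void) {
    int iterations = 4; /* see the function's top comment. */
    ProcessingEventsWhileBlocked = 1;
    while (iterations--) {
        aeProcessEvents(server.el,
                        AE_FILE_EVENTS|AE_DONT_WAIT|AE_CALL_AFTER_SLEEP);
        /* proposed addition: don't let dropped clients pile up in
         * clients_to_close while loading, since beforeSleep() never
         * runs in this state. */
        freeClientsInAsyncFreeQueue();
    }
    ProcessingEventsWhileBlocked = 0;
}
```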

it is important to note, however, that in some cases (diskless replica) processEventsWhileBlocked is called from within a read event handler (readSyncBulkPayload). i suppose it won't detect that the master disconnected and try to free it from that call path, so maybe it's still ok.

Comment From: oranagra

some possibly related side effects: i watched a case where loading an RDB (ROF) that took about a minute to generate took hours to load. during the long loading i saw thousands of clients connected to the slave, and random sampling of the stack trace (trying to figure out the reason for the slowness) showed that it was often inside processEventsWhileBlocked -> connRead. i don't yet fully understand it, but it looks like the massive list of clients also causes some additional slowdown, and indeed loading_eta_seconds keeps growing and growing (rather than shrinking).

Comment From: oranagra

Update: The reason for the slowness is that epoll is used in level-triggered mode, so when we get the read error but don't close the connection, every subsequent poll reports a read event for it again, and we end up trying to read from each of the closed connections on every iteration.
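Here is a tiny standalone demo of that epoll behavior (Linux, plain C, nothing to do with the Redis code itself): a socket whose peer has closed stays readable, and with the default level-triggered mode epoll_wait keeps returning it on every call until the fd is closed or removed from the set.

```c
/* gcc lt_epoll_demo.c -o lt_epoll_demo && ./lt_epoll_demo */
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>

int main(void) {
    int sv[2];
    socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    close(sv[1]);                       /* the peer "drops" the connection */

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sv[0] };
    epoll_ctl(ep, EPOLL_CTL_ADD, sv[0], &ev);   /* level-triggered by default */

    for (int i = 0; i < 3; i++) {
        struct epoll_event out;
        int n = epoll_wait(ep, &out, 1, 100);
        char buf[16];
        ssize_t r = (n == 1) ? read(out.data.fd, buf, sizeof(buf)) : -1;
        /* prints "events=1 read=0" three times: the dead socket is
         * reported readable on every poll because we never close it. */
        printf("poll %d: events=%d read=%zd\n", i, n, r);
    }
    close(sv[0]);
    close(ep);
    return 0;
}
```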

The mention of a crash in an earlier version of this issue was a false positive (the WD killed the process with SIGSEGV).