Redis [BUG] Potential Error Handling Bug in connSocketBlockingConnect When aeWait Fails

Describe the bug

There appears to be a potential issue in the connSocketBlockingConnect function, where a failure of the aeWait function (due to timeout or other reasons) does not result in the function returning C_ERR. The function sets the connection state to CONN_STATE_ERROR and the last error to ETIMEDOUT, but then proceeds to assign the file descriptor and state as if the connection was successful, without returning an error code.

To reproduce

This issue is related to the internal error handling logic and may not be easily reproducible without modifying the source to simulate an aeWait failure. The relevant code section is:

if ((aeWait(fd, mask: AE_WRITABLE, milliseconds: timeout) & AE_WRITABLE) == 0) {
    conn->state = CONN_STATE_ERROR;
    conn->last_errno = ETIMEDOUT;
    // Potentially missing return C_ERR here
}

Expected behavior

When aeWait fails, whether due to a timeout or any other reason, the expected behavior would be for connSocketBlockingConnect to return C_ERR, indicating the connection attempt was unsuccessful. This would be in line with the handling of the initial connection attempt, where a failure to connect does return C_ERR.

Additional information

The issue was observed in the connSocketBlockingConnect function where the second if condition checks for writability using aeWait.
The lack of a return statement following the setting of CONN_STATE_ERROR may result in the function incorrectly proceeding as if the connection was established.

Comment From: zuiderkwast

I agree, it looks like a typo. Just below this we have the following code:

conn->fd = fd;
conn->state = CONN_STATE_CONNECTED;
return C_OK;

Do you want to open a PR?

I wonder if it's possible to test it in a reliable way. connBlockingConnect seems to be used only in the MIGRATE command (cluster.c). But if connect succeeds, I believe the socket is always writable since the OS has a buffer and can read some data even if the application doesn't read it immediately. Is this actually dead code?

Comment From: ganyyy

I agree, it looks like a typo. Just below this we have the following code:

conn->fd = fd; conn->state = CONN_STATE_CONNECTED; return C_OK;

Do you want to open a PR?

I wonder if it's possible to test it in a reliable way. connBlockingConnect seems to be used only in the MIGRATE command (cluster.c). But if connect succeeds, I believe the socket is always writable since the OS has a buffer and can read some data even if the application doesn't read it immediately. Is this actually dead code?

Thank you for your response and for acknowledging the possibility of a typo.

Regarding your point about the socket always being writable after a successful connect, my understanding and tests suggest that while this is often the case, it may not always hold true under certain conditions, such as when connecting to a non-existent external IP address. In my testing, I found that setting connect to non-blocking mode and specifying a short timeout could result in a failure due to timeout, in which case the socket would not become writable.

Specifically, in the connSocketBlockingConnect function, when the aeWait call returns 0 (indicating a timeout), the current implementation does not immediately return C_ERR, but continues execution, potentially leading to inconsistent connection states. In my local tests, I observed different error handling behaviors depending on whether the function returns C_ERR or not after aeWait reports a timeout error.

Based on these observations, I believe that modifying this part to ensure the correct return of C_ERR in case of timeouts and other errors would be meaningful. This would help ensure the function behaves as expected and can properly handle timeouts and other potential connection errors.

As for reliably testing this scenario, it indeed poses a challenge as it requires simulating specific network behaviors, like connecting to a non-existent address or simulating a timeout. However, I think it should be possible to construct such test scenarios by setting up specific network conditions or using network simulation tools.

If you think it's appropriate, I can open a PR to address this issue.

Comment From: zuiderkwast

Good. Yes at least we should add the return C_ERR. You have a good explanation. Maybe we can write a test but if it is hard, maybe we can skip it.

Comment From: zuiderkwast

Let's keep it open until the PR is merged. It will be closed automatically when the PR is merged.