Redis [CRASH] worker thread call RedisModule_ReplicateVerbatim(ctx); trigger redis crash

Crash report

=== REDIS BUG REPORT START: Cut & paste starting from here === 25802:M 12 Oct 2023 02:19:12.590 # === ASSERTION FAILED === 25802:M 12 Oct 2023 02:19:12.590 # ==> blocked.c:624 'server.also_propagate.numops == 0' is not true

Additional information Just add RedisModule_ReplicateVerbatim(ctx); as bellow. cause redis crash.

(svnenv)root@7007e606a60f:/trunk/redis/redis-sentinel-test/redis-7.0.11/src/modules$ diff -u helloblock.c helloblock.c.new --- helloblock.c 2023-10-12 02:23:01.167059976 +0000 +++ helloblock.c.new 2023-10-12 02:23:33.148083078 +0000 @@ -168,6 +168,7 @@ } while (cursor != 0); RedisModule_ReplySetArrayLength(ctx,replylen);

RedisModule_ReplicateVerbatim(ctx); RedisModule_FreeThreadSafeContext(ctx); RedisModule_UnblockClient(bc,NULL); return NULL;

It only happens against redis 7.x. Not against redis 6.x. So the question is if it is redis bug or RedisModule_ReplicateVerbatim is not expected to be invoked in worker thread since redis 7?

Comment From: guybe7

@zh1029 this is the correct way to do it inside a thread:

diff --git a/src/modules/helloblock.c b/src/modules/helloblock.c
index dc3d74975..e5f7eac9f 100644
--- a/src/modules/helloblock.c
+++ b/src/modules/helloblock.c
@@ -168,6 +168,10 @@ void *HelloKeys_ThreadMain(void *arg) {
     } while (cursor != 0);
     RedisModule_ReplySetArrayLength(ctx,replylen);

+    RedisModule_ThreadSafeContextLock(ctx);
+    RedisModule_ReplicateVerbatim(ctx);
+    RedisModule_ThreadSafeContextUnlock(ctx);
+
     RedisModule_FreeThreadSafeContext(ctx);
     RedisModule_UnblockClient(bc,NULL);
     return NULL;

@oranagra this is somewhat of a breaking change compared to 6.2 - should we document it somewhere?

Comment From: oranagra

You mean that before 7.0 it was possible to replicate without locking, and now it's not? We can clearly state it in the api docs. However, If it wasn't documented that the api is valid without locking, then maybe the fact it used to work because how the internals are laid out, is not a reason to consider that a breaking change? I.e. if it was implicitly "documented" that it's an invalid use.

Comment From: zh1029

Yes. It is verified that no crash against redis 6.x.

Comparing redis 6.x and redis 7.x code. The difference is below two lines are added in function handleClientsBlockedOnKeys()

**serverAssert(server.also_propagate.numops == 0);
server.core_propagates = 1;**

Which triggers crash from that code.

So better to be documented for the API usage change.

Comment From: sundb

this is a side effect of #9890. before #9890, when we're in the thread-safe context, we'll propagate immediately, rather than waiting util the end. @oranagra @guybe7 do we consider just documenting it or avoiding this side effect?

    if (ctx->flags & REDISMODULE_CTX_THREAD_SAFE) {
        propagate(ctx->client->db->id,argv,argc,target);
    } else {
        moduleReplicateMultiIfNeeded(ctx);
        alsoPropagate(ctx->client->db->id,argv,argc,target);
    }

Comment From: oranagra

is there any reason to use it like that? i.e. any reason to call replicateVerbatim from a thread? maybe before blocking the module can't decide yet if it wants to replicate? if there's no reason for that, we can document it. if there is any reason to use it like so, does the proposed fix has any downsides?

Comment From: sundb

@oranagra RM_Replicate* has never been thread-safe even before #9890, it just wasn't exposed. whether a module api is thread-safe has always been ambiguous, just like #13010, should we add a new field to redis-doc to indicate whether an api is thread-safe or not, otherwise we'll just have to add whether it's thread-safe or not to the top of each api.

Comment From: oranagra

i don't think i understand what new field you refer to. i think we should add a line to document the ones that are designed to be thread safe, and leave the rest undefined.

Comment From: sundb

Redis [CRASH] worker thread call RedisModule_ReplicateVerbatim(ctx); trigger redis crash Adding a new Thread Safe filed, i think not all the module developers will come to see the implementation.

Comment From: oranagra

i didn't mean they need to see the implementation, i meant we'll mention it in the free text of the doc (not as a field), and just for the ones we know are designed to be thread safe. for the others, the module developers will need to assume they're not. i think adding that flag per command is a lot more work, and i think that in some cases it's not just a boolean thing, it'll come with a set of conditions or restrictions.

Comment From: sundb

@oranagra A side effect #9890 is that it changes the behavior of RM_Replicate in thread-safe ctx. now it will always be wrapped in multi/exec, instead of being passed out immediately. i can't see from the history why RM_Replicate was designed that way, and the tests also don't explain why it was needed.

as the documentation mentions:

 However when calling this function from a threaded safe context
 * that can live an undefined amount of time, and can be locked/unlocked in
 * at will, the behavior is different: MULTI/EXEC wrapper is not emitted
 * and the command specified is inserted in the AOF and replication stream
 * immediately.

Comment From: oranagra

now it will always be wrapped in multi/exec

@sundb sorry for circulating to this just now. are you saying that besides of whether or not this function is thread safe (i.e. can be used without locking), it also had misleading (outdated) documentation? i.e. the MULTI/EXEC wrapper is not emitted part is wrong?

Comment From: sundb

@oranagra yes, i create a PR to update their document on thread safe and MULTI/EXEC wrapper.