Redis [NEW] Avoid allocations for RedisModuleString

The problem/use-case that the feature addresses

Avoid allocations in modules, when passing strings to/from module API.

Description of the feature

Add a new string representation containing a plain C-string, which is not owned by the RedisModuleString allocation and thus shall be treated as a const char*. Allow RedisModuleString to have this representation.

This const char pointer can be pointing into internal memory inside the database, without copying. All (at least many) string functions in object.c need to handle the new encoding. Some functions in module.c need to be updated.

How to store the length? See Additional info below. (strlen() can be used for nul-terminated strings, but som internal strings, e.g. inside ziplists, are not nul-terminated.)

There's also a risk in allowing non-nul-terminated strings (a comment in sds.c says "The string is always null-termined (all the sds strings are, always)") and thus the pointers you get from RedisModule_StringPtrLen() are nul-terminated up to now.

~~Strings passed from a module to the API function can easily be constructed as in Example 1 below.~~

~~Strings returned from API functions as pointers would need to be allocated somewhere and only be valid until the next call, for example as in RM_Foo() in example 2.~~

Alternatives you've considered (if this feature is not implemented)

Allow modules to avoid creating allocated strings by adding module API functions operating on plain old const char *str and size_t len.

Additional information

@oranagra wrote:

robj is 16 bytes (on 64bit systems), has encoding, refcount and lru which are completely useless for our cause.

you may not like it, but here's my proposals:

create a new type of for the type field: #define OBJ_STRING_PTR 15, and use the encoding, refcount and lru fields bits for the length (60 bits), and the ptr as the data.
RedisModuleString is always a heap allocation, so we can set the LSB of the pointer we return to the module to 1 when the string is RedisModuleStaticString allocation rather than an robj, in which case we cast it to that type, a struct which has size_t len, and void* ptr (also 16 bytes on 64bit systems)
if we don't like using the LSB of the pointer we hand to the user, we can reserve one bit in the robj's type member instead, and store lengths of up to 63 bits (note that we can't use a bit of the ptr member, since these are not always heap allocation pointers).

i prefer option 2, considering RedisModuleString is an opaque type, the modules don't need normally need to care how me manage it. although they will need to know that some operations are impossible with some strings, like they already have (not being able to RM_RetainString OBJ_STATIC_REFCOUNT

Examples

Example 1 (idea only -- ignore namings), module passes string to API:

[Edit] Doesn't work---module doesn't know the sizeof RedisModuleString.

int set_mykey_to_hello(RedisModuleContext *ctx, RedisModuleString **argc, int argc) {
    /* Create a RedisModuleString image of a char array, for use in API  */
    RedisModuleString s;
    RedisModule_InitConstNultermString(&s, "my-key");
    RedisModuleKey *key = RedisModule_OpenKey(ctx, &s, REDISMODULE_WRITE);
    RedisModule_InitConstNultermString(&s, "hello");
    RedisModule_StringSet(key, &s);
    return RedisModule_ReplyWithString(ctx, &s);
}

Example 2, API returns string to module:

static RedisModuleString foo_str;

/* The returned string is only valid until the next call to RM_Foo(). */
RedisModuleString *RM_Foo() {
    RedisModule_InitConstNultermString(&foo_str, "some-internal-string");
    return &foo_str;
}

Comment From: oranagra

we can't afford the module to create any of these string objects on the stack and pass them to some API to be filled, doing so means that this is no longer an opaque type (the module needs to know the size of the struct in order to allocate it on the stack.
i think the small heap allocation of a 16 bytes struct is something we can live with (we wanna avoid copying the actual string bytes), but creating a small heap allocation to store the pointer and the length seems like an acceptable option to me.
redis should never assume these are null terminated strings. they can always be just binary blobs with many null charts in them, or none. when redis give the char to the user it'll always provide a length, and when the user give that char to an API he should provide a length he can use strlen if he likes, but i don't think redus should.

Comment From: oranagra

re-opening the discussion on my last argument about opaque structs, we have two main options: 1. create a new distinct type like RedisModuleStaticString which is not opaque (modules know it has a char* and a size_t members), and these can be allocated on the stack of the module, and the module doesn't even need to ask redis to populate them, it can do it on it's own. 2. keep the existing RedisModuleString opaque type, in which case it can never be places on the module's stack (it doesn't know the size of the struct)

the disadvantage of option 1, is that it's a new type, and it cannot be passed to all the exiting APIs that take RedisModuleString. this is why i think we must keep the type opaque, and in that case it can always be a heap allocation, and redis can use the MSB of the pointer as an indication if that's an robj with an sds, or a struct with size_t and char*.

Comment From: zuiderkwast

OK, so one allocation is better than many. It will be worth benchmarking whether RM_Call("LPOS") is faster than repeated calls to RM_ListIteratorNext() if the latter allocates 16 bytes for the return value each time. As long as LPOS can iterate without allocation, it will always be faster/cheaper for large lists.

The idea is still good, since existing API functions can be optimized to use this new static string type.

Comment From: zuiderkwast

What if we add the new non-opaque RedisModuleStaticString and let the module populate it. Then we add a fancy abstracted casting function...

RedisModuleString *RM_StringViewOfStaticString(RedisModuleStaticString *st) {
    flip_some_bits(st); /* see [Edit] below */
    return (RedisModuleString *)st;
}

... then RedisModuleString is still opaque. :-D

[Edit] If we make it destructive, it could work. E.g. we shift the size field to encode stuff to distinguish it from robj.

Comment From: oranagra

i think the important part is that we don't copy the string.. the extra 16 bytes allocations is not the heavy part. the heavy part is allocating and memcpying the string itself, which can be megabytes and such.

so about your last proposal, some APIs take the new non-opaque RedisModuleStaticString struct which the module can create on the stack, and for the other APIs, we change the opaque RedisModuleString in some way so that similarly to my proposal, it can hold either an robj or a RedisModuleStaticString? i.e. this way the existing APIs can take the new type, without needing to copy the content into a normal RedisModuleString?

Comment From: zuiderkwast

Well, yes once you have managed to cast it to a RedisModuleString *, it can be used in all APIs requiring it. It's ugly... but I'm trying to find a way, haha. It should look like a RedisModuleString in the end, that's the idea. I think your idea of accepting the 16 bytes allocation is better.

Yet another idea: We give RedisModuleString a size which is large enough: typedef RedisModuleString char[16]. Only the size is visible, the rest is opaque, so it can be on the stack and be initialized with a function.

Comment From: zuiderkwast

Let's say we go for only using RedisModuleString * in the API and use the LSB bit for tagging. I can implement it unless you prefer I finish something else first?

Future idea: Bit tricks on pointers can be abused a lot to encode various things. We have at least 2 alignment bits, so we could use the other one to store short strings inside the pointer itself (up to 7 bytes long for 64 bit and 3 bytes for 32 bit platforms), using the 2nd LSB for tagging.

(Some perspective: On x86-64, only 48 of the 64 bits in a pointer are actually used. The rest can be combined with NaN-boxing to store a double inside the pointer, as well as small integers and other stuff. These tricks are often used in dynamically typed language runtimes, such as JavaScript. An encoding scheme I made once inspired by these, can store small integers, doubles and up to 6 bytes long strings.)

Comment From: yossigo

I'm in favor of adding a type and encoding the length in the rest of the robj, rather than trying to do pointer encoding tricks which are not really needed here and do come with some penalties (e.g. valgrind).

This alone should handle Redis->Module const RedisModuleString pointing to arbitrary memory buffers. This is mostly but not totally backwards compatible and we should identify the cases where it's not. Stack allocation is not an issue because it's Redis->Module.

For the Module->Redis case I am in favor of remaining with a single RedisModuleString opaque type and providing an initializer function: * Initially we can do naive heap allocations * If we prove that in some tight loops allocations have a big impact, we can rely having a RedisModuleCtx everywhere and reserve a bit of extra space for a mini slab allocator of RedisModuleStrings in it. We'll just need to make sure the initializer function takes a RedisModuleCtx.

Comment From: oranagra

ok. so 60 bits for length. We can keep all our existing APIs, but we may need to handle the new case in many of them. when the module gets a call a redis API that returns a string it can be this new type of string that refers to the internal memory, and then the module adds it to the reply buffers we need to be able to handle it correctly. In addition to that the module can't RM_Retain that string, but it can RM_Hold it. And we'll add a new API that let's the module create such strings representing it's internal memory too.

The next challenge may be to know which modules can handle these strings, or which APIs can return them. i.e. it's maybe save to use them in new APIs since they'll only be used by new modules, but can we also use them in old APIs too (like RM_HashGet)? maybe we need the module to declare capability to support these?

Comment From: zuiderkwast

Good point about valgrind. Probably they should always be returned as const RedisModuleString * from the API and all APIs where they're already declared as const should work. Maybe we can hurry and add const in the Stream API before 6.2 is released... As input (Module->Redis), I think they can be handled in all cases.

It would be nice if object.c can handle them generally, to cover all cases where module.c calls into the rest of Redis, e.g. compareStringObjects(), lookupKeyReadWithFlags() and even the dict lookup functions and such, but I don't know it's that's feasible. Or should we limit this encoding to RedisModuleString and module.c? (Most cases are easy to handle, e.g. RM_ReplyWithString() can just call addReplyBulkCString() with ptr and length for this kind of strings.)

We could call them "borrowed strings" since the struct doesn't own the content.

Comment From: zuiderkwast

And it would be very nice if e.g. RM_HashGet could return them. Just saying.

[Edit] Any module that calls RedisModule_StringAppendBuffer() on a string returned by the API would break, although the doc already forbids it ("The string must be a string created by the user that is referenced only a single time"). Perhaps it's a good idea to require a module option/capability or a ctx flag to allow RM_HashGet() and others to return these strings.

Comment From: oranagra

btw, Valgrind is not really an issue, redis is already holding pointers to the middle of an allocation (all sds do that), we just run valgrind with --show-possibly-lost=no (we do that anyway).

Maybe the advantage of using the spare bits of robj rather than using the MSB to detect that it's a different struct, is that maybe indeed we can support this type of robj string in deeps parts of redis (like compareStringObjects, lookupKeyReadWithFlags which you mentioned) rather than just in module.c

the disadvantage is that it'll be harder to get rid of that dreaded robj one day. not that it is used in many many places in redis today, and much of it's content is completely useless (i.e. the type and LRU bits are not used anywhere except for the main dict). and the refcount can arguably one day move to the sds, and be removed from the robj.

the reason we can't use these automatically in RM_HashGet is that old modules may call RM_RetainString on the result and that would break them.

Comment From: zuiderkwast

Would it be horrible to let RetainString silently replace the contents of the RedisModuleString with an allocated copy, converting it to a regular robj?

Comment From: oranagra

i suppose it's an option if the original robj string is a heap allocation (it might be problematic but i can't yet put my finger on the problem). it won't be possible if the original robj string is allocated on the stack, but i suppose these will have a very limited use and only in new callbacks.

this is actually not a new concern, we already have some callbacks today that are using stack allocated robj. here are two examples:

moduleNotifyKeyspaceEvent(NOTIFY_LOADED, "loaded", &keyobj, db->id);

also applies for RM_GetKeyNameFromIO

Comment From: zuiderkwast

Interesting. Then, I can see one possibility to tell them apart: Use one bit in the robj->type. So one type (15) for heap allocated borrowed strings and another (14) for shared/static/stack borrowed strings. We can start with the heap allocated ones.

Access using o->ptr works for both sds and these, and the length can be abstracted as stringObjectLength(robj *o) which would work for both sds and these (but not for the OBJ_ENCODING_INT ones). Maybe it would be good to have isStringObject(robj *) as well.

One more way to avoid allocations would be to allow RedisModuleString with OBJ_ENCODING_INT when returning integer values from e.g. HashGet and when creating strings with RM_CreateStringFromLongLong.

Comment From: oranagra

there's another complication. i spoke with @MeirShpilraien and he said many modules rely on the fact sds makes sure to null-terminate it's strings, and may pass only the char down some call stack to be processed. so if we take a string from a ziplist, we char and length but no null terminator. that means that if there is an old module that was compiled against redis 4.0, we may rely on the opaqueness of RedisModuleString to and all the updated code in redis handling static string.

i can think of two general directions to handle that problem. 1. don't return these static strings for old (pre-existing APIs), or have the module declare if it supports these string or not. 2. change our ziplists to be null terminated as well. (increase memory usage)

Comment From: zuiderkwast

I vote for "have the module declare if it supports these string or not", making it an opt-in optimization. Modules are not important enough to justify suboptimal ziplists.

Another alternative: Let RM_StringPtrLen() silently replace the contents with an allocated sds-string.

Comment From: yossigo

@oranagra AFAIR the API never explicitly guaranteed RedisModuleString to be null terminated. That means those modules are buggy (they'd also fail today if strings have nulls in them) - do we really want to complicate the API further for this reason?

Comment From: zuiderkwast

I've started experimenting with this.

It's seems dangerous to return pointers into ziplists. Once you touch the same ziplist (e.g. set a field in the same hash, do another list push, etc.) the RedisModuleString becomes invalid.

Example: Set a hash field and return the old value:

    if (RedisModule_HashGet(key, 0, field, &oldvalue, NULL) == REDISMODULE_OK) {
        // oldvalue now contains a pointer into the ziplist
        RedisModule_HashSet(key, 0, field, newvalue, NULL);
        RedisModule_ReplyWithString(ctx, oldvalue); // <---- oldvalue is invalid now!
    }

The difference to a regular owned string is more explicit if the pointer is returned as a plain const char * and along with proper documentation maybe it would be useful in some cases, but maybe we should instead go for search/seek functions like LPOS, not exposing pointers into the database. What do you think?

Allowing RedisModuleString to be integer-encoded robjs seems safe though and less radical. (As long as it's returned by the API with refcount == 1, we can convert it to an sds-string on demand, e.g. when RM_StringPtrLen() is called on it, or RM_RetainString or RM_HoldString.)

Comment From: zuiderkwast

I'm closing this, since passing around a reference to stuff inside another object without incrementing the refcount on that object (the ziplist, or the robj of type list or hash) can become quite a mess. Thanks for a good discussion anyway!

If you think integer-encoded RedisModuleString is still worth doing, please reopen or we can create another issue.

Comment From: oranagra

I think we still wanna do something (that does pass a reference to the internal memory), just didn't yet found the right design, and the ones we did all have limitations (maybe there's no perfect one, and we'll have to add a bunch of new apis). currently we don't have the bandwidth to keep looking.