The problem/use-case that the feature addresses

String matching can solve many problems, and I think it is more commonly used than LCS. For example, it can be used to search for a word in an article or to check whether stored data contains key information. I suspect the team has already discussed this, but I don't know why it was never added. Since we already have LCS, why not add a string matching command that is more likely to be used than LCS?

Description of the feature

By default, the command can return the index of the first match of the pattern in the string. An option similar to an offset option could return the index of the nth match, and an option similar to a count option could return multiple indices.

Additional information

There are already many kinds of string matching algorithms, such as KMP, BM, Sunday, Rabin-Karp, etc. If we want to implement this command, I personally think two of them are suitable.

1. KMP: Assuming the length of the source string is m and the length of the pattern string is n, the time complexity of KMP is O(m+n) and the space complexity is O(n). The reason I think it is suitable is that KMP's running time is stable.

2. Sunday: Assuming the length of the source string is m and the length of the pattern string is n, the worst-case time complexity of Sunday is O(mn), and the space complexity is O(σ), where σ is the alphabet size (128 in the code below). Why do I still think it is suitable when its worst-case time complexity is O(mn)? Because outside the worst case, the Sunday algorithm is much faster than KMP, sometimes 5 to 6 times faster.

Why don't I recommend BM and Rabin-Karp? For BM, the worst-case time complexity is O(mn) and the space complexity is O(n), the same as Sunday, yet Sunday is faster than BM in many cases and BM's code is also much more complicated. If we were going to choose BM, we might as well use Sunday. As for Rabin-Karp, it compares the hash values of the two strings. There is always some chance of a collision (however small) when computing hash values, and if two different strings end up with the same hash value, we would return the wrong result.

Here is the code I wrote to compare the performance of KMP and Sunday.

KMP:

#include <cstdio>   /* printf */
#include <cstring>  /* strlen, memset */
#include <ctime>    /* clock */

/* Build the KMP failure table for pattern 'p'.
 * 'next' must have room for strlen(p) + 1 entries, because next[len] is written. */
void getNext(const char* p, int next[])
{
    int len = strlen(p);
    next[0] = -1;
    int i = 0, j = -1;

    while (i < len)
    {
        if (j == -1 || p[i] == p[j]) {
            ++i;
            ++j;
            next[i] = j;
        }
        else {
            j = next[j];
        }
    }
}
/* Return the index of the first occurrence of 'pattern' in 'src', or -1 if not found.
 * 'next' must already be filled in by getNext(pattern, next). */
int KMP(int next[], const char* src, const char* pattern)
{
    int i = 0; int j = 0;
    int slen = strlen(src), plen = strlen(pattern);

    if (slen == 0 || plen == 0 || slen < plen) {
        return -1; /* Not found. */
    }

    while (i < slen && j < plen)
    {
        if (j == -1 || src[i] == pattern[j]) {
            i++;
            j++;
        }
        else {
            j = next[j];
        }

    }
    if (j == plen) {
        return i - j;
    }
    else {
        return -1;
    }
}

Sunday:

/* Build the Sunday shift table: for each character c, shift[c] is how far 'pos'
 * can advance when the character just past the current window is c. */
void SundayPre(int shift[], const char* P, int plen)
{
    int i;
    for (i = 0; i < 128; i++) {
        shift[i] = plen + 1;
    }
    for (i = 0; i < plen; i++) {
        shift[P[i]] = plen - i;
    }
}

int Sunday(const char* src, const char* pattern) {
    int i, j;
    int slen = strlen(src), plen = strlen(pattern);

    /* 'shift' stores the distance that 'pos' should move when characters are mismatched 
     * Its index is the ASCII. Here I simply set the array size to 128, 
     * which can only store characters with ASCII values from 0 to 127.
     * Its value is the length that `pos` should be shifted back 
     * if there is a mismatch of character corresponding to the index. */
    int shift[128];
    SundayPre(shift, pattern, plen);
    int pos = 0; 
    while (pos <= slen - plen) {
        j = 0;
        /* Check the bound before comparing so we never rely on the terminator. */
        while (j < plen && src[pos + j] == pattern[j]) j++;
        if (j >= plen) {
            return pos;
        }
        else {
            pos += shift[src[pos + plen]];
        }
    }
    /* not found. */
    return -1;
}

main:

int main() {

    const char* src_A = "Keep faith and hope for the future. Make your most sincere dreams, and when the opportunities come, they will fight for them. \
It may take a season or more, but the ending will not change. Ambition, best, become a reality. An uncertain future, only one step at a time, the hope can realize the dream of the highest. \
We must treasure the dream, to protect it a season, let it in the heart quietly germinal. However, we have to gently protect our hearts deep expectations, slowly dream, will achieve new life .";
    const char* pattern_A = "expectations";

    const char* src_B = "RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR\
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR\
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRedisR.";
    const char* pattern_B = "RRRRRRRRRRRRRRRRedisR";
    int idx = 0;

    /* normal test */
    clock_t startTime = clock();
    for (int j = 0; j < 1000000; j++) {
        idx = Sunday(src_A, pattern_A);
    }
    clock_t endTime = clock();  
    printf("Simulates a normal string search (matching a word in an article): \n");
    printf("index: %d\n" , idx);
    printf("Sunday algorithm time: %ld\n", (endTime - startTime));

    startTime = clock();
    for (int j = 0; j < 1000000; j++) {
        size_t len = strlen(pattern_A) + 1; /* getNext writes next[plen] */
        int* next = new int[len];
        memset(next, 0, len * sizeof(int));
        getNext(pattern_A, next);
        idx = KMP(next, src_A, pattern_A);
        delete[] next; /* avoid leaking one array per iteration */
    }
    endTime = clock();
    printf("index: %d\n", idx);
    printf("KMP algorithm time: %ld\n", (endTime - startTime));

    /* worst case for "Sunday" test */
    startTime = clock();
    for (int j = 0; j < 1000000; j++) {
        idx = Sunday(src_B, pattern_B);
    }
    endTime = clock();
    printf("Worst case for Sunday algorithm : \n");
    printf("index: %d\n", idx);
    printf("Sunday algorithm time: %ld\n", (endTime - startTime));

    startTime = clock();
    for (int j = 0; j < 1000000; j++) {
        size_t len = strlen(pattern_B) + 1; /* getNext writes next[plen] */
        int* next = new int[len];
        memset(next, 0, len * sizeof(int));
        getNext(pattern_B, next);
        idx = KMP(next, src_B, pattern_B);
        delete[] next; /* avoid leaking one array per iteration */
    }
    endTime = clock();
    printf("index: %d\n", idx);
    printf("KMP algorithm time: %ld\n", (endTime - startTime));

    return 0;
}

Test result:

Simulates a normal string search (matching a word in an article):
index: 456
Sunday algorithm time: 1143
index: 456
KMP algorithm time: 5808
Worst case for Sunday algorithm :
index: 331
Sunday algorithm time: 12083
index: 331
KMP algorithm time: 3256

We can see that Sunday performs much better than KMP in the normal case, but in Sunday's worst case the O(mn) behaviour shows and its performance is much worse than KMP's. We can also see that KMP's running time is stable.

Comment From: zuiderkwast

The use case makes sense. Maybe first do a search on e.g. all articles on a website (full text search) and then find the individual matches within an article.

I'm fine with adding this command, but it's not up to me. If it doesn't add much complexity to Redis, it's more likely to be accepted, I think.

Things like this can also be done in Lua.

Comment From: ncghost1

@zuiderkwast Thank you for your comment.

It can be done with Lua, but then again there are many commands that could be done with Lua (LCS, for example).

But not everyone knows how to use Lua, and providing more features out of the box is one of the things that makes more people want to use Redis.

In my opinion, string matching is used more often than LCS. I have never used LCS, but I use string matching every day to search for keywords in articles or files. That said, it is not a strictly necessary feature for Redis; I just think it is more useful than LCS.

Comment From: zhugezy

I was thinking of adding this algorithm too a few months ago; the sticky note is still lying on my table ;)

About the LCS command (it was STRALGO LCS before, see #9744): I suspect it was just a flash of thought for antirez to add that command, since adding LCS to Redis is pointless, and I believe there must be some developers eager to kick this command out of Redis ;)

About string matching: no doubt it is useful for document searching, information browsing, application development and more. However, I doubt it has much actual use in Redis, since users (mostly app developers and maintainers) may not be accustomed to storing their documents as one BIG key. I'd prefer something like VALUESCAN cursor [pattern], scanning for string values that match specific patterns. (RediSearch?)

About the algorithm / code implementation: maybe it's too early to talk about this before we confirm what we want.

Anyway, it would be better if we had real user scenarios for a 'value matching' feature in Redis.

All in all, we have these options:
1. We don't support this; it can and should be done with Lua/Functions/Modules.
2. We support something like KMP pattern key, returning something like the index(es) of the match(es).
3. We support something like VALUESCAN cursor [MATCH pattern], just like SCAN but scanning the values of string objects that match the pattern (other data types can be discussed later).
4. We support both 2 and 3.
5. More?

Comment From: ncghost1

I'd prefer something like VALUESCAN cursor [pattern], scanning for string values that match specific patterns.

The VALUESCAN command sounds great, and I think it is one way to handle big keys of string type. It would be very nice if we could support both 2 and 3. For example, when the input pattern is essentially a plain substring search with no other metacharacters, we could use KMP for faster scanning, and in other cases fall back to a SCAN-like matching method. (Actually I haven't read the implementation of SCAN yet; I just guess KMP is faster :) )

Comment From: zhugezy

(Actually I haven't read the implementation of SCAN yet; I just guess KMP is faster :) )

You can have a read of scanGenericCommand in db.c if you like. The key-pattern matching logic in it is stringmatchlen in util.c. It's just a bare brute-force implementation, since we only use it for matching short strings (keys / command names / keywords / ...) against glob-style patterns (* ? [^a-b]).
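
To give an idea of the brute-force approach, here is a minimal sketch of a glob-style matcher. It is not Redis's actual stringmatchlen (which also handles [...] classes, escaping and case-insensitive matching); it only covers * and ?, and it shows why a pattern like *abc* degrades to naive O(nm) scanning. The function name simple_glob is made up for illustration.

#include <stdio.h>

/* Simplified glob match: '*' matches any sequence, '?' matches one character.
 * Recursive brute force: on '*' we try every possible split point. */
static int simple_glob(const char *pattern, const char *string) {
    while (*pattern) {
        if (*pattern == '*') {
            while (*(pattern + 1) == '*') pattern++;   /* collapse consecutive '*' */
            if (*(pattern + 1) == '\0') return 1;      /* trailing '*' matches anything */
            for (const char *s = string; ; s++) {
                /* Try to match the rest of the pattern at every remaining position. */
                if (simple_glob(pattern + 1, s)) return 1;
                if (*s == '\0') return 0;
            }
        } else if (*pattern == '?' ? (*string == '\0') : (*pattern != *string)) {
            return 0;                                  /* mismatch or string exhausted */
        }
        pattern++;
        string++;
    }
    return *string == '\0';
}

int main(void) {
    printf("%d\n", simple_glob("*abc*", "xxabcxx"));   /* 1 */
    printf("%d\n", simple_glob("a?c", "abc"));         /* 1 */
    printf("%d\n", simple_glob("*abc*", "xxacbxx"));   /* 0 */
    return 0;
}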

Comment From: ncghost1

You can have a read at scanGenericCommand in db.c if you like. The key-pattern matching logic in it is stringmatchlen in util.c. It's just a bare brute-force implement since we only use it for short string(keys/command names/keywords/...) matching supporting glob-style patterns(*?[^a-b]).

Thanks for the explanation, I'll take a look.

Comment From: itamarhaber

Thanks for the suggestion @ncghost1!

We're always open to improving the project, even if it means adding new commands :) However, before deciding on the algorithm and implementing it, I'd love to have some feedback regarding the need for this capability from "real" users. Finding a match in a string is indeed a common operation, and there's no doubt it will be more efficient if done by the server, I'm just unclear whether there's a pain this actually solves, or if this is more of an API completeness thing.

Comment From: tom93

I know @itamarhaber asked for user feedback before deciding on the algorithm, but I'd like to correct @ncghost1's analysis of BM and Rabin-Karp:

When BM is used to find the first occurrence of the pattern, the worst-case time complexity is O(n+m) rather than O(nm). And adding the Galil rule allows all occurrences to be found in O(n+m), which beats the Sunday algorithm.

Rabin-Karp always produces correct results, even if there are hash collisions. When it finds a substring with the right hash it performs a full character-by-character comparison to check if it's a true match. So the consequence of a hash collision is merely some additional computation (leading to a worst-case time complexity of O(nm) when there are many collisions or many true matches).
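
To illustrate the collision point, here is a minimal toy Rabin-Karp sketch (not a proposed implementation; the rolling hash is a simple polynomial hash modulo 2^64). Note the explicit memcmp verification, which is what keeps the result correct even when two different substrings share a hash value:

#include <stdio.h>
#include <string.h>

/* Toy Rabin-Karp: return the index of the first occurrence of 'pat' in 'txt', or -1. */
static int rabin_karp(const char *txt, const char *pat) {
    size_t n = strlen(txt), m = strlen(pat);
    if (m == 0) return 0;
    if (n < m) return -1;

    const unsigned long long B = 131;           /* hash base */
    unsigned long long ph = 0, th = 0, pw = 1;  /* pattern hash, window hash, B^(m-1) */
    for (size_t i = 0; i < m; i++) {
        ph = ph * B + (unsigned char)pat[i];
        th = th * B + (unsigned char)txt[i];
        if (i + 1 < m) pw *= B;
    }
    for (size_t i = 0; ; i++) {
        /* Equal hashes may be a real match or a collision, so verify explicitly. */
        if (th == ph && memcmp(txt + i, pat, m) == 0)
            return (int)i;
        if (i + m >= n) break;
        /* Roll the hash: drop txt[i], append txt[i+m]. */
        th = (th - (unsigned char)txt[i] * pw) * B + (unsigned char)txt[i + m];
    }
    return -1;
}

int main(void) {
    printf("%d\n", rabin_karp("hello redis world", "redis"));  /* 6 */
    printf("%d\n", rabin_karp("hello redis world", "nope"));   /* -1 */
    return 0;
}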

Some additional comments:

It may be worth considering the two-way algorithm, which is used in glibc's implementation of strstr and memmem.
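
As a small aside (an example, not part of the proposal): on glibc the first-occurrence search is already available through strstr and memmem, so a server-side implementation could potentially lean on them, with memmem covering the binary-safe case. Note that memmem is a GNU extension and needs _GNU_SOURCE:

#define _GNU_SOURCE   /* for memmem on glibc */
#include <stdio.h>
#include <string.h>

int main(void) {
    const char src[] = "Keep faith and hope for the future.";
    const char pat[] = "hope";

    /* strstr: NUL-terminated haystack/needle, returns a pointer to the first match. */
    const char *p = strstr(src, pat);
    printf("strstr index: %ld\n", p ? (long)(p - src) : -1L);

    /* memmem: works on arbitrary byte ranges, so it is binary-safe. */
    const void *q = memmem(src, sizeof(src) - 1, pat, strlen(pat));
    printf("memmem index: %ld\n", q ? (long)((const char *)q - src) : -1L);
    return 0;
}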

The implementation of the existing SCAN command (which supports glob patterns) behaves like naive O(nm) string search on patterns such as *abc* (which searches for abc anywhere in the string). But it can be easily modified to efficiently handle pattern fragments that don't contain metacharacters, by using a standard string search algorithm as a subroutine. It is also possible to construct a finite-state machine (normally used for regular expressions, but also applies to glob-style patterns).
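
To make the last point concrete, here is a rough sketch (the helper names are made up for illustration) of the special case where the glob pattern has exactly the form *literal* with no other metacharacters. It reduces the match to a plain substring search; a full glob matcher such as stringmatchlen would still be needed as the general fallback:

#include <stdio.h>
#include <string.h>

/* If 'pattern' has the exact form "*literal*" with no other glob metacharacters,
 * copy the literal part into 'buf' and return 1; otherwise return 0. */
static int extract_star_literal(const char *pattern, char *buf, size_t bufsize) {
    size_t len = strlen(pattern);
    if (len < 2 || pattern[0] != '*' || pattern[len - 1] != '*') return 0;
    if (len - 2 + 1 > bufsize) return 0;
    for (size_t i = 1; i + 1 < len; i++) {
        if (pattern[i] == '*' || pattern[i] == '?' ||
            pattern[i] == '[' || pattern[i] == '\\') return 0;
        buf[i - 1] = pattern[i];
    }
    buf[len - 2] = '\0';
    return 1;
}

/* Match 'value' against 'pattern', using plain substring search when possible. */
static int value_matches(const char *pattern, const char *value) {
    char literal[256];
    if (extract_star_literal(pattern, literal, sizeof(literal)))
        return strstr(value, literal) != NULL;   /* fast path: substring search */
    /* General case: fall back to a full glob matcher (e.g. stringmatchlen). */
    return 0; /* placeholder for the fallback */
}

int main(void) {
    printf("%d\n", value_matches("*redis*", "I use redis every day"));  /* 1 */
    printf("%d\n", value_matches("*redis*", "no match here"));          /* 0 */
    return 0;
}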