Hi,

I have pushed two versions of this work: v1 (uses the IB verbs API directly): https://github.com/redis/redis/pull/9161 and v2 (uses a socket-like API, rsocket): https://github.com/redis/redis/pull/9270

Following the discussion in the PRs and by email, @yossigo suggested that I drive the work towards a Redis-over-RDMA spec, so I force-pushed RDMA.md to my branch. There I'm trying to introduce and implement the protocol for "Redis over RDMA".

Hi @rleon @jgunthorpe @dledford, given your professional experience with RDMA, I'd like to invite you to join this conversation. I'd appreciate any suggestions you may have. :)

Comment From: pizhenwei

Hi, @yossigo

I'm trying to fully abstract the connection types so that the TLS driver can be built as a shared library and loaded dynamically: https://github.com/redis/redis/pull/9320

As a next step, the RDMA driver could follow the same pattern as TLS and also work as a shared library.
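
To make the idea concrete, here is a minimal sketch of what such a pluggable connection type and its shared-library entry point could look like. The struct layout and the names conn_type_register and redis_conn_type_init are illustrative assumptions, not the actual Redis API:

```c
/* Minimal sketch (not the actual Redis API) of a pluggable connection type
 * that a shared-library driver could register with the core. */
#include <stddef.h>

typedef struct connection connection;           /* opaque per-client handle */

typedef struct connection_type {
    const char *name;                            /* "socket", "tls", "rdma", ... */
    int  (*listen)(const char *addr, int port);
    connection *(*conn_accept)(int listen_fd);
    int  (*conn_read)(connection *c, void *buf, size_t len);
    int  (*conn_write)(connection *c, const void *buf, size_t len);
    void (*conn_close)(connection *c);
} connection_type;

#define MAX_CONN_TYPES 8
static const connection_type *conn_types[MAX_CONN_TYPES];
static int conn_type_count;

/* Called by a driver (built in or dlopen()ed) to make itself available. */
int conn_type_register(const connection_type *ct) {
    if (conn_type_count == MAX_CONN_TYPES) return -1;
    conn_types[conn_type_count++] = ct;
    return 0;
}

/* The symbol a shared-library driver would export; the core resolves it
 * after loading the library and calls it once. */
int redis_conn_type_init(void) {
    static const connection_type rdma_type = {
        .name = "rdma",
        /* .listen, .conn_accept, ... filled in by the RDMA driver */
    };
    return conn_type_register(&rdma_type);
}
```

With something like this, TLS and RDMA would only differ in which shared object exports the entry point; the core never needs transport-specific code.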

Comment From: yoav-steinberg

@yossigo @pizhenwei I looked a bit at the branch. This looks good. A few questions:

I see that the abstraction layer abstracts ConnectionType but we still have hard coded:

```c
#define CONN_TYPE_SOCKET            0
#define CONN_TYPE_TLS               1
#define CONN_TYPE_RDMA              2
```

And the code is written as if all three are available (not even #ifdefed out). We also have all relevant RDMA/TLS options in the configuration system. To me this seems like there isn't a real abstraction layer here. We could have said that redis doesn't know anything about its networking layer, and by specifying the shared object at runtime we set up our networking transparently without knowing if it's RDMA/TLS/regular socket. The question is what are we aiming for?

I think we should either aim for a fully transparent networking layer where the server code receives a bind address(es) and port and a shared object and uses the networking implementation in the shared object without being aware of what it actually does. This approach can also work with static linking by choosing either socket.c/openssl.c/gnutls.c/rdma.c/etc. during build time.
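
For illustration only (not a proposal for the actual code), runtime selection could be as simple as dlopen()ing the shared object and resolving one well-known entry point; the symbol name used here is made up:

```c
/* Rough sketch of a fully transparent networking layer: the server only
 * knows a bind address, a port, and the path of a shared object that
 * implements the connection interface.  "redis_conn_type_init" is an
 * assumed convention, not an existing symbol. */
#include <dlfcn.h>
#include <stdio.h>

int load_conn_type(const char *so_path) {
    void *handle = dlopen(so_path, RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        fprintf(stderr, "dlopen(%s): %s\n", so_path, dlerror());
        return -1;
    }

    /* The core only knows the entry-point symbol, never the transport. */
    int (*init)(void) = (int (*)(void))dlsym(handle, "redis_conn_type_init");
    if (!init) {
        fprintf(stderr, "missing entry point: %s\n", dlerror());
        dlclose(handle);
        return -1;
    }
    return init();   /* the driver registers itself with the core */
}
```

The static-linking variant would simply call the same init function directly for whichever of socket.c/openssl.c/gnutls.c/rdma.c was compiled in.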

There's also the question of handling multiple networking implementations simultaneously, but I'm not sure this is really a requirement.

Another question I have regarding the RDMA implementation (since I know nearly nothing about RDMA) is: does this work transparently with our ae event system? Is all RDMA handling based on file descriptors and epoll? From the code it seems this is the case. Even the lack of POLLOUT seems to be handled nicely, but I want to be sure.

Lastly, my question is how we can test this. I do see that both azure and ec2 have some sort of RDMA support, but it'd be good if there were some simple way to test this locally (emulation?). And if not, what are the procedures to get this running on some ubiquitous cloud platform?

Comment From: dledford

> I think we should either aim for a fully transparent networking layer where the server code receives a bind address(es) and port and a shared object and uses the networking implementation in the shared object without being aware of what it actually does.

That isn't necessarily always possible (or even desirable) with RDMA. The basic issue is that sockets networking is what I call "data first, buffer second" while RDMA is the opposite "buffer first, data second". In a nutshell, an RDMA queue pair is like having your own private queue on a NIC. You need to post receive buffers to it before the data arrives. Sockets, of course, you wait for the kernel to tell you data has already arrived and then you allocate a buffer and call the kernel to put the data in it. Depending on how things are abstracted out (and I admit I haven't looked at the patch), the implementation could either do very well (if it doesn't copy data around) or perform rather lackluster (if it tries to emulate the data first behavior of sockets and ends up copying data around).
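
To make the contrast concrete, here is an illustrative verbs-level sketch (not taken from the patch; QP/CQ setup and error handling are omitted) of the "buffer first, data second" model: the receive buffer is registered and posted before the peer sends, and a completion later reports that data has already landed in it.

```c
/* Sketch of the "buffer first, data second" model with the verbs API. */
#include <infiniband/verbs.h>
#include <stdint.h>

#define RECV_SIZE 4096

/* Register the buffer once at setup time so the NIC can DMA into it. */
struct ibv_mr *register_recv_buffer(struct ibv_pd *pd, char *buf) {
    return ibv_reg_mr(pd, buf, RECV_SIZE, IBV_ACCESS_LOCAL_WRITE);
}

/* Post the buffer to the queue pair; it must be re-posted after each use
 * to keep the receive queue stocked ahead of incoming data. */
int post_recv(struct ibv_qp *qp, struct ibv_mr *mr, char *buf) {
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = RECV_SIZE,
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr = { .wr_id = (uintptr_t)buf, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad_wr = NULL;
    return ibv_post_recv(qp, &wr, &bad_wr);
}

/* Poll the completion queue; on success the data is already sitting in the
 * buffer we posted earlier, with no extra copy if it is consumed in place. */
int poll_one_recv(struct ibv_cq *cq, char **data) {
    struct ibv_wc wc;
    if (ibv_poll_cq(cq, 1, &wc) <= 0 || wc.status != IBV_WC_SUCCESS) return -1;
    *data = (char *)(uintptr_t)wc.wr_id;
    return (int)wc.byte_len;
}
```

The performance question is whether the abstraction lets the parser read directly out of these posted buffers, or whether it copies the data into a socket-style "allocate after arrival" buffer.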

> Lastly, my question is how we can test this. I do see that both azure and ec2 have some sort of RDMA support, but it'd be good if there were some simple way to test this locally (emulation?). And if not, what are the procedures to get this running on some ubiquitous cloud platform?

That issue I can help with. Amazon only has their own proprietary RDMA capability, and it's not a complete implementation of the standards, only a partial one. Azure only has Mellanox InfiniBand as a possibility. You can also test locally using the soft-RoCE driver (rxe) or the soft-iWARP driver (siw), although these drivers are not necessarily known for being overly robust (the siw driver is better than the rxe driver).

However, while we haven't announced that it's live publicly yet (because it is in the final testing stages at the moment), we, meaning the OpenFabrics Alliance, of which I am the Chair, have a cluster designed for exactly this purpose. It has InfiniBand, OmniPath Architecture, iWARP, and RoCE fabrics at the moment, possibly more (Slingshot and Gen-Z) in the future. Normally, a person gets an account through their company, which is a paid member of the OFA. However, we specifically made allowances to grant free accounts to individuals who are not associated with an RDMA-related company and are part of an upstream community that needs access to RDMA hardware. That sounds exactly like this particular situation.

The docs for the cluster are here and a table of the RDMA connections in the cluster can be found here. In addition to doing one-off testing manually, in the future we can do CI testing on your upstream repo if desired (that's phase 2 of the project, which we will start as soon as phase 1, the buildout of the initial cluster, is complete).

Comment From: pizhenwei

> @yossigo @pizhenwei I looked a bit at the branch. This looks good. A few questions:
>
> I see that the abstraction layer abstracts ConnectionType but we still have hard coded:
>
> ```c
> #define CONN_TYPE_SOCKET            0
> #define CONN_TYPE_TLS               1
> #define CONN_TYPE_RDMA              2
> ```
>
> And the code is written as if all three are available (not even #ifdefed out). We also have all relevant RDMA/TLS options in the configuration system. To me this seems like there isn't a real abstraction layer here. We could have said that redis doesn't know anything about its networking layer, and by specifying the shared object at runtime we set up our networking transparently without knowing if it's RDMA/TLS/regular socket. The question is what are we aiming for?
>
> I think we should either aim for a fully transparent networking layer where the server code receives a bind address(es) and port and a shared object and uses the networking implementation in the shared object without being aware of what it actually does. This approach can also work with static linking by choosing either socket.c/openssl.c/gnutls.c/rdma.c/etc. during build time.
>
> There's also the question of handling multiple networking implementations simultaneously, but I'm not sure this is really a requirement.

The first version linked all the connection types into a linked list and looked them up by name, so the CONN_TYPE_XXX definitions could be removed, but @yossigo suggested storing the types in an array for better performance. If I misunderstood this, please correct me.

Handling multiple networking implementations simultaneously is already supported; it seems we don't need to pay any extra cost for it. The two lookup approaches are sketched below.
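
For illustration only (not the actual Redis code), the trade-off between the two approaches looks roughly like this:

```c
/* Illustration of the two lookup strategies discussed above: searching a
 * linked list by name versus indexing an array with a CONN_TYPE_* constant. */
#include <stddef.h>
#include <string.h>

typedef struct conn_type {
    const char *name;
    struct conn_type *next;   /* used only by the linked-list variant */
} conn_type;

/* Variant 1: no hard-coded constants, types register themselves, but each
 * lookup is an O(n) walk with string compares. */
static conn_type *types_head;
conn_type *lookup_by_name(const char *name) {
    for (conn_type *t = types_head; t; t = t->next)
        if (strcmp(t->name, name) == 0) return t;
    return NULL;
}

/* Variant 2: O(1) array indexing on the hot path, but the indices must stay
 * hard-coded (or be assigned once at registration time). */
enum { CONN_TYPE_SOCKET, CONN_TYPE_TLS, CONN_TYPE_RDMA, CONN_TYPE_MAX };
static conn_type *types_by_index[CONN_TYPE_MAX];
conn_type *lookup_by_index(int type) {
    return (type >= 0 && type < CONN_TYPE_MAX) ? types_by_index[type] : NULL;
}
```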

> Another question I have regarding the RDMA implementation (since I know nearly nothing about RDMA) is: does this work transparently with our ae event system? Is all RDMA handling based on file descriptors and epoll? From the code it seems this is the case. Even the lack of POLLOUT seems to be handled nicely, but I want to be sure.

Thanks to the abstraction work already done (the '.ae_handler' method), 'process_pending_data' is abstracted as well. With these two functions, RDMA also works fine without 'POLLOUT'. However, the hiredis side cannot implement the async API (there is no ae wrapper).

> Lastly, my question is how we can test this. I do see that both azure and ec2 have some sort of RDMA support, but it'd be good if there were some simple way to test this locally (emulation?). And if not, what are the procedures to get this running on some ubiquitous cloud platform?

As @dledford suggested, we can test RDMA with RXE. Run a command like the following:

Older systems: rxe_cfg start

Newer systems: rdma link add rxe_eth0 type rxe netdev eth0

Check whether RDMA is supported: rdma res show

I think RXE is only suitable for functional testing; performance testing still needs a hardware environment.

Comment From: yossigo

@yoav-steinberg I think there are many constraints on how far we can get with a fully abstracted networking interface.

What I had in mind here followed a more pragmatic approach: we already have the connection interface, which is pretty abstract. So, with some cleanups and dynamic registration we could support additional types of connections, assuming they play reasonably well with:

  • The existing event loop
  • Stream socket semantics we use

Adding a custom connection listener that accepts custom client connections under these conditions should not be difficult. Doing replication or cluster bus on top of a custom connection is potentially more difficult because endpoint address representation is hard coded into the protocols and assumes TCP (IP:PORT).

TLS is a bit special, for a few reasons:

  • It has a lot of configuration.
  • The interface is less abstract and makes some more implementation-related assumptions (e.g. tlsHasPendingData and tlsProcessPendingData).
  • Support for it is hard coded in Redis Cluster and the cluster bus protocol.
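
To make the second point more concrete: a transport with internal buffering (TLS today, and potentially RDMA) can hold bytes that will never raise another readable event on the fd, which is why hooks like tlsHasPendingData/tlsProcessPendingData exist. A rough, hypothetical sketch of how an event loop might use such hooks (names and structure are illustrative, not the actual Redis server loop):

```c
/* Hypothetical event-loop iteration: a buffering transport reports pending
 * bytes so the loop processes them and polls with a zero timeout instead
 * of blocking while data is still waiting inside the transport. */
typedef struct conn_hooks {
    int (*has_pending_data)(void);       /* bytes buffered inside the transport? */
    int (*process_pending_data)(void);   /* feed them to the command parser      */
} conn_hooks;

void event_loop_iteration(conn_hooks *hooks, int nhooks,
                          void (*wait_for_events)(int timeout_ms)) {
    int still_pending = 0;
    for (int i = 0; i < nhooks; i++) {
        if (hooks[i].has_pending_data && hooks[i].has_pending_data()) {
            hooks[i].process_pending_data();
            /* data may remain if a command was only partially received */
            still_pending |= hooks[i].has_pending_data();
        }
    }
    /* never block while a transport still holds unprocessed input */
    wait_for_events(still_pending ? 0 : -1);
}
```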

Comment From: pizhenwei

> I've been following this issue (and the related PRs linked above) for some time. I work for the former Cray team (now HPE) and have access to many networking implementations (IB, RoCE, Aries, Slingshot, etc). My team and I are happy to test this implementation on our partner systems. Longer term I think a cloud based approach using RXE for a CI will work best, but if you are interested in the performance at scale or compatibility on particular networks, we can help.

> Long term, assuming good performance from the implementation, we would like to build client side RDMA capability into our Redis(AI) Clients https://github.com/CrayLabs/SmartRedis

Hi, the Redis over RDMA protocol is still in development. I have tested it with hiredis and in redis-cluster mode, and both work fine. If anything wrong or unexpected comes up while building the client side, please let me know. Let's discuss it in this issue, or contact me at pizhenwei@bytedance.com. I tested this with redis-benchmark and got a 3x improvement (about 500K QPS), and I look forward to hearing good news from you.

Comment From: Spartee

I have built the redis-server on your feature-rdma-v2 branch on an IB cluster, but the redis-cli and redis-benchmark commands don't seem to support the --rdma argument as listed in the README.

These do seem to be implemented in the feature-rdma branch; however, that implementation errors out when linking the CLI:

redis/src/redis-cli.c:876: undefined reference to `redisConnectRdma'

Grepping through the codebase, I can't find that function implemented anywhere. Am I missing something?

Happy to also discuss this over email if people would like this thread to stay higher level.

Comment From: pizhenwei

> I have built the redis-server on your feature-rdma-v2 branch on an IB cluster, but the redis-cli and redis-benchmark commands don't seem to support the --rdma argument as listed in the README.
>
> These do seem to be implemented in the feature-rdma branch; however, that implementation errors out when linking the CLI:
>
> redis/src/redis-cli.c:876: undefined reference to `redisConnectRdma'
>
> Grepping through the codebase, I can't find that function implemented anywhere. Am I missing something?
>
> Happy to also discuss this over email if people would like this thread to stay higher level.

Sorry about this issue. The hiredis side will need to go in through a PR to the hiredis project, so there is no client support in the feature-rdma-v2 branch. I just pushed a new branch with client-side RDMA support; could you please try that branch?

During testing, pinning the server to a CPU is suggested; that gives better and more stable performance.

Finally, I have tested this feature only on RXE (soft-RoCE) and RoCE (a Mellanox ConnectX-5 NIC), so I don't know whether there are any issues with the connect/bind operations on an IB cluster. You are welcome to work on this feature together!
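
For anyone wiring up a client, here is a sketch of what usage could look like, assuming the RDMA-enabled hiredis branch exposes a redisConnectRdma() entry point shaped like redisConnect(); the exact header and signature may differ in the branch:

```c
/* Hypothetical client-side usage; redisConnectRdma() is assumed to come
 * from the RDMA-enabled hiredis branch and mirror redisConnect(). */
#include <stdio.h>
#include <hiredis/hiredis.h>

redisContext *redisConnectRdma(const char *ip, int port);  /* from the branch */

int main(void) {
    redisContext *c = redisConnectRdma("192.168.1.10", 6379);  /* example address */
    if (!c || c->err) {
        fprintf(stderr, "RDMA connect failed: %s\n", c ? c->errstr : "out of memory");
        return 1;
    }
    redisReply *reply = redisCommand(c, "PING");
    if (reply) {
        printf("PING -> %s\n", reply->str ? reply->str : "(nil)");
        freeReplyObject(reply);
    }
    redisFree(c);
    return 0;
}
```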

Comment From: yossigo

A quick note to let you know @yoav-steinberg @oranagra and myself discussed this work today. We realize it's not getting enough attention, and the reason is strictly our lack of bandwidth - not to be confused with lack of interest.

We currently have our hands full focusing on the upcoming Redis 7 release, and need to apply strict prioritization to everything we do - but we do plan to circle back to this after that.

Comment From: pizhenwei

Contributions are welcome! According to Yossi's note, the Redis core team is busy with the Redis 7 release. After Redis 7, I will rebase my branch; it may take a little more work to resolve conflicts (if there are any).

Comment From: dledford

@pizhenwei Where is the latest code related to this? I'd like to build and test on it some. Is it this one: https://github.com/pizhenwei/redis/tree/feature-rdma-v2-with-cli

Comment From: Spartee

That's the one I'm working with currently.

Comment From: pizhenwei

@Spartee @dledford

We have moved a step forward recently: Redis now supports loading a connection type from a shared library, added by a PR. A new connection type can be loaded dynamically without any Redis source code changes. Based on that feature, I have opened a new PR.

Comment From: Spartee

@pizhenwei great work! I've been actively following the thread. Really excited to see this go in.

Comment From: pizhenwei

> @pizhenwei great work! I've been actively following the thread. Really excited to see this go in.

Hi @Spartee, I would love to hear any suggestions or feedback about this feature, its design, and its performance...