C# Networking Deep Dive with io_uring part 5 - Threadpool Rant

#csharp #linux #performance #networking

Part 5 was going to be about integrating with Kestrel. Instead, it is going to be a rant about io_uring and the thread pool.

This story doesn't begin with io_uring. To be honest with you, I love epoll (plot twist :o), and the reason I have been experimenting with and researching io_uring for 7 months now is to understand whether it truly is a better alternative to epoll for networking.

Now don't get me wrong, io_uring is great and I love so many things about it. It was originally created for disk/file I/O and greatly excels at it, but in my humble opinion it can be a mismatch for typical networking/back-end applications.

So at this point you're probably wondering what the hell is going on, and why I am saying this when so many people claim io_uring is their Coca-Cola in the desert. For benchmark enthusiasts who want to push and squeeze numbers, io_uring is indeed fast, but that is exactly where it shines: micro benchmarks.

io_uring is a perfect match for the reactor pattern we have been exploring in this series. It performs especially well when the reactor is pinned to a thread for its entire lifetime, and that again matches perfectly with how IValueTaskSource works. I haven't been sharing any benchmarks with you, but let me tell you, Minima is fast, frighteningly fast. I'll save the numbers for a future part.

This speed, however, comes with a shackle.

io_uring's speed in this model (Minima's multi-reactor) is conditional. Two things must hold simultaneously: the reactor is the sole submitter to the ring (SINGLE_ISSUER with DEFER_TASKRUN), and the handler runs inline on the reactor thread, so that the IValueTaskSource resumes without leaving it. Both hold only as long as the handler never leaves the reactor, but guess what: the entire .NET backend world is built on leaving it. The thread pool, async/await resuming off-thread and, of course, Kestrel, whose model is "hand the connection to the pool". So the moment we await any real async work, the handler continues off the reactor thread, the response can't be submitted from that thread, and we are forced into a cross-thread handoff. And once multiple threads are involved, all the usual race conditions and deadlock possibilities arise, and fixing them is not free.
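To make "leaving the reactor" concrete, here is a minimal sketch in plain .NET with no io_uring involved. `OffThreadDemo`, its dedicated "reactor" thread and the `Task.Delay` standing in for real async work are all assumptions made for illustration; this is not Minima's code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class OffThreadDemo
{
    // Starts a handler inline on a dedicated "reactor" thread and reports
    // which thread the code after the first real await resumed on.
    public static (int ReactorThreadId, int ResumedThreadId, bool ResumedOnPool) Run()
    {
        var resumed = new TaskCompletionSource<(int, bool)>();
        var release = new ManualResetEventSlim();
        int reactorId = 0;

        var reactor = new Thread(() =>
        {
            reactorId = Environment.CurrentManagedThreadId;
            _ = HandleAsync(resumed); // runs inline until the first incomplete await
            release.Wait();           // keep the reactor thread alive, like a real loop
        });
        reactor.Start();

        var (resumedId, onPool) = resumed.Task.GetAwaiter().GetResult();
        release.Set();
        reactor.Join();
        return (reactorId, resumedId, onPool);
    }

    private static async Task HandleAsync(TaskCompletionSource<(int, bool)> resumed)
    {
        await Task.Delay(20); // stand-in for real async work (database, HTTP call, ...)
        // There is no SynchronizationContext on the reactor thread, so this
        // continuation runs on a thread pool thread, not on the reactor.
        resumed.SetResult((Environment.CurrentManagedThreadId,
                           Thread.CurrentThread.IsThreadPoolThread));
    }
}
```

Calling `OffThreadDemo.Run()` should report a resumed thread that differs from the reactor thread and is a pool thread. That is exactly the point where a single-issuer ring stops being directly usable by the handler.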
**The reactor deadlock**

Putting an SQE in the ring and bumping the SQ tail does nothing by itself. Without SQPOLL, the kernel only looks at the SQ when io_uring_enter is called. This is an explicit syscall, and in the Minima model only the reactor can make it. The reactor's wake-up, however, is gated on completions (CQEs): the loop blocks in io_uring_enter(to_submit, min_complete=1, GETEVENTS), which submits whatever is pending and sleeps in the kernel until at least one CQE is available.

Let's dissect:

1. All connections are idle (keep-alive, no in-flight requests). Every reactor is asleep inside io_uring_enter, waiting for any completion.
2. A handler finishes on a pool thread and needs to send a response on connection C (owned by reactor R). It produces a SEND SQE and writes it into R's ring, bumping the tail.
3. But it does not call io_uring_enter (single issuer: only R may submit). The SEND SQE now sits in the ring, unsubmitted.
4. R won't run again until a CQE wakes it. The only CQE that would wake it is the completion of that SEND, which is never submitted, because R is the one that submits and R is asleep. If C was the only connection with work, no other completion is coming.

To summarize: the pool thread is waiting for the reactor to submit its SQE, and the reactor is waiting for a completion that only that submission would produce. Each is waiting on the other. Deadlock.
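The sequence above can be modelled without a real ring. In this sketch a `ConcurrentQueue` stands in for the submission queue and a `SemaphoreSlim` for "a CQE is available"; `MissedWakeDemo` and the `probe` parameter are hypothetical names. The real io_uring_enter call has no timeout, and the probe exists only so the demo can return and report what happened.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public static class MissedWakeDemo
{
    // Runs one reactor iteration and returns whether a completion woke it,
    // plus how many SQEs are still sitting unsubmitted in the queue.
    public static (bool WokenByCompletion, int PendingSqes) Run(TimeSpan probe)
    {
        var sq = new ConcurrentQueue<string>(); // stand-in for the submission queue
        var cq = new SemaphoreSlim(0);          // stand-in for "a CQE is available"
        var asleep = new ManualResetEventSlim();
        bool woken = false;

        var reactor = new Thread(() =>
        {
            // "io_uring_enter": submit whatever is pending (nothing yet)...
            while (sq.TryDequeue(out _)) cq.Release(); // a submitted op would complete
            asleep.Set();
            // ...then sleep until at least one CQE arrives.
            woken = cq.Wait(probe);
        });
        reactor.Start();

        asleep.Wait();                      // all connections idle, reactor asleep
        sq.Enqueue("SEND on connection C"); // pool thread writes the SQE, bumps the tail
        // Single issuer: the pool thread must not submit, so it stops here.

        reactor.Join();
        return (woken, sq.Count);
    }
}
```

`MissedWakeDemo.Run(TimeSpan.FromMilliseconds(200))` returns `(false, 1)`: nothing woke the reactor, and the SEND is still waiting in the queue.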

There are, however, ways to avoid this deadlock, for example by using a different syscall that accepts a timeout: if no CQE arrives, it simply times out and the reactor loop spins once more. This is the solution zerg's model uses, and it comes with the obvious issue that we may hit this timeout too often. If the timeout is too large, latency takes a hit; if it is too small, we spin too much when traffic is low, consuming CPU for no reason.
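A quick back-of-the-envelope helper makes the trade-off tangible. `TimeoutTradeoff` is a hypothetical name, and the 1 ms and 100 ms values used below are arbitrary examples, not zerg's actual settings.

```csharp
using System;

public static class TimeoutTradeoff
{
    // Wake-ups per second for one fully idle reactor that times out every `timeout`.
    public static long IdleWakeupsPerSecond(TimeSpan timeout) =>
        TimeSpan.TicksPerSecond / timeout.Ticks;

    // Worst case: the SQE is queued right after the reactor went to sleep,
    // so it waits one full timeout before being submitted.
    public static TimeSpan WorstCaseAddedLatency(TimeSpan timeout) => timeout;
}
```

With a 1 ms timeout an idle reactor wakes 1000 times per second and a response waits up to 1 ms extra. With 100 ms it wakes only 10 times per second, but a response can sit in the ring for up to 100 ms.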

Minima's solution to this problem is rather more elegant: an eventfd wake. After enqueuing work, the pool thread handler makes one extra syscall, a write to the eventfd, which produces an artificial wake CQE and unblocks the reactor. This solution comes with an ironic cost though: extra syscalls! The very thing we were trying to get away from by leaving epoll.
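Here is the shape of that handoff as a plain .NET model. The `SemaphoreSlim` plays the role of the eventfd plus the wake CQE, and `WakeDemo` is a hypothetical name. It is a sketch of the idea, not Minima's implementation, and it uses neither a real eventfd nor a real ring.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public static class WakeDemo
{
    // Returns how many SQEs the reactor submitted before it was stopped.
    public static int Run()
    {
        var sq = new ConcurrentQueue<string>(); // stand-in for the submission queue
        var wake = new SemaphoreSlim(0);        // stand-in for CQEs, including the wake CQE
        var asleep = new ManualResetEventSlim();
        var submittedOne = new ManualResetEventSlim();
        int submitted = 0;
        bool stop = false;

        var reactor = new Thread(() =>
        {
            while (!Volatile.Read(ref stop))
            {
                while (sq.TryDequeue(out _)) // submit whatever is pending
                {
                    Interlocked.Increment(ref submitted);
                    submittedOne.Set();
                }
                asleep.Set();
                wake.Wait();                 // block with no timeout until woken
            }
        });
        reactor.Start();
        asleep.Wait();

        sq.Enqueue("SEND on connection C"); // pool thread enqueues the work...
        wake.Release();                     // ...then pays one extra call to wake the reactor

        submittedOne.Wait();
        Volatile.Write(ref stop, true);
        wake.Release();                     // wake once more so the loop can exit
        reactor.Join();
        return Volatile.Read(ref submitted);
    }
}
```

`WakeDemo.Run()` returns 1: the SEND is submitted without any timeout, at the price of one extra signalling call per handoff.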

Can SQPOLL solve this problem?

On paper it does: a kernel thread polls the submission queue, so the pool thread's SQE is picked up and submitted without the reactor having to call io_uring_enter. No sleeping reactor to wake up. But the irony kicks in again: the poller itself may go to sleep, and then it is the SQPOLL poller that needs a wake-up. Same problem. On top of that, SQPOLL and DEFER_TASKRUN are mutually exclusive, so we surrender the very completion batching that makes our model fast to begin with. So SQPOLL doesn't remove the wake, it relocates it, burns a kernel thread per reactor and costs us DEFER_TASKRUN.
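The relocated wake looks roughly like this on the submitter side. The two constants are the values from the io_uring UAPI header. Everything else is an assumption for illustration: `SqPollSubmit`, the `sqFlags` parameter (standing in for the flags word in the mapped SQ ring) and the `enter` delegate (standing in for a P/Invoke wrapper around io_uring_enter) are hypothetical.

```csharp
using System;
using System.Threading;

public static class SqPollSubmit
{
    // Values from the io_uring UAPI header.
    public const uint IORING_SQ_NEED_WAKEUP = 1u << 0;  // set in the SQ ring flags when the poller sleeps
    public const uint IORING_ENTER_SQ_WAKEUP = 1u << 1; // io_uring_enter flag that wakes it

    // Called after an SQE has been written and the new tail published.
    // A real implementation also needs a full memory barrier between
    // publishing the tail and reading the flags.
    // Returns true when a wake-up syscall was needed.
    public static bool KickPollerIfAsleep(ref uint sqFlags, Action<uint> enter)
    {
        if ((Volatile.Read(ref sqFlags) & IORING_SQ_NEED_WAKEUP) == 0)
            return false;              // poller is awake and will pick the SQE up itself

        enter(IORING_ENTER_SQ_WAKEUP); // poller went idle: we pay a syscall after all
        return true;
    }
}
```

When traffic is low enough for the poller to go idle, every handoff ends up in the second branch, which is the same wake-up cost as before under a different name.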

We could, however, configure SQPOLL so that it never sleeps, or add a more complex mechanism to wake it automatically. I might explore that in a future part, but to be honest, io_uring performance with SQPOLL has proven subpar for me, so I'd rather just go with epoll instead.

Now, epoll sidesteps all of this because it never separates the I/O from its submission. send and recv are plain syscalls that any thread can call directly: a much cleaner implementation, no hacks.
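The send side of that model is easy to show with the standard `Socket` class. This is only a sketch: `DirectSendDemo` is a hypothetical name, the loopback pair stands in for a real client connection, and the epoll readiness loop is left out entirely. The point is just that the pool thread calls `Send` itself and nobody has to be woken to submit it.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

public static class DirectSendDemo
{
    public static (string Received, bool SentFromPool) Run()
    {
        using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        listener.Bind(new IPEndPoint(IPAddress.Loopback, 0)); // port 0: let the OS pick one
        listener.Listen(1);
        var endpoint = listener.LocalEndPoint
            ?? throw new InvalidOperationException("listener is not bound");

        using var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        client.Connect(endpoint);
        using var server = listener.Accept();

        // The "handler" finishes on a pool thread and sends the response itself.
        bool fromPool = Task.Run(() =>
        {
            server.Send(Encoding.ASCII.GetBytes("pong"));
            return Thread.CurrentThread.IsThreadPoolThread;
        }).GetAwaiter().GetResult();

        // Read the 4-byte response back on the client side.
        var buffer = new byte[4];
        int read = 0;
        while (read < buffer.Length)
        {
            int n = client.Receive(buffer, read, buffer.Length - read, SocketFlags.None);
            if (n == 0) break; // peer closed the connection
            read += n;
        }
        return (Encoding.ASCII.GetString(buffer, 0, read), fromPool);
    }
}
```

`DirectSendDemo.Run()` should return the text `pong` together with `true` for "sent from a pool thread".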

So, how much faster than epoll is io_uring really? On a micro benchmark where the workload isn't delegated to the thread pool, it can be a tad faster, 5-10% in my own benchmarks, but on a real workload it is as fast at best. Given everything else io_uring brings with it, such as security issues, special kernel access, the need for recent kernel versions and implementation complexity, as of today it ain't worth it in my opinion. I might change my mind with more research.

Regardless, io_uring has its place if we want to build applications where the handler/endpoint logic is lightweight, such as WebSockets or certain synchronous workloads, the kind that software like Redis or nginx thrives on (and it is not by chance that these use the multi-reactor architecture).

For the next parts I will continue with Minima and explore the reactor pattern and the send branch. Even though io_uring may not be a good fit for Kestrel, it can still deliver exceptional performance in certain types of applications.

DEV Community
