DEV Community: Diogo Martins

C# Networking Deep Dive with io_uring part 6 - Numbers

Diogo Martins — Wed, 27 May 2026 14:57:15 +0000

For part 6, let's run some benchmarks.

What is going to be benchmarked

io_uring read+write with IVTS reactor inline continuations (RunAsynchronousContinuation = false)
io_uring read+write without IVTS reactor inline continuations (threadpool) (RunAsynchronousContinuation = true)
io_uring read + libc send write without IVTS reactor inline continuations (threadpool) (RunAsynchronousContinuation = true)
epoll read+write with IVTS reactor inline continuations
epoll read+write without IVTS reactor inline continuations
System.Net.Sockets (Kestrel stock) - epoll threadpool
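
The "inline" versus "threadpool" variants differ only in where the awaiting handler resumes. Here is a minimal sketch of that difference, assuming the series' RunAsynchronousContinuation setting maps onto .NET's ManualResetValueTaskSourceCore<T>.RunContinuationsAsynchronously (the posts do not confirm this). RecvSource and ContinuationDemo are hypothetical names for illustration, not Minima's types, and a plain Thread stands in for the reactor.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Sources;

// Hypothetical reusable source; not Minima's actual type.
public sealed class RecvSource : IValueTaskSource<int>
{
    private ManualResetValueTaskSourceCore<int> _core; // mutable struct: must not be readonly

    public RecvSource(bool runContinuationsAsynchronously) =>
        _core.RunContinuationsAsynchronously = runContinuationsAsynchronously;

    public ValueTask<int> ReadAsync() => new ValueTask<int>(this, _core.Version);
    public void Complete(int bytes) => _core.SetResult(bytes);

    public int GetResult(short token) => _core.GetResult(token);
    public ValueTaskSourceStatus GetStatus(short token) => _core.GetStatus(token);
    public void OnCompleted(Action<object?> continuation, object? state, short token,
        ValueTaskSourceOnCompletedFlags flags) => _core.OnCompleted(continuation, state, token, flags);
}

public static class ContinuationDemo
{
    // Returns true when the handler resumed on the thread that completed the read.
    public static bool ResumesOnReactorThread(bool runContinuationsAsynchronously)
    {
        var source = new RecvSource(runContinuationsAsynchronously);
        using var resumed = new ManualResetEventSlim();
        int resumedOn = -1, reactorId = -1;

        async Task Handler()
        {
            await source.ReadAsync().ConfigureAwait(false);
            resumedOn = Environment.CurrentManagedThreadId;
            resumed.Set();
        }

        Task handler = Handler(); // runs up to the await and registers the continuation

        var reactor = new Thread(() =>
        {
            reactorId = Environment.CurrentManagedThreadId;
            source.Complete(42); // stands in for the reactor dispatching a recv CQE
            resumed.Wait();      // keep the thread alive until the handler has resumed
        });
        reactor.Start();
        reactor.Join();
        handler.GetAwaiter().GetResult();
        return resumedOn == reactorId;
    }
}
```

With the flag off, the handler's continuation runs inside Complete on the reactor thread. With it on, the continuation is queued to the thread pool and the reactor thread is free again straight away.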

Tests

(No pipelining)

Synchronous lightweight plaintext "OK" response.
Asynchronous workload to serialize a very large object.

Hardware

i9 14900k
64GB DDR5 6400MHz
Linux Kernel 6.17.0-22-generic

Tests are done through localhost loopback (no NIC influence)
MTU 1500

Load generators

HTTP/1.1, no TLS

wrk (epoll)
gcannon (io_uring)

io_uring read+write with IVTS reactor inline continuations

This is the exact model explored throughout the series. It is expected to deliver high performance on the synchronous test.

Reactor count: 12

Sync Workload

wrk -c 512 -t18 -d5s http://localhost:8080/

18 threads and 512 connections
Thread Stats   Avg      Stdev     Max   +/- Stdev
  Latency   121.45us  178.81us   8.32ms   99.05%
  Req/Sec   201.31k    40.61k  350.92k    73.09%
  18299278 requests in 5.10s, 1.12GB read
Requests/sec: 3588059.25
Transfer/sec:    225.84MB

gcannon http://localhost:8080/ -c 512 -t 16 -d 5

gcannon v0.5.3
Target:    localhost:8080/
Threads:   16
Conns:     512 (32/thread)
Pipeline:  1
Req/conn:  unlimited (keep-alive)
Expected:  200
Duration:  5s

Thread Stats   Avg      p50      p90      p99    p99.9
  Latency    129us    125us    185us    245us    317us

19735722 requests in 5.00s, 19735721 responses
Throughput: 3.95M req/s
Bandwidth:  248.42MB/s
Status codes: 2xx=19735721, 3xx=0, 4xx=0, 5xx=0
Latency samples: 19735657 / 19735721 responses (100.0%)

Async Workload (Very unstable)

wrk -c 512 -t18 -d5s http://localhost:8080/

18 threads and 512 connections
Thread Stats   Avg      Stdev     Max   +/- Stdev
  Latency   435.74us  795.84us  12.73ms   88.81%
  Req/Sec   142.93k    29.31k  265.52k    68.29%
12883294 requests in 5.10s, 810.91MB read
Requests/sec: 2526866.89
Transfer/sec:    159.05MB

gcannon http://localhost:8080/ -c 512 -t 16 -d 5

gcannon v0.5.3
Target:    localhost:8080/
Threads:   16
Conns:     512 (32/thread)
Pipeline:  1
Req/conn:  unlimited (keep-alive)
Expected:  200
Duration:  5s

Thread Stats   Avg      p50      p90      p99    p99.9
  Latency    185us    135us    229us   1.84ms   4.10ms

13797048 requests in 5.00s, 13797048 responses
Throughput: 2.76M req/s
Bandwidth:  173.67MB/s
Status codes: 2xx=13797048, 3xx=0, 4xx=0, 5xx=0
Latency samples: 13796999 / 13797048 responses (100.0%)

io_uring read+write without IVTS reactor inline

This is similar to the model explored throughout the series, but with RunAsynchronousContinuation set to true on both IVTS. It is expected to deliver close results on both tests.

Reactor count: 12

Sync Workload

wrk -c 512 -t18 -d5s http://localhost:8080/

18 threads and 512 connections
Thread Stats   Avg      Stdev     Max   +/- Stdev
  Latency   515.72us  821.99us  12.67ms   87.67%
  Req/Sec   110.03k    21.14k  212.25k    71.55%
9946282 requests in 5.10s, 626.04MB read
Requests/sec: 1950919.66
Transfer/sec:    122.80MB

gcannon http://localhost:8080/ -c 512 -t 16 -d 5

gcannon v0.5.3
Target:    localhost:8080/
Threads:   16
Conns:     512 (32/thread)
Pipeline:  1
Req/conn:  unlimited (keep-alive)
Expected:  200
Duration:  5s

Thread Stats   Avg      p50      p90      p99    p99.9
  Latency    211us    164us    273us   1.55ms   3.79ms

12080236 requests in 5.00s, 12080325 responses
Throughput: 2.41M req/s
Bandwidth:  151.97MB/s
Status codes: 2xx=12080325, 3xx=0, 4xx=0, 5xx=0
Latency samples: 12080192 / 12080325 responses (100.0%)

Async Workload

wrk -c 512 -t18 -d5s http://localhost:8080/

18 threads and 512 connections
Thread Stats   Avg      Stdev     Max   +/- Stdev
  Latency   530.17us  842.05us  13.37ms   87.50%
  Req/Sec   108.43k    26.31k  204.89k    71.33%
9726083 requests in 5.03s, 612.18MB read
Requests/sec: 1935462.26
Transfer/sec:    121.82MB

gcannon http://localhost:8080/ -c 512 -t 16 -d 5

gcannon v0.5.3
Target:    localhost:8080/
Threads:   16
Conns:     512 (32/thread)
Pipeline:  1
Req/conn:  unlimited (keep-alive)
Expected:  200
Duration:  5s

Thread Stats   Avg      p50      p90      p99    p99.9
  Latency    213us    146us    265us   2.27ms   4.38ms

11952675 requests in 5.00s, 11952749 responses
Throughput: 2.39M req/s
Bandwidth:  150.45MB/s
Status codes: 2xx=11952749, 3xx=0, 4xx=0, 5xx=0
Latency samples: 11952633 / 11952749 responses (100.0%)

io_uring read + libc send write without IVTS reactor inline continuations

This is similar to the model explored throughout the series, but with RunAsynchronousContinuation set to true on both IVTS, and the write branch does not go through io_uring; it uses libc's send instead. It is expected to deliver close results on both tests. This is a hybrid approach and should be the middle ground between the first two models.

Reactor count: 12

Sync Workload

wrk -c 512 -t18 -d5s http://localhost:8080/

18 threads and 512 connections
Thread Stats   Avg      Stdev     Max   +/- Stdev
  Latency   410.23us  782.03us  12.08ms   87.21%
  Req/Sec   158.40k    45.57k  251.18k    63.78%
14361239 requests in 5.10s, 0.88GB read
Requests/sec: 2817277.09

gcannon http://localhost:8080/ -c 512 -t 16 -d 5

gcannon v0.5.3
Target:    localhost:8080/
Threads:   16
Conns:     512 (32/thread)
Pipeline:  1
Req/conn:  unlimited (keep-alive)
Expected:  200
Duration:  5s

Thread Stats   Avg      p50      p90      p99    p99.9
  Latency    154us     84us    176us   2.68ms   4.32ms

16551871 requests in 5.00s, 16551875 responses
Throughput: 3.31M req/s
Bandwidth:  208.27MB/s
Status codes: 2xx=16551875, 3xx=0, 4xx=0, 5xx=0
Latency samples: 16551825 / 16551875 responses (100.0%)

Async Workload

wrk -c 512 -t18 -d5s http://localhost:8080/

18 threads and 512 connections
Thread Stats   Avg      Stdev     Max   +/- Stdev
  Latency   418.96us  824.32us  17.51ms   88.51%
  Req/Sec   154.72k    25.68k  240.94k    68.76%
13955371 requests in 5.09s, 0.86GB read
Requests/sec: 2742025.94
Transfer/sec:    172.59MB

gcannon http://localhost:8080/ -c 512 -t 16 -d 5

gcannon v0.5.3
Target:    localhost:8080/
Threads:   16
Conns:     512 (32/thread)
Pipeline:  1
Req/conn:  unlimited (keep-alive)
Expected:  200
Duration:  5s

Thread Stats   Avg      p50      p90      p99    p99.9
  Latency    159us     85us    198us   1.99ms   4.41ms

15997491 requests in 5.00s, 15997498 responses
Throughput: 3.20M req/s
Bandwidth:  201.18MB/s
Status codes: 2xx=15997498, 3xx=0, 4xx=0, 5xx=0
Latency samples: 15997425 / 15997498 responses (100.0%)

epoll read+write with IVTS reactor inline continuations

Pure epoll approach with the same reactor threading architecture. Inline handler continuation for both IVTS.

Reactor count: 12

Sync Workload

wrk -c 512 -t18 -d5s http://localhost:8080/

18 threads and 512 connections
Thread Stats   Avg      Stdev     Max   +/- Stdev
  Latency   284.42us  610.90us  11.06ms   91.79%
  Req/Sec   188.08k    42.17k  288.89k    60.15%
17141225 requests in 5.10s, 2.01GB read
Requests/sec: 3358876.80
Transfer/sec:    403.61MB

gcannon http://localhost:8080/ -c 512 -t 16 -d 5

gcannon v0.5.3
Target:    localhost:8080/
Threads:   16
Conns:     512 (32/thread)
Pipeline:  1
Req/conn:  unlimited (keep-alive)
Expected:  200
Duration:  5s

Thread Stats   Avg      p50      p90      p99    p99.9
  Latency    160us     86us    194us   2.07ms   4.39ms

15856691 requests in 5.00s, 15856698 responses
Throughput: 3.17M req/s
Bandwidth:  199.56MB/s
Status codes: 2xx=15856698, 3xx=0, 4xx=0, 5xx=0
Latency samples: 15856636 / 15856698 responses (100.0%)

Async Workload

wrk -c 512 -t18 -d5s http://localhost:8080/

18 threads and 512 connections
Thread Stats   Avg      Stdev     Max   +/- Stdev
  Latency   458.63us    0.90ms  15.96ms   88.39%
  Req/Sec   150.84k    25.75k  232.74k    65.71%
13670697 requests in 5.10s, 1.60GB read
Requests/sec: 2680674.42
Transfer/sec:    322.12MB

gcannon http://localhost:8080/ -c 512 -t 16 -d 5

gcannon v0.5.3
Target:    localhost:8080/
Threads:   16
Conns:     512 (32/thread)
Pipeline:  1
Req/conn:  unlimited (keep-alive)
Expected:  200
Duration:  5s

Thread Stats   Avg      p50      p90      p99    p99.9
  Latency    159us     74us    185us   2.68ms   5.32ms

15386279 requests in 5.00s, 15386278 responses
Throughput: 3.08M req/s
Bandwidth:  369.72MB/s
Status codes: 2xx=15386278, 3xx=0, 4xx=0, 5xx=0
Latency samples: 15386230 / 15386278 responses (100.0%)

epoll read+write without IVTS reactor inline continuations

Pure epoll approach with the same reactor threading architecture. Threadpool handler continuation for both IVTS.

Reactor count: 6

Sync Workload

wrk -c 512 -t18 -d5s http://localhost:8080/

18 threads and 512 connections
Thread Stats   Avg      Stdev     Max   +/- Stdev
  Latency   391.31us  764.42us  13.71ms   88.16%
  Req/Sec   167.26k    26.31k  244.01k    75.88%
15179066 requests in 5.10s, 1.78GB read
Requests/sec: 2975933.84
Transfer/sec:    357.60MB

gcannon http://localhost:8080/ -c 512 -t 16 -d 5

gcannon v0.5.3
Target:    localhost:8080/
Threads:   16
Conns:     512 (32/thread)
Pipeline:  1
Req/conn:  unlimited (keep-alive)
Expected:  200
Duration:  5s

Thread Stats   Avg      p50      p90      p99    p99.9
  Latency    140us     96us    150us   2.06ms   4.15ms

18019801 requests in 5.00s, 18019801 responses
Throughput: 3.60M req/s
Bandwidth:  432.83MB/s
Status codes: 2xx=18019801, 3xx=0, 4xx=0, 5xx=0
Latency samples: 18019763 / 18019801 responses (100.0%)

Async Workload

wrk -c 512 -t18 -d5s http://localhost:8080/

18 threads and 512 connections
Thread Stats   Avg      Stdev     Max   +/- Stdev
  Latency   464.15us  838.78us  10.74ms   87.28%
  Req/Sec   158.12k    14.36k  266.80k    72.35%
14231176 requests in 5.10s, 1.18GB read
Requests/sec: 2790992.53
Transfer/sec:    236.89MB

gcannon http://localhost:8080/ -c 512 -t 16 -d 5

gcannon v0.5.3
Target:    localhost:8080/
Threads:   16
Conns:     512 (32/thread)
Pipeline:  1
Req/conn:  unlimited (keep-alive)
Expected:  200
Duration:  5s

Thread Stats   Avg      p50      p90      p99    p99.9
  Latency    154us     96us    154us   2.22ms   4.48ms

16342325 requests in 5.00s, 16342325 responses
Throughput: 3.27M req/s
Bandwidth:  277.35MB/s
Status codes: 2xx=16342325, 3xx=0, 4xx=0, 5xx=0
Latency samples: 16342273 / 16342325 responses (100.0%)

System.Net.Sockets (Kestrel stock) - epoll threadpool

Kestrel's stock network I/O with some tuning:

listener.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.ReuseAddress, true);
client.NoDelay = true;   // TCP_NODELAY

Sync Workload

wrk -c 512 -t18 -d5s http://localhost:8080/

18 threads and 512 connections
Thread Stats   Avg      Stdev     Max   +/- Stdev
  Latency   156.79us  342.31us   6.98ms   96.45%
  Req/Sec   174.25k    35.85k  266.63k    73.35%
15748223 requests in 5.10s, 0.97GB read
Requests/sec: 3088338.61
Transfer/sec:    194.39MB

gcannon http://localhost:8080/ -c 512 -t 16 -d 5

gcannon v0.5.3
Target:    localhost:8080/
Threads:   16
Conns:     512 (32/thread)
Pipeline:  1
Req/conn:  unlimited (keep-alive)
Expected:  200
Duration:  5s

Thread Stats   Avg      p50      p90      p99    p99.9
  Latency    141us    129us    176us    305us   3.17ms

18024579 requests in 5.00s, 18024579 responses
Throughput: 3.60M req/s
Bandwidth:  226.84MB/s
Status codes: 2xx=18024579, 3xx=0, 4xx=0, 5xx=0
Latency samples: 18024567 / 18024579 responses (100.0%)

Async Workload

wrk -c 512 -t18 -d5s http://localhost:8080/

18 threads and 512 connections
Thread Stats   Avg      Stdev     Max   +/- Stdev
  Latency   255.07us  507.29us  12.53ms   93.36%
  Req/Sec   150.64k    15.91k  235.46k    73.35%
13618906 requests in 5.10s, 857.21MB read
Requests/sec: 2671254.72
Transfer/sec:    168.14MB

gcannon http://localhost:8080/ -c 512 -t 16 -d 5

gcannon v0.5.3
Target:    localhost:8080/
Threads:   16
Conns:     512 (32/thread)
Pipeline:  1
Req/conn:  unlimited (keep-alive)
Expected:  200
Duration:  5s

Thread Stats   Avg      p50      p90      p99    p99.9
  Latency    169us    123us    237us   1.25ms   3.89ms

15043820 requests in 5.00s, 15043820 responses
Throughput: 3.01M req/s
Bandwidth:  189.25MB/s
Status codes: 2xx=15043820, 3xx=0, 4xx=0, 5xx=0
Latency samples: 15043756 / 15043820 responses (100.0%)

C# Networking Deep Dive with io_uring part 5 - Threadpool Rant

Diogo Martins — Sun, 24 May 2026 17:32:54 +0000

Part 5 was going to be about integrating with Kestrel; instead, this is going to be a rant about io_uring and the thread pool.

This story doesn't begin with io_uring. To be honest, I love epoll (plot twist :o), and the reason I have been experimenting with and researching io_uring for 7 months now is to understand whether it truly is a better alternative to epoll... for networking.

Now don't get me wrong, io_uring is great and I love so many things about it. It was originally created for disk/file I/O and greatly excels at it, but in my humble opinion it can be a mismatch for typical networking/back-end applications.

So, at this point you're probably wondering what the hell is going on and why I am saying this when so many people claim io_uring is their Coca-Cola in the desert. For benchmark enthusiasts who seek to push and squeeze numbers, io_uring is indeed fast, but that is exactly where it shines: micro benchmarks.

io_uring is a perfect match for the reactor pattern we have been exploring in this series. It performs especially well when the reactor is pinned to a thread for its entire lifetime and, again, that matches perfectly with how IValueTaskSource works. I haven't been sharing any benchmarks with you but let me tell you, Minima is fast, frighteningly fast, but I'll save the numbers for a future part.

This speed, however, comes with a shackle.

io_uring's speed in this model (Minima multi-reactor) is conditional. Two things must hold simultaneously: the reactor is the sole submitter to the ring (SINGLE_ISSUER with DEFER_TASKRUN), and the handler runs inline on the reactor thread, so that the IValueTaskSource resumes without leaving it. Both hold only as long as the handler never leaves the reactor but, guess what, the entire .NET backend world is built on leaving it: the thread pool, async/await resuming off-thread and, of course, Kestrel, whose model is "hand the connection to the pool". So the moment we await any real async work, the handler goes off the reactor thread, the response can't be submitted from that thread, and we are forced into a cross-thread handoff. Now that we are dealing with multiple threads, race conditions and deadlocks become possible, and fixing them is not free.

The reactor deadlock

Putting an SQE in the ring and bumping the SQ tail does nothing by itself. Without SQPOLL, the kernel only looks at the SQ when io_uring_enter is called; this is an explicit syscall and, in the Minima model, only the reactor can make it. The reactor's wake-up is, however, gated on completions (CQEs): the loop blocks in io_uring_enter(to_submit, min_complete=1, GETEVENTS), which submits whatever is pending and sleeps in the kernel until at least one CQE is available.

Let's dissect:

All connections are idle (keep-alive, no in-flight requests). Every reactor is asleep inside io_uring_enter, waiting for any completion.
A handler finishes on a pool thread and needs to send a response on connection C (owned by reactor R). It produces a SEND SQE and writes it into R's ring, bumping the tail.
But it does not call io_uring_enter (single issuer: only R may submit). The SEND SQE now sits in the ring, unsubmitted.
R won't run again until a CQE wakes it. The only CQE that would wake it is the completion of that SEND which is never submitted, because R is the thing that submits, and R is asleep. If C was the only connection with work, no other completion is coming.

To sum up, the pool thread is waiting for the reactor to submit its SQE, and the reactor is waiting for a completion that only that submission would produce: each is waiting on the other, deadlocked.

There are, however, ways to avoid this deadlock, for example by using a different wait call that accepts a timeout: if no CQEs are received it just times out, spinning the reactor loop. This is the solution in zerg's model, which comes with the obvious issue that we might hit this timeout too often. If the timeout is too large, performance takes a hit; if it is too small, we spin too much when traffic is low, consuming CPU for no reason.

Minima's solution to this problem is more elegant: an eventfd wake. After enqueuing work, the pool thread handler makes a syscall that causes the kernel to post an artificial wake CQE, unblocking the reactor. This solution comes with an ironic cost though: extra syscalls! The very reason we tried to move away from epoll.
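
The pool-thread side of such a wake can be sketched with a plain eventfd via P/Invoke. This is a sketch under assumptions, not Minima's code: ReactorWake is a hypothetical name, the flag constants are the x86-64/arm64 Linux values, and how the reactor turns the eventfd into a CQE (for example a poll or read SQE kept armed on this fd) is my assumption and is not shown.

```csharp
using System;
using System.Runtime.InteropServices;

// Sketch only: Linux-specific, and not Minima's actual wake code.
public sealed class ReactorWake : IDisposable
{
    private const int EFD_NONBLOCK = 0x800;  // value on x86-64/arm64 Linux
    private const int EFD_CLOEXEC = 0x80000; // value on x86-64/arm64 Linux

    [DllImport("libc", SetLastError = true)]
    private static extern int eventfd(uint initval, int flags);

    [DllImport("libc", SetLastError = true)]
    private static extern nint write(int fd, ref ulong value, nuint count);

    [DllImport("libc", SetLastError = true)]
    private static extern nint read(int fd, out ulong value, nuint count);

    [DllImport("libc", SetLastError = true)]
    private static extern int close(int fd);

    private readonly int _fd;

    public ReactorWake()
    {
        _fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
        if (_fd < 0)
            throw new InvalidOperationException($"eventfd failed, errno={Marshal.GetLastWin32Error()}");
    }

    // The reactor would keep a poll/read on this fd armed in its ring (assumption, not shown).
    public int Fd => _fd;

    // Pool thread: call after the work has been enqueued for the reactor.
    public void Signal()
    {
        ulong one = 1;
        if (write(_fd, ref one, (nuint)8) != 8)
            throw new InvalidOperationException($"eventfd write failed, errno={Marshal.GetLastWin32Error()}");
    }

    // Reactor: returns how many signals were coalesced since the last drain (0 if none).
    public ulong Drain() => read(_fd, out ulong count, (nuint)8) == 8 ? count : 0;

    public void Dispose() => close(_fd);
}
```

Each Signal is one extra syscall, which is the ironic cost mentioned above. The eventfd counter at least coalesces wakes, so the reactor drains any number of pending signals with a single read.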

Can SQPOLL solve this problem?

On paper it does: a kernel thread polls the submission queue, so the pool thread's SQE is picked up and submitted without the reactor having to call io_uring_enter, and there is no sleeping reactor to wake up. But the irony kicks in again: the poller itself may go to sleep, and then the SQPOLL poller needs a wake-up, which is the same problem. On top of that, SQPOLL and DEFER_TASKRUN are mutually exclusive, so we surrender the very completion batching that makes our model fast to begin with. So SQPOLL doesn't remove the wake, it relocates it, burns a kernel thread per reactor and costs us DEFER_TASKRUN.
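
The flag constraints can be written down as a tiny validator. This is a sketch that mirrors the rules as I understand them from io_uring_setup(2): the bit values are the ones in the kernel uapi header, but RingSetupFlags and RingSetup.IsAccepted are hypothetical names, and the kernel remains the authority on what it accepts.

```csharp
using System;

[Flags]
public enum RingSetupFlags : uint
{
    None = 0,
    SqPoll = 1u << 1,        // IORING_SETUP_SQPOLL
    SingleIssuer = 1u << 12, // IORING_SETUP_SINGLE_ISSUER
    DeferTaskrun = 1u << 13, // IORING_SETUP_DEFER_TASKRUN
}

public static class RingSetup
{
    // Hypothetical pre-check; models only the two rules discussed in the text.
    public static bool IsAccepted(RingSetupFlags flags)
    {
        bool sqPoll = (flags & RingSetupFlags.SqPoll) != 0;
        bool singleIssuer = (flags & RingSetupFlags.SingleIssuer) != 0;
        bool deferTaskrun = (flags & RingSetupFlags.DeferTaskrun) != 0;

        if (deferTaskrun && !singleIssuer) return false; // DEFER_TASKRUN requires SINGLE_ISSUER
        if (deferTaskrun && sqPoll) return false;        // DEFER_TASKRUN cannot be combined with SQPOLL
        return true;
    }
}
```

So the reactor configuration used in this series (SINGLE_ISSUER plus DEFER_TASKRUN) passes, and adding SQPOLL to it does not.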

We could, however, configure SQPOLL so that it never sleeps, or add a more complex mechanism to wake it up automatically. I might explore that in a future part but, to be honest, io_uring performance with SQPOLL has proven subpar for me, so I'd rather just go with epoll instead.

Now, epoll sidesteps all of this because it never separates the I/O from submitting it. send and recv are plain syscalls that any thread can call directly: a much cleaner implementation, no hacks.

So, how much faster than epoll is io_uring really? On a micro benchmark where the workload isn't delegated to the thread pool, it can be a tad faster, 5-10% in my own benchmarks, but on a real workload it is as fast at best. Given all the issues io_uring brings, such as security concerns, special kernel access, the need for recent kernel versions and implementation complexity, as of today it isn't worth it in my opinion, but I might change my mind with more research.

Regardless, io_uring has its place if we want to build applications where the handler/endpoint logic is lightweight, such as websockets or the kind of synchronous workloads that software like Redis or nginx thrives in (and it is not by chance that these use the multi-reactor architecture).

For the next parts I will continue with Minima and explore the reactor pattern and the send branch. Even though io_uring may not be a good fit for Kestrel, it can still deliver exceptional performance in certain application types.

C# Networking Deep Dive with io_uring part 4 - Zero Copy Receive

Diogo Martins — Mon, 18 May 2026 15:47:08 +0000

In this part 4 we explore a "side quest": io_uring's zero copy receive mechanism. Even though I'll still add the scaffold for the code, this will be more of a theoretical part, as I don't have a network card that supports it and so can't test it. Future parts will not use the zero copy receive mechanism.

Feel free to skip this part, as it isn't required for the later ones and won't affect them.

We usually see a lot of "hot path zero allocation super ultra fast socket server" here and there when people advertise their projects or frameworks. Well that's cute, pre allocate some memory and reuse it, but how about zero copy?

When we receive external data via network through a network interface card (NIC) typically the NIC uses DMA to write this data to kernel memory, which the kernel then copies to a memory space our apps can access. This copy is what can be avoided by having the NIC use DMA to write the data bytes directly into user space accessible memory instead of kernel memory.

What is DMA?

Direct Memory Access is a hardware capability that lets a device read or write RAM "on its own", without involving the CPU. Without DMA, the only way to get data from a device into memory is for the CPU to read from the device's registers and copy the data to RAM. The CPU is busy during the copy, and for a NIC pushing gigabits per second this would consume a whole core.

In our case (a NIC), the driver, running on the CPU, pre-populates a ring of Rx descriptors, each pointing at a RAM buffer. When a packet arrives off the wire, the NIC's DMA engine writes the packet bytes straight into the next buffer in that ring, flags the descriptor as done and interrupts the CPU, which processes it via NAPI. The CPU never copied the packet; the NIC placed it in RAM itself.

DMA targets physical addresses, not the virtual addresses our app sees. This address translation/pinning is why io_uring zcrx (zero copy receive) has to register our memory with the kernel first.

DMA is present on every normal network receive, the NIC DMAs the packet into kernel RAM buffers. The io_uring zcrx idea is to change the descriptor's target address so that the DMA lands in our registered memory instead, avoiding the extra kernel copy.

So, what changed?

In previous parts our recv path used an io_uring provided buffer ring: we allocated a big slab, sliced it into buffers and handed them to the kernel. When data arrived, the kernel picked a buffer and copied the data bytes into it. We want to remove that copy by having the NIC DMA the received bytes directly into our buffers.

To avoid making this part too extensive I'll focus on the main changes.

Let's do a high-level comparison between Minima (parts 1-3) and MinimaZero (part 4 with zcrx).

Who fills the buffer?
Minima - Kernel memcpys into our slab.
MinimaZero - NIC DMAs into our registered area.

This is the pivotal difference; everything that follows exists to support it. In parts 1-3 the kernel receives the packet into its own memory and memcpys the payload to one of our pre-registered slab buffers. With zero copy rx the NIC DMAs the payload straight into a memory area we register, avoiding the kernel copy.

What do we register?
Minima - Provided buffer ring PBUF_RING.
MinimaZero - zcrx ifq bound to NIC Rx queue (ZCRX_IFQ).

Parts 1-3 register a provided buffer ring with IORING_REGISTER_PBUF_RING, a pool of our own memory the kernel can copy into. In part 4 we register a zcrx interface queue with IORING_REGISTER_ZCRX_IFQ, which binds our memory area to one specific NIC hardware receive queue, "wiring" our memory into the NIC's DMA path.

Recv operation (to be multishotted)
Minima - RECV + IOSQE_BUFFER_SELECT
MinimaZero - RECV_ZC multishot, no buffer selection

Parts 1-3 use IORING_OP_RECV with the IOSQE_BUFFER_SELECT flag, which tells the kernel to pick a buffer from the provided ring only when data arrives; this makes idle connections cheap. In part 4 we use IORING_OP_RECV_ZC without buffer selection, as the destination was set when the ifq was registered. Both work with multishot.

Completion
Minima - 16-byte CQE, buffer id in flags
MinimaZero - 32-byte CQE, CQE32 plus trailing zcrx_cqe

In part 4 the 16 bytes are not enough, as the kernel needs to include information about where inside our area the NIC wrote.

Locating the data
Minima - slab + bid*size
MinimaZero - area + (off & ~AREA_MASK) from the token

Each completion points to the bytes. Parts 1-3 own fixed, numbered slots, so it's simple arithmetic: slab pointer plus buffer id times buffer size. In part 4 we don't own numbered slots; the NIC picks where to write in the provided area, and the completion carries the offset.
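
Both lookups are plain offset arithmetic, sketched below with offsets only so no unsafe code is needed. BufferLocator is a hypothetical helper, and the shift of 48 is how I read IORING_ZCRX_AREA_SHIFT in the kernel uapi header, so treat it as an assumption to verify against your kernel.

```csharp
using System;

// Hypothetical helper; computes byte offsets only, the caller adds them to the base pointer.
public static class BufferLocator
{
    // Mirrors IORING_ZCRX_AREA_SHIFT / IORING_ZCRX_AREA_MASK (assumed values, see lead-in).
    public const int ZcrxAreaShift = 48;
    public const ulong ZcrxAreaMask = ~((1UL << ZcrxAreaShift) - 1);

    // Parts 1-3: slab + bid * size.
    public static ulong SlabOffset(ushort bid, uint bufferSize) => (ulong)bid * bufferSize;

    // Part 4: area + (off & ~AREA_MASK). The high bits identify the area,
    // the low bits are the byte offset inside it.
    public static ulong ZcrxOffset(ulong off) => off & ~ZcrxAreaMask;
}
```

For example, buffer id 3 with 4096-byte buffers sits 12288 bytes into the slab, while a zcrx `off` keeps only its low bits as the offset into the registered area.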

Returning a buffer
Minima - ReturnBuffer
MinimaZero - refill queue entry RefillRqe

Similar lifecycle for both, buffers must be "handed back" so that they can be used again. What changed is that in parts 1-3 we return the buffer id to the provided buffer ring with ReturnBuffer, in part 4 we post an entry to the RefillRqe, this entry is basically a descriptor in the DMA area.

Concurrency
Minima - N reactors
MinimaZero - 1 reactor, 1 ifq, one HW queue

Concurrency changed a lot and, since I cannot test zcrx for lack of a NIC that supports it, I could not determine the best way to set this up.

In parts 1-3 we run N reactors, each with its own ring and buffer pool, and with SO_REUSEPORT the kernel spreads incoming connections across the reactors. zcrx breaks that: the ifq binds to one hardware receive queue and is fed by the NIC's flow steering. This basically means that a connection and its zero copy bytes can end up on different threads, which breaks the multi-reactor architecture where connections are owned by reactors and never thread hop. While it is still possible to have the multi-reactor pattern with zcrx by using multiple ifqs, I could not test it, so I won't cover it.

Host setup
Minima - None
MinimaZero - ethtool split + steering, NIC + kernel >= 6.15

Now the code

I decided to not include any code for this part as I cannot test it.

You can find my scaffolding here
It is a port to C# of an existing C implementation plus some "theoretical" changes I cannot test. It might be useful in the future if I manage to get my hands on a NIC that supports this.

Sources:

https://docs.kernel.org/networking/iou-zcrx.html
https://www.youtube.com/watch?v=LCuTSNDe1nQ
https://www.youtube.com/watch?v=WQ22zAPBSnQ
https://github.com/torvalds/linux/blob/master/io_uring/zcrx.c

C# Networking Deep Dive With io_uring part 3 - Touching the bytes

Diogo Martins — Wed, 13 May 2026 16:07:41 +0000

In part 2 we introduced an asynchronous API for reading data from the wire, where only the number of received bytes was considered. In this part 3, let's extend it to access the actual received data.

As usual, the entire source code can be found at Minima

The kernel pushes data to the CQ shared ring buffers whenever it arrives from the wire; this can represent a partial request, a complete request or more than one request. By request I mean pretty much a request from any protocol: HTTP/1.1, HTTP/2, gRPC, websocket, etc. Whenever we call ReadAsync, we receive a RecvSnapshot, the metadata for a snapshot of CQEs. Currently this metadata only includes the number of bytes from each CQE; we need to add a byte* pointing to where the data is.

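public unsafe struct Item
{
    public byte* Ptr;      // new: start of this CQE's data inside the buffer slab
    public ushort Bid;     // provided-buffer id picked by the kernel
    public int Len;        // number of bytes received
    public bool HasBuffer; // false when the completion carried no buffer

    public ReadOnlySpan<byte> AsSpan() => new(Ptr, Len); // new
}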

Then, on the reactor's dispatch receive branch:

else if (kind == KindRecv)
{
    bool   hasBuf = (cqe.flags & IORING_CQE_F_BUFFER) != 0;
    ushort bid    = hasBuf ? (ushort)(cqe.flags >> IORING_CQE_BUFFER_SHIFT) : (ushort)0;

    (...)

    byte* ptr = hasBuf ? _bufSlab + (nuint)bid * (nuint)BufferSize : null;
    conn.Complete(cqe.res, bid, hasBuf, ptr);

    (...)
}

hasBuf - true when the kernel attached a provided buffer to this completion.
bid - the index of the buffer slot the kernel picked from the provided-buffer ring for this recv.
_bufSlab - the contiguous unmanaged memory block backing every recv buffer the kernel can write into for this reactor.

So ptr points to the "slot" which can be calculated by knowing the buffer Id and the size of each buffer. These slots are contiguous memory allocated during initialization.

conn.Complete adds a new Item to our SPSC ring. In case you don't remember previous parts, each CQE has a "kind", kind==KindRecv means that this CQE signals data was received from the wire.

Now for each received CQE we can access the byte* where kernel stored the received data, each ReadAsync will return a snapshot that contains one or more Items, each Item contains the metadata for one CQE. On the handler side we must consume this data. We don't want to be dealing with pointers though, that would force us to use unsafe everywhere we touch the received data.

We already have a ReadOnlySpan view of the data via AsSpan(), but spans are ref structs and can't be used freely. Why is that? A span can only live on the stack, so it can't be stored in a field of a class or kept alive across an await, unlike its heap-friendly counterpart Memory/ReadOnlyMemory. We can't use ReadOnlyMemory directly either, because it can't be created from a byte*, even though this byte* points at long-lived memory in the _bufSlab we allocate for each reactor.

Enter UnmanagedMemoryManager

public sealed unsafe class UnmanagedMemoryManager : MemoryManager<byte>
{
    private readonly byte* _ptr;
    private readonly int _length;

    public ushort BufferId { get; }

    public byte* Ptr => _ptr;

    public int Length => _length;

    public UnmanagedMemoryManager(byte* ptr, int length)
    {
        _ptr = ptr;
        _length = length;
    }

    public UnmanagedMemoryManager(byte* ptr, int length, ushort bufferId)
    {
        _ptr = ptr;
        _length = length;
        BufferId = bufferId;
    }

    public override Span<byte> GetSpan() => new Span<byte>(_ptr, _length);

    public override MemoryHandle Pin(int elementIndex = 0) => new MemoryHandle(_ptr + elementIndex);

    public override void Unpin() { }

    public void Free()
    {
        if (_ptr != null)
        { 
            NativeMemory.AlignedFree(_ptr); 
        }
    }

    protected override void Dispose(bool disposing) { }
}

Similar to Span and Memory, UnmanagedMemoryManager is a view over memory. The abstract method implementations are just protocol: the data already comes pinned, since it lives in unmanaged memory. Pin is essentially a struct construction with negligible runtime cost; it exists purely to satisfy the contract. The actual cost is creating a new UnmanagedMemoryManager for each Item, and this can also be avoided by pre-allocating all the possible UnmanagedMemoryManager instances. Every bid maps to a fixed address in the slab: _bufSlab + bid * BufferSize. The pointer for a given bid never changes; only Len varies per recv. So we can pre-allocate one manager per slot at reactor init and reuse it forever.
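
The per-slot pre-allocation could look roughly like this. It is a sketch rather than Minima's code: SlotMemoryManager is a trimmed stand-in for UnmanagedMemoryManager so the example is self-contained, SlotManagerPool and GetMemory are hypothetical names, and the 64-byte alignment is an arbitrary choice. It needs unsafe code enabled and .NET 6 or later for NativeMemory.

```csharp
using System;
using System.Buffers;
using System.Runtime.InteropServices;

// Trimmed stand-in for UnmanagedMemoryManager: a view over one fixed slot.
public sealed unsafe class SlotMemoryManager : MemoryManager<byte>
{
    private readonly byte* _ptr;
    private readonly int _length;

    public ushort BufferId { get; }

    public SlotMemoryManager(byte* ptr, int length, ushort bufferId)
    {
        _ptr = ptr;
        _length = length;
        BufferId = bufferId;
    }

    public override Span<byte> GetSpan() => new Span<byte>(_ptr, _length);
    public override MemoryHandle Pin(int elementIndex = 0) => new MemoryHandle(_ptr + elementIndex);
    public override void Unpin() { }
    protected override void Dispose(bool disposing) { }
}

// Hypothetical pool: one manager per slot, created once at reactor init.
public sealed unsafe class SlotManagerPool : IDisposable
{
    private readonly byte* _slab;
    private readonly SlotMemoryManager[] _managers;

    public SlotManagerPool(int slotCount, int bufferSize)
    {
        _slab = (byte*)NativeMemory.AlignedAlloc((nuint)slotCount * (nuint)bufferSize, 64);
        _managers = new SlotMemoryManager[slotCount];
        for (int bid = 0; bid < slotCount; bid++)
        {
            // The address for a given bid never changes: slab + bid * size.
            byte* slot = _slab + (nuint)bid * (nuint)bufferSize;
            _managers[bid] = new SlotMemoryManager(slot, bufferSize, (ushort)bid);
        }
    }

    // Only the length varies per recv, so slicing the pre-built manager's
    // Memory (a struct) does not allocate.
    public Memory<byte> GetMemory(ushort bid, int len) => _managers[bid].Memory.Slice(0, len);

    public void Dispose() => NativeMemory.AlignedFree(_slab);
}
```

On the hot path a recv completion then turns into `pool.GetMemory(bid, len)`, with no new manager object created per Item.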

UnmanagedMemoryManager is the bridge between safe and unsafe code: by inheriting from MemoryManager<byte> it can be exposed as Memory<byte> and plugged into the entire BCL ecosystem for free:

PipeReader / PipeWriter
Stream.ReadAsync(Memory<byte>) / WriteAsync(ReadOnlyMemory<byte>)
ReadOnlySequence<byte> (built from ReadOnlyMemory<byte> segments)
IBufferWriter<byte>
Any async API that takes Memory<byte>

This is especially useful for ReadOnlySequence<byte>, which is very handy when dealing with TCP fragmentation.

This is how zero allocation is achieved on the receiving branch: the data is written to pre-allocated slots and we read directly from them. The data can be parsed by slicing over it and is valid until we return the CQE's buffer id, as can be seen in the snippet below through ReturnBuffer. After a buffer is returned, the kernel may reuse that "slot" for new incoming data, so its contents can be overwritten at any time.

So, how does our handler look now?

public static async Task HandleAsync(Reactor reactor, int fd, Connection conn)
{
    try
    {
        while (true)
        {
            RecvSnapshot snap = await conn.ReadAsync();

            while (conn.TryGetItem(snap, out SpscRecvRing.Item item))
            {
                if (item.HasBuffer)
                {
                    UnmanagedMemoryManager mem = item.AsMemoryManager();
                    ReadOnlyMemory<byte> data = mem.Memory;
                    // data is now usable with any BCL Memory<byte>/async API
                    _ = data.Length;

                    reactor.ReturnBuffer(mem.BufferId);
                }
                conn.QueueResponse(fd);
            }

            if (snap.IsClosed)
            {
                conn.Close(fd);
                return;
            }

            conn.ResetRead();
        }
    }
    catch (Exception ex)
    {
        Console.Error.WriteLine($"[r{reactor.Id}] handler crash on fd={fd}: {ex}");
        conn.Close(fd);
    }
}

The possibilities are now endless. We can build a ReadOnlySequence from all the data to facilitate slicing across multiple segments. In the case of incomplete requests we can also create a ReadOnlySequence, call another ReadAsync and add the received segments to the already existing ReadOnlySequence.
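As a rough illustration of chaining received buffers, the sketch below links two memory blocks into one ReadOnlySequence<byte>. RecvSegment and SequenceDemo are hypothetical names, not part of the series' source. Byte arrays stand in for the unmanaged slots here, and any ReadOnlyMemory<byte> would work the same way.

```csharp
using System;
using System.Buffers;

// Hypothetical helper: one node of a linked list of received buffers.
public sealed class RecvSegment : ReadOnlySequenceSegment<byte>
{
    public RecvSegment(ReadOnlyMemory<byte> memory) => Memory = memory;

    // Links a new segment after this one and returns it.
    public RecvSegment Append(ReadOnlyMemory<byte> memory)
    {
        var next = new RecvSegment(memory)
        {
            // RunningIndex is the total length of all segments before this one.
            RunningIndex = RunningIndex + Memory.Length
        };
        Next = next;
        return next;
    }
}

public static class SequenceDemo
{
    // Builds a sequence spanning two buffers, e.g. a request split across two recvs.
    public static ReadOnlySequence<byte> Build(ReadOnlyMemory<byte> first, ReadOnlyMemory<byte> second)
    {
        var head = new RecvSegment(first);
        RecvSegment tail = head.Append(second);
        return new ReadOnlySequence<byte>(head, 0, tail, tail.Memory.Length);
    }
}
```

A parser can then slice across the segment boundary without copying, as long as the underlying buffers have not been returned to the kernel yet.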

C# Networking Deep Dive With io_uring part 2 - Bridge the Async Model

Diogo Martins — Sun, 10 May 2026 15:37:18 +0000

In part 1 we built the minimal io_uring loop: setup, mmaps, SQE/CQE draining, accept/recv/send via opcode dispatch. This was the first step to understanding how the kernel interface works, but the dispatch logic was basically hand coded, a state machine that would not scale.
This second part is all about introducing the asynchronous model to await data pushed by the kernel. For simplicity, in this article's scope we will simply return the number of bytes received. Later parts will describe how to adapt this to return the actual request data and parse it.

As usual, the full source code can be found at ....

Why not just use Task?

The direct option would be to just return a Task, completed by the dispatcher when a CQE arrives. To avoid the allocation this would cause for every asynchronous read, we are going with the zero allocation option: a ValueTask over a reusable source. This is the high performance path, where the source is a single object that lives as long as the TCP connection.

public interface IValueTaskSource<out TResult> 
{
   TResult GetResult(short token);

   ValueTaskSourceStatus GetStatus(short token);

   void OnCompleted(Action<object?> continuation, object? state, short token, ValueTaskSourceOnCompletedFlags flags);

}

These three methods are all the runtime needs to drive the await "machinery". The token is just a safety net, every Reset() bumps an internal version, like a generation counter so that a stale awaiter passing an old token gets caught.

Typically we don't need to implement these three methods from scratch, the BCL provides ManualResetValueTaskSourceCore, a struct that holds/owns the value, version, captured continuation and the scheduling context.

We can delegate the interface methods to it

private ManualResetValueTaskSourceCore<RecvSnapshot> _readSignal;        

RecvSnapshot IValueTaskSource<RecvSnapshot>.GetResult(short token) => _readSignal.GetResult(token);     

ValueTaskSourceStatus IValueTaskSource<RecvSnapshot>.GetStatus(short token) => _readSignal.GetStatus(token);

void IValueTaskSource<RecvSnapshot>.OnCompleted(Action<object?> c, object? s, short t, ValueTaskSourceOnCompletedFlags f)                         
      => _readSignal.OnCompleted(c, s, t, f);

The generic T is RecvSnapshot, a small struct that captures what is available to read. The handler awaits this snapshot and drains the items it covers. This will be covered further on, for now just think of it as a snapshot that points to a circular buffer tail. The reason we need this and can't just drain the entire buffer is that new data can arrive while we process the current batch: the tail can change at any time, and that could cause a desync when extracting data.

_readSignal exposes Version, Reset(), SetResult(value) and SetException(ex), that is the whole API surface we will use. The "Manual" in ManualResetValueTaskSourceCore is the contract, Reset() must be called between completions, without it the next SetResult() would throw, this is intentional design, reuses without explicit resets could mask state management bugs.
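The Reset() contract and the token check can be tried in isolation. The following is a minimal sketch with a hypothetical ReusableSignal class carrying an int instead of a RecvSnapshot; it only uses the ManualResetValueTaskSourceCore members mentioned above.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Sources;

// Hypothetical name: a reusable, allocation-free completion source.
public sealed class ReusableSignal : IValueTaskSource<int>
{
    // Mutable struct: the field must not be readonly.
    private ManualResetValueTaskSourceCore<int> _core;

    public short Version => _core.Version;

    // Hands out a ValueTask bound to the current version (the token).
    public ValueTask<int> WaitAsync() => new ValueTask<int>(this, _core.Version);

    public void Complete(int value) => _core.SetResult(value);

    // Must be called between completions; bumps the version.
    public void Reset() => _core.Reset();

    int IValueTaskSource<int>.GetResult(short token) => _core.GetResult(token);

    ValueTaskSourceStatus IValueTaskSource<int>.GetStatus(short token) => _core.GetStatus(token);

    void IValueTaskSource<int>.OnCompleted(Action<object?> continuation, object? state, short token,
        ValueTaskSourceOnCompletedFlags flags) => _core.OnCompleted(continuation, state, token, flags);
}
```

After Complete and Reset, a ValueTask obtained before the Reset carries an old token, so reading its result throws InvalidOperationException instead of silently returning the wrong value.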

Meeting point

Each Connection has two sides:

Producer, the CQE dispatcher. When a CQE arrives it must be delivered.
Consumer, the application handler parked on await ReadAsync(), waiting for that result.

Most of the time these two are synchronized in the simplest case: the consumer parks waiting for data, the producer wakes it. There are however gaps that need to be addressed.

Consumer calls ReadAsync() to park but the producer already had a result read from a previous CQE. In this case we don't want to block/await, we want to return the buffered result synchronously.
Producer fires a result but the consumer hasn't called ReadAsync() yet. This can typically happen between two consumer awaits, we can't drop the result or overwrite it if more than one CQE arrives back to back.

Both cases boil down to the same issue/need, a buffer between the producer and consumer so that data is preserved no matter what runs first. A solution is a bounded single producer single consumer ring, producer enqueues from the dispatcher and the consumer dequeues from ReadAsync, this structure implementation can be found at SpscRecvRing.cs.

internal sealed class SpscRecvRing
{
   public struct Item { public ushort Bid; public int Len; public bool HasBuffer; }

   public bool TryEnqueue(in Item item);
   public bool TryDequeue(out Item item);
   public long SnapshotTail();
   public bool TryDequeueUntil(long tailSnapshot, out Item item);
   public bool IsEmpty();
}

One important characteristic is that this is a ring (circular buffer) whose size must be a power of 2. With a power-of-2 size, slots can be indexed by masking the head and tail counters, which removes the need to ever clear or reset the buffer.
In this part 2 scope we only care about Len, the number of bytes received. Bid (Buffer Id) and HasBuffer are for returning the buffer to the kernel ring and will be properly covered in part 3.

The two interesting methods are SnapshotTail and TryDequeueUntil, together they let the consumer take a "picture" of what was available at that moment and drain everything up to that point in a single batch, without chasing a moving tail, as was explained above.

internal readonly struct RecvSnapshot
{
   public readonly long Tail;
   public readonly bool IsClosed;
   public RecvSnapshot(long tail, bool isClosed) { Tail = tail; IsClosed = isClosed; }
   public static RecvSnapshot Closed() => new(0, isClosed: true);
}
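To show how SnapshotTail and TryDequeueUntil can work together, here is a simplified sketch of a single producer single consumer ring. MiniSpscRing is a hypothetical stand-in that stores plain ints; the actual SpscRecvRing.cs may be implemented differently.

```csharp
using System;
using System.Threading;

// Hypothetical simplified ring: exactly one producer thread and one consumer thread.
public sealed class MiniSpscRing
{
    private readonly int[] _items;
    private readonly int _mask;
    private long _head; // written only by the consumer
    private long _tail; // written only by the producer

    public MiniSpscRing(int capacityPow2)
    {
        if (capacityPow2 <= 0 || (capacityPow2 & (capacityPow2 - 1)) != 0)
            throw new ArgumentException("Capacity must be a power of 2.", nameof(capacityPow2));
        _items = new int[capacityPow2];
        _mask = capacityPow2 - 1;
    }

    public bool TryEnqueue(int item)
    {
        long tail = _tail;
        if (tail - Volatile.Read(ref _head) == _items.Length) return false; // full
        _items[(int)(tail & _mask)] = item;
        Volatile.Write(ref _tail, tail + 1); // publish the slot after writing it
        return true;
    }

    // The "picture": everything below this value is safe to drain.
    public long SnapshotTail() => Volatile.Read(ref _tail);

    // Dequeues only items that existed when the snapshot was taken.
    public bool TryDequeueUntil(long tailSnapshot, out int item)
    {
        long head = _head;
        if (head >= tailSnapshot) { item = default; return false; }
        item = _items[(int)(head & _mask)];
        Volatile.Write(ref _head, head + 1); // free the slot for the producer
        return true;
    }
}
```

The counters only ever grow and are masked when indexing, so an item enqueued after the snapshot stays in the ring untouched until the next snapshot covers it.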

Before moving on I'd like to leave a "small" paragraph here. I've been confronted many times with "but why do you say run synchronously? Don't we want everything asynchronous to be more efficient?". Typically, when we hear about asynchronous execution we think about parallelism, the thread pool and multithreading. This is generally not wrong, but not precise either. The typical async workflow is to tell the runtime to execute a piece of code/logic in the "future": a callback, a promise, or however we may name it. In C# this callback will usually execute on a thread from the thread pool (as .ConfigureAwait is typically set to false), and you could say that is one of the core mechanisms of the C# async/await model which makes it so good. But at the end of the day, if the CQE was received before we call ReadAsync, the data is already available and we can avoid this future callback entirely and execute the logic immediately. Short circuiting the whole callback is the most efficient scenario.

Delegating the Task execution to the thread pool sounds great but it comes with a price, which becomes noticeable when we want to deal with millions of requests per second. More importantly, more threads does not always mean faster. Software runs on a CPU with a limited number of hardware threads, typically the number of physical cores or 2x that value. While we can create thousands or tens of thousands of software threads in C#, those will all share that limited number of hardware threads. Here is where IValueTaskSource can shine: ManualResetValueTaskSourceCore leaves RunContinuationsAsynchronously at false by default, so the OnCompleted callback runs synchronously. Do not get confused here, synchronously means the callback runs on the thread that completes the operation (the one calling SetResult), not that the await blocks.

Before moving into how IValueTaskSource is bridged between the handler and the reactor loop let's first understand the four flags/states owned by the Connection class.

private ManualResetValueTaskSourceCore<RecvSnapshot> _readSignal;     // wake signal carrying the snapshot
private int _armed;                                                    // 1 when handler is parked
private int _closed;                                                   // sticky once recv returned <=0 or send failed
private readonly SpscRecvRing _recv = new(capacityPow2: 16);           // buffered recv items

_armed means the consumer is parked. After enqueueing the data into the ring, the producer checks _armed, if set, it fires SetResult with a fresh snapshot to wake the consumer. If not armed, the data just sits in the ring waiting for the next ReadAsync to pick it up via the synchronous fast path. There is no pending data mechanism, the SPSC ring is the bridge from edge triggered wakes (one CQE at a time) to level-triggered consumption (the handler reads when it can).
_closed is sticky, once set all future ReadAsync will return immediately.

On the producer side

public void Complete(int res, ushort bid, bool hasBuffer) {
   if (res <= 0) {
       _closed = 1;
       if (hasBuffer) _reactor.ReturnBuffer(bid);   // defensive: kernel rarely attaches a buffer here
   }
   else if (!_recv.TryEnqueue(new SpscRecvRing.Item { Bid = bid, Len = res, HasBuffer = hasBuffer })) {
       _closed = 1;
       if (hasBuffer) _reactor.ReturnBuffer(bid);   // overflow → handler can't keep up; return the buffer, close
   }

   if (_armed == 1) {
       _armed = 0;
       _readSignal.SetResult(new RecvSnapshot(_recv.SnapshotTail(), _closed != 0));
   }
}

On the failure path (res <= 0 or the Item can't be enqueued) we hand the buffer back to the kernel ring so that it doesn't slowly run out of recv buffers. If the consumer is already parked, we wake it with SetResult and pass a snapshot of the ring's tail at this moment. The snapshot is what makes this meaningful, it basically tells the handler "everything from your current head up to this tail can be drained in a single batch". If three CQEs arrived back to back while the handler was busy, all three are visible in the next snapshot.

The consumer side

public ValueTask<RecvSnapshot> ReadAsync() {
   if (!_recv.IsEmpty())  // synchronous fast path
       return new ValueTask<RecvSnapshot>(new RecvSnapshot(_recv.SnapshotTail(), _closed != 0));

   if (_closed != 0)
       return new ValueTask<RecvSnapshot>(RecvSnapshot.Closed());

   if (_armed == 1)
       throw new InvalidOperationException("ReadAsync already armed.");
   _armed = 1;

   return new ValueTask<RecvSnapshot>(this, _readSignal.Version);
}

If the ring has data, take a fresh snapshot and return it as a completed ValueTask: no state machine yield, allocation or wake required. This is the synchronous fast path.
If the ring is empty and the connection is closed, return a closed snapshot.
The check on _armed == 1 is a guardrail, two simultaneous ReadAsync calls on the same connection would "clobber" each other's continuation.
Then the typical case: set _armed and hand the awaiter a ValueTask(this, _readSignal.Version). The handler suspends, the runtime stores the continuation on _readSignal and the thread is freed.

Draining the snapshot

So, once the handler receives a snapshot it pulls items out one by one with TryGetItem.

public bool TryGetItem(in RecvSnapshot snap, out SpscRecvRing.Item item)
   => _recv.TryDequeueUntil(snap.Tail, out item);

TryDequeueUntil advances the consumer's head as far as snapshot's tail even if the producer has since moved ahead. The handler always sees a stable batch.

RecvSnapshot snap = await conn.ReadAsync();

while (conn.TryGetItem(snap, out SpscRecvRing.Item item)) {
  // process item.Len bytes; in part 3 we will also use item.Bid to read the data
   if (item.HasBuffer) reactor.ReturnBuffer(item.Bid);
   conn.QueueResponse(fd);
}

if (snap.IsClosed) { conn.Close(fd); return; }
conn.ResetRead();

IValueTaskSource "magic"

In simple terms, the "trick" here is that the continuation callback will always run on the same thread, avoiding the thread pool cost. The reactor/dispatcher loop always runs on the same software AND CPU thread even though it is asynchronous and await boundaries are hit.

When a recv CQE arrives, the reactor calls Complete which enqueues into the ring and calls SetResult with a fresh snapshot. With RunContinuationsAsynchronously left at its default false and no captured SynchronizationContext, the stored continuation is invoked inline, on the reactor thread:

Reactor.Run
--Dispatch(recv_cqe)
----Connection.Complete(res, bid, hasBuffer)
------_recv.TryEnqueue(...)
------_readSignal.SetResult(new RecvSnapshot(tail, isClosed))
--------continuation(state) ← function pointer call
----------HandleAsync.MoveNext ← state machine resumes after the await
------------RecvSnapshot snap = result; ← GetResult returns the snapshot
------------while (TryGetItem(...)) { ... process, send response, return buffer ... }
------------ResetRead, await ReadAsync
------------// ...the loop starts over, still on the reactor thread

The handler runs on top of the reactor's call stack. No thread-pool hop, no scheduler. SetResult is, mechanically, just a function call into the handler's continuation that happens to carry a snapshot as its payload. When the handler finishes draining, hits the next await and parks, the stack unwinds back to Dispatch which moves on to the next CQE.
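The inline resume can be observed with a small experiment. This is only a sketch under assumptions: InlineSignal and InlineDemo are hypothetical names, a plain thread plays the role of the reactor, and no SynchronizationContext is installed.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Sources;

// Hypothetical minimal source, RunContinuationsAsynchronously left at its default (false).
public sealed class InlineSignal : IValueTaskSource<int>
{
    private ManualResetValueTaskSourceCore<int> _core;

    public ValueTask<int> WaitAsync() => new ValueTask<int>(this, _core.Version);

    // Invokes the stored continuation inline, on the calling thread.
    public void Complete(int value) => _core.SetResult(value);

    int IValueTaskSource<int>.GetResult(short token) => _core.GetResult(token);
    ValueTaskSourceStatus IValueTaskSource<int>.GetStatus(short token) => _core.GetStatus(token);
    void IValueTaskSource<int>.OnCompleted(Action<object?> c, object? s, short t,
        ValueTaskSourceOnCompletedFlags f) => _core.OnCompleted(c, s, t, f);
}

public static class InlineDemo
{
    // Returns the id of the thread that ran the code after the await.
    public static async Task<int> HandlerAsync(InlineSignal signal)
    {
        await signal.WaitAsync().ConfigureAwait(false);
        return Environment.CurrentManagedThreadId;
    }

    // Parks the handler, completes it from a dedicated thread, and reports whether
    // the handler resumed on that same thread.
    public static bool ResumesOnCompletingThread()
    {
        var signal = new InlineSignal();
        Task<int> handler = HandlerAsync(signal); // runs until the await, then parks

        int completingThreadId = 0;
        var reactor = new Thread(() =>
        {
            completingThreadId = Environment.CurrentManagedThreadId;
            signal.Complete(1); // the handler continuation runs inside this call
        });
        reactor.Start();
        reactor.Join();

        return handler.IsCompleted && handler.Result == completingThreadId;
    }
}
```

If RunContinuationsAsynchronously were set to true on the core, the continuation would be queued to the thread pool instead and the two ids would normally differ.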

The reactor/dispatcher loop always runs on the same software AND CPU thread, and so does the handler — even though the code is fully asynchronous and await boundaries are hit on every iteration.

C# Networking Deep Dive With io_uring (part 1/5) - Toe Dipping

Diogo Martins — Wed, 29 Apr 2026 11:02:19 +0000

CQ - Completion Queue
SQ - Submission Queue
CQE - Completion Queue Entry
SQE - Submission Queue Entry

This post is the first part in a deep dive series on io_uring, it describes a basic example on how to bypass every abstraction and directly use the kernel interface for highest possible efficiency TCP networking using C# on Linux with io_uring. The source code used in this post can be found at zerg, project Minima - a simplified lightweight single threaded, lower performance version of zerg for learning purposes, do not use it for benchmarking purposes. The second and third parts will dive into more complex high performance cases and leveraging C# async I/O via IValueTaskSource.

io_uring is a modern Linux asynchronous I/O interface. The traditional path is epoll_wait (to find out which sockets are ready) plus a separate read/write/accept syscall to actually move bytes. Each of those crosses from user to kernel mode and back, and this round trip is expensive. io_uring's novelty is to take most of those syscalls off the hot path (in this post, a single io_uring_enter call per loop iteration replaces them). At startup we allocate two ring buffers in memory shared between our process and the kernel: a submission queue (SQ) where we write descriptions of the work we want done, and a completion queue (CQ) where the kernel posts the results. Simple enough.

The following snippets reference the Ring class, a minimum viable io_uring wrapper written in C#. Ring owns the references to the completion and submission queues plus the logic for pushing entries onto one and reading completions off the other.

The queues are allocated by the kernel but they live in memory shared with our process.

IoUringParams ioUringParams = default;
int fd = io_uring_setup(entries, &ioUringParams);

io_uring_setup is the syscall that creates both queues inside the kernel. The kernel decides their sizes (rounding entries up to a power of two), allocates the memory, and returns a file descriptor plus a struct (ioUringParams) telling us where inside that memory every field lives: head, tail, mask, the SQE array, the CQE array. At this point the queues exist but we can't touch them yet.

Mapping them into our address space

void* ringMem = mmap(null, ringBytes, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQ_RING);   // SQ + CQ metadata
void* sqeMem = mmap(null, sqeBytes, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQES);      // SQE array

The first call maps both the SQ ring metadata and the entire CQ, typically these share one single region on modern kernels. The second call maps the SQE array which is a separate region. After these two calls the same physical memory is now visible to our process and the kernel. One subtle detail here is that CQEs are never "created", they are written to slots that already exist, the CQ is a ring buffer of pre-allocated empty CQE structs.

Then just cache the pointers to each specific field and our Ring is ready.

byte* ringPointer = (byte*)ringMem;
ring._sqHead  = (uint*)(ringPointer + ioUringParams.sq_off.head);
ring._sqTail  = (uint*)(ringPointer + ioUringParams.sq_off.tail);
ring._sqArray = (uint*)(ringPointer + ioUringParams.sq_off.array);
ring._sqMask  = *(uint*)(ringPointer + ioUringParams.sq_off.ring_mask);

ring._cqHead = (uint*)(ringPointer + ioUringParams.cq_off.head);
ring._cqTail = (uint*)(ringPointer + ioUringParams.cq_off.tail);
ring._cqes   = (IoUringCqe*)(ringPointer + ioUringParams.cq_off.cqes);
ring._cqMask = *(uint*)(ringPointer + ioUringParams.cq_off.ring_mask);

The Socket

The socket itself is not an io_uring concept, it is a plain Berkeley socket created with libc. io_uring does not reinvent the socket/bind/listen concepts and there is no gain in doing so, as these are one-time setup calls. Syscall round-trip costs only matter when something runs millions of times per second.

A basic TCP listening socket, bound to all interfaces (0.0.0.0)

private static int OpenListener(ushort port) {
   int fd = socket(AF_INET, SOCK_STREAM, 0);
   if (fd < 0) throw new InvalidOperationException($"socket failed: {fd}");

   int one = 1;
   setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(int));

   sockaddr_in addr = default;
   addr.sin_family      = AF_INET;
   addr.sin_port        = Htons(port);
   addr.sin_addr.s_addr = 0; // 0.0.0.0

   if (bind(fd, &addr, (uint)sizeof(sockaddr_in)) < 0)
      throw new InvalidOperationException("bind failed");

   if (listen(fd, Backlog) < 0)
      throw new InvalidOperationException("listen failed");
   return fd;
}
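OpenListener calls an Htons helper that is not shown in the snippet. A minimal sketch of what such a helper could look like, assuming it only has to convert a 16-bit port from host to network (big-endian) byte order; NetOrder is a hypothetical container class.

```csharp
using System;
using System.Buffers.Binary;

public static class NetOrder
{
    // Host-to-network conversion for a 16-bit value (sin_port expects big-endian).
    public static ushort Htons(ushort host) =>
        BitConverter.IsLittleEndian ? BinaryPrimitives.ReverseEndianness(host) : host;
}
```

On a little-endian machine port 8080 (0x1F90) becomes 0x901F, and on a big-endian machine the value is returned unchanged.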

We have our socket and ring, now what?

Minima does not use io_uring multishot, this is a key feature that eliminates the need of a submission for every completion which drastically improves the overall performance. In this example for the sake of learning, it will not be used.

Submitting an accept SQE

Instead of calling accept and blocking as we would normally do with epoll, the SQE is described as an accept and we let the kernel fulfil it asynchronously.

private static void SubmitAccept(Ring ring, int listenFd) {
   IoUringSqe* sqe = ring.GetSqe();
   if (sqe == null) throw new InvalidOperationException("SQ full");

   Unsafe.InitBlockUnaligned(sqe, 0, 64);
   sqe->opcode    = IORING_OP_ACCEPT;
   sqe->fd        = listenFd;
   sqe->user_data = KindAccept | (uint)listenFd;
}

GetSqe() reserves a slot in the SQE array and returns a pointer to it. Here we fill the SQE, nothing has been sent to the kernel yet, just the local tail has been bumped.

opcode = IORING_OP_ACCEPT; fd = listenFd means we are accepting on this listening fd.

user_data is opaque to the kernel, whatever we write here comes back unchanged on the matching CQE. This way we know the kind of operation and fd on the CQE.

The SQE is now sitting in the array but the kernel has not seen it yet. Submission only happens when SubmitAndWait writes the kernel-visible tail, which means we can queue many SQEs cheaply and then flush them as a batch in one io_uring_enter call. This is half of the io_uring performance story.

The main loop

while (true) {
   int rc = ring.SubmitAndWait(1); // submit pending SQEs and block until 1+ CQE

   if (rc < 0 && rc != -4 /* EINTR */){
      Console.Error.WriteLine($"[minima] io_uring_enter failed: {rc}");
      break;
   }

   while (ring.TryGetCqe(out IoUringCqe cqe)){
      Dispatch(ring, listenFd, in cqe);
      ring.CqeSeen();
   }
}

Two loops and one syscall per outer iteration, which publishes every queued SQE and blocks until the kernel posts at least one completion (CQE). Again, in a high performance scenario this would be done very differently, avoiding checking for completions entirely in the hot path, as will be covered in part 2.

public int SubmitAndWait(uint waitFor) {
   uint published = *_sqTail;
   uint toSubmit  = _sqeTail - published;

   if (toSubmit > 0)
      Volatile.Write(ref *_sqTail, _sqeTail);

   if (toSubmit == 0 && waitFor == 0) return 0;

   uint flags = waitFor > 0 ? IORING_ENTER_GETEVENTS : 0;
   return io_uring_enter(_fd, toSubmit, waitFor, flags);
}

_sqeTail is our local cursor that we bump inside GetSqe, *_sqTail is the kernel-visible cursor inside the mmap. The difference between them is how many SQEs we have filled but not yet "announced".

Volatile.Write is a release fence, it ensures that the kernel sees every byte written into those SQE slots before it sees the new tail. Without this ordering there is a chance that the kernel could read the bumped tail and process an SQE that still contains stale data from a previous op.

After the fence, io_uring_enter does the rest, submits everything between the old and new tail and blocks until at least one CQE has been posted, both directions in a single call.

Reading completions

public bool TryGetCqe(out IoUringCqe cqe) {
  uint head = *_cqHead;
  uint tail = Volatile.Read(ref *_cqTail);

  if (head == tail) { cqe = default; return false; }

  cqe = _cqes[head & _cqMask];
  return true;
}

public void CqeSeen() => Volatile.Write(ref *_cqHead, *_cqHead + 1);

Again, Volatile.Read is the acquire fence that matches the kernel's release, so the moment we observe a new tail value the CQE data the kernel wrote is guaranteed to be visible.
If head equals tail we fall back to the outer SubmitAndWait. Otherwise we read the CQE at head & mask, which is very cheap as the ring is always power-of-two sized, hand it to Dispatch and bump _cqHead. This bump is what tells the kernel that the slot is free to reuse for a future CQE.
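The head & mask trick and the tail minus head arithmetic keep working even when the 32-bit cursors wrap around. A small sketch of that arithmetic, with RingIndex as a hypothetical helper class that is not part of Ring:

```csharp
public static class RingIndex
{
    // Slot index for a free-running cursor; mask is ring size minus one, size a power of two.
    public static uint Slot(uint cursor, uint mask) => cursor & mask;

    // Number of entries between head and tail, correct across uint wraparound.
    public static uint Pending(uint head, uint tail) => unchecked(tail - head);
}
```

Because 2^32 is a multiple of any power-of-two ring size, a cursor that overflows from uint.MaxValue to 0 moves from the last slot to slot 0 with no special case.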

Dispatching

private static void Dispatch(Ring ring, int listenFd, in IoUringCqe cqe) {
   ulong kind = cqe.user_data & 0xffffffff_00000000UL;
   int   fd   = (int)(cqe.user_data & 0xffffffffUL);

   if (kind == KindAccept) {
      if (cqe.res >= 0) {
         int clientFd = cqe.res;
         var conn = new Conn { Buffer = (byte*)NativeMemory.Alloc((nuint)BufferSize) };
         s_conns[clientFd] = conn;
         SubmitRecv(ring, clientFd, conn.Buffer, BufferSize);
      }
      SubmitAccept(ring, listenFd); // re-arm
   } else if (kind == KindRecv) {
       if (!s_conns.TryGetValue(fd, out var conn)) return;
       if (cqe.res <= 0) { CloseConn(fd, conn); return; }
       SubmitSend(ring, fd, conn.Buffer, (uint)cqe.res);
   } else if (kind == KindSend) {
       if (!s_conns.TryGetValue(fd, out var conn)) return;
       if (cqe.res <= 0) { CloseConn(fd, conn); return; }
       SubmitRecv(ring, fd, conn.Buffer, BufferSize);
   }
}

Here we crack the opaque user_data apart, retrieving the kind(high 32 bits) and fd(low 32 bits). Three branches, each moves the protocol forward by submitting the next SQE.
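The packing and unpacking of user_data can be sketched on its own. The Kind values below are assumptions chosen for illustration (any distinct values in the high 32 bits would do), and UserData is a hypothetical helper class.

```csharp
public static class UserData
{
    // Hypothetical kind tags living in the high 32 bits.
    public const ulong KindAccept = 1UL << 32;
    public const ulong KindRecv   = 2UL << 32;
    public const ulong KindSend   = 3UL << 32;

    // Combines an operation kind with the fd it targets.
    public static ulong Pack(ulong kind, int fd) => kind | (uint)fd;

    // High 32 bits: which operation completed.
    public static ulong Kind(ulong userData) => userData & 0xffffffff_00000000UL;

    // Low 32 bits: the fd the operation was submitted for.
    public static int Fd(ulong userData) => (int)(userData & 0xffffffffUL);
}
```

Since the kernel echoes user_data back unchanged on the CQE, Kind and Fd recover exactly what Pack stored at submission time.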

None of these SubmitX calls inside Dispatch enter the kernel, they just reserve a slot, fill it and bump the local tail, as we've seen before. Again, the flush only happens on the next outer iteration when SubmitAndWait runs, which means that a single io_uring_enter can absorb many completions and emit many submissions in a single go. This is one of the io_uring core concepts that make it scalable.

Going through what the dispatch branches do is not very important for this part 1, but:

KindAccept - Accepts new connections
KindRecv - Receives data
KindSend - The kernel notifies us about data we tried to send

In this example we are not really doing anything with the received data. This will be covered in the following parts: how to avoid allocations by reading directly from kernel shared memory, how to use modern io_uring features for zero allocation sends, and how to use incremental buffers to reuse rings on small data reads.

HttpArena - Benchmark Web Frameworks

Diogo Martins — Mon, 20 Apr 2026 21:37:39 +0000

HttpArena is a recent project whose goal is to build an open source platform where web frameworks are benchmarked for throughput performance, CPU usage, memory consumption and latency. Sounds familiar? Yes.. there are a few of them, so what does HttpArena bring that is different?

Much broader test coverage including Http/1.1, Http/2, Http/3, gRPC and Websocket tests.
Community driven, we are not a company or sponsored by companies that compete in the benchmarks.
Tries to benchmark workloads closer to what you see in real world applications, including benchmarks with reverse proxies, caching services like redis and distributed systems.
Cares all about competitive fairness: entries are subdivided into Production, Tuned, Infrastructure and Engine, each with clear, specific rules to avoid comparing apples to oranges and potatoes.

So, what does HttpArena want to benchmark?

Both micro benchmarks and workload types, see the full test suite here.

Target audience

We target both framework developers and users. While micro benchmarks target development metrics, every test, especially the workload-like ones, can be useful for users when picking a technology or framework.

Hardware

Currently we run all the benchmarks in a single box, a 64 core AMD Threadripper with 256GB RAM. This has pros and cons. The main pro is that we don't get bottlenecked by networking, as we would with a multiple box setup of 2 or more servers. The con is sharing the CPU between server and load generators. While this is not optimal, we take a lot of measures to minimize it, such as running the servers and load generators in separate containers with pinned cores for each, see Hardware and Topology.

Who can join?

Everyone is welcome! Not just to add a framework but also to improve the existing implementations and give us valuable feedback and ideas on existing and new tests.

Instant results

Open a PR and we benchmark it directly before approving or merging, with results about 10 minutes after opening the PR if a maintainer is online.

HTTP11Probe Compliance Platform

Diogo Martins — Sat, 14 Feb 2026 17:00:03 +0000

An open testing platform that probes HTTP/1.1 servers against RFC 9110/9112 requirements, smuggling vectors, and malformed input handling. Add your framework, get compliance results automatically.

Platform Website

Check the Leaderboard for various web frameworks.

uRocket - Reactor Networking in C# with io_uring

Diogo Martins — Mon, 29 Dec 2025 23:45:10 +0000

As a network performance enthusiast I've worked with multiple HTTP web frameworks using the C# System.Net.Socket as the interface between the framework and the OS. Working mainly in Linux, one of the aspects that always frustrated me was the non-existent support for io_uring in C#(Socket uses epoll), so I guess, it was time to do it myself.

uRocket (micro ring socket) is a single acceptor multi reactor architecture with await/async support, this means that as a user I can await reads from the wire and write to it as I please. The acceptor and reactors are fully customizable relying on a C-written shim. Basically, uRocket (C#) interops with liburingshim (compiled from uringshim.c) which is an interface between the C# and liburing.

What is the reactor pattern and why it matters

The reactor pattern decouples I/O operations from application threads by using event notification to multiplex thousands of connections across a small thread pool. In uRocket, a single acceptor thread handles incoming connections via io_uring's multishot accept, distributing clients across reactor threads, each owning its own io_uring instance, buffer ring, and connection table. This architecture can eliminate thread-per-connection overhead while avoiding cross-thread contention entirely. With io_uring, reactors become very efficient: submissions and completions go through shared memory rings that need few syscalls in steady state (none with SQPOLL), the kernel selects receive buffers directly from pre-registered rings (no per-receive buffer allocation or hand-off), and multishot operations fire hundreds of completions from a single submission. Application code can still spawn one task per connection for familiar async/await patterns: the reactor handles I/O multiplexing underneath while your code remains sequential and readable. This combination of reactor-pattern efficiency with idiomatic C# async/await is what makes uRocket unique.

io_uring

io_uring is a modern Linux kernel interface (introduced in 2019) that revolutionizes asynchronous I/O by replacing the traditional "syscall-per-operation" model with shared memory ring buffers. Instead of calling read(), write(), or accept() individually, each triggering expensive kernel transitions, applications submit I/O requests by writing entries to a Submission Queue (SQ) that lives in memory shared between userspace and the kernel. The kernel processes these asynchronously and reports completions via a Completion Queue (CQ), also in shared memory. Once initialized, the hot path needs very few syscalls, and with SQPOLL mode it can need none: applications write SQEs (Submission Queue Entries), a kernel thread polls for work, processes requests, and writes CQEs (Completion Queue Entries), all without crossing the userspace/kernel boundary. io_uring introduces powerful features beyond older APIs like epoll: multishot operations allow a single submission to produce hundreds of completions (one accept submission handles all incoming connections), buffer rings let the kernel pick a pre-registered buffer for each receive and report it back as a 16-bit ID, so the application does not have to supply a buffer per operation, and batching enables processing thousands of events per iteration. This design reduces the fundamental bottlenecks of traditional I/O: syscall overhead, per-operation buffer management, and resubmission costs.

Benchmarking uRocket vs System.Net.Socket

Since I do not own multiple server machines or top of the line Network Interface Cards, there is of course some level of noise in these benchmarks. The load is generated using wrk, and the source code for each server:
uRocket
System.Net.Socket

A few notes:

Non pipelined requests
No HTTP parsing, this is not a HTTP framework benchmark.
No TCP fragmentation, so one request maps to one response
Requests are sent through localhost and the load generator (wrk) runs on the same machine as the web servers (sadly, due to budget issues), which causes some bottleneck as we will see.
Both uRocket and System.Net.Socket are built as native AoT with the exact same flags:

<PropertyGroup>
    <ServerGarbageCollection>true</ServerGarbageCollection>
    <TieredPGO>true</TieredPGO>
    <SelfContained>true</SelfContained>
</PropertyGroup>

 <ItemGroup Condition="$(PublishAot) == 'true'">
        <RuntimeHostConfigurationOption Include="System.Threading.ThreadPool.HillClimbing.Disable" Value="true" />
</ItemGroup>

OS: Ubuntu Server 24.04, .NET 10
Processor: i9 14900K
RAM: 64GB 6000MHz

uRocket is still in early development phase so these results will likely be at least a little bit different in the future, take it with a little grain of salt, of course as the uRocket maintainer, I am biased. Maybe the code for System.Net.Socket could have some better optimization for the Socket configuration, I checked it vs Microsoft's asp net platform entry(uses System.Net.Socket) at TechEmpower benchmarks and it was significantly better performing so it looks legit to me.

Results

During the benchmarking I configured uRocket for a different number of reactors, always armed with multishot with or without SQPolling. The System.Net.Socket config remained always the same so this can be a point in favour of System.Net.Socket.

In the results table we can find:

The type (uRocket or Socket)
Number of reactors (only applies to uRocket)
Load (wrk command parameters)
CPU usage (the i9 14900K has 32 threads, so each 100% corresponds to one fully used thread)
RPS, Requests per Second (Average of 10 runs each)

High Load -t>16 -c512

uRocket uses much less CPU even at higher RPS values. During the test I noticed that my hardware bottlenecks uRocket once the number of reactors exceeds 12, because at higher RPS the load generator (wrk) also needs more CPU. For example, below 12 reactors the CPU usage grows linearly with the number of reactors, and then it plateaus.

We can also see that the perceived efficiency (RPS/CPU) decreases as the number of reactors grows: the best case is 4 reactors (4377) and the worst case is 16 reactors (2503), while for Socket all results are similar (~1700). This makes sense: since there is no thread pinning, the OS scheduler can run the reactors on the best CPU threads, and the i9 14900K has only a few performance cores. The wrk load is also lower in those cases.

Lower Load -t<16

At lower load, Socket somehow uses a lot of CPU. The results look almost too favorable towards io_uring, with a 4x better RPS/CPU ratio in many cases; again, the Socket implementation might not be fully optimized.

Conclusion

When I started this benchmarking, my one reference was an older Reddit post stating 23% extra performance for io_uring, which actually shows up in the maximum RPS: 3_354_231 vs 2_728_015 (~23% more). I also found other benchmarks online reporting that io_uring solutions typically use up to 50% less CPU, so that checks out as well. I hope this was interesting for you. Again, this is a simple benchmark and may contain inaccuracies.
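For anyone who wants to redo the arithmetic, the sketch below shows how the ~23% figure and the RPS/CPU metric can be computed. The BenchMath class and its method names are made up for illustration; only the two maximum RPS values come from the results above.

```csharp
using System;

// Hypothetical helper for the two metrics used in this post.
public static class BenchMath
{
    // Relative gain of candidate over baseline, in percent.
    public static double PercentGain(double candidate, double baseline)
        => (candidate - baseline) / baseline * 100.0;

    // Perceived efficiency: requests per second per CPU percent,
    // where 100% means one fully used hardware thread.
    public static double Efficiency(double rps, double cpuPercent)
        => rps / cpuPercent;
}
```

With the maximum RPS values quoted above, BenchMath.PercentGain(3_354_231, 2_728_015) comes out at roughly 22.96, which rounds to the 23% mentioned. The same kind of ratio applied to the efficiency figures gives 4377 / 1700, or about 2.6x, for the 4-reactor case against Socket.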

Future Work

uRocket still requires some features and polishing. After that, it will be integrated into frameworks such as Wired.IO and GenHTTP to test it on an actual HTTP framework.

Latency and startup time tests are also planned, as it boots extremely fast, especially with Native AOT.

From Serial Ports to WebSockets: Debugging Across Two Worlds

Diogo Martins — Tue, 23 Dec 2025 14:07:19 +0000

As an embedded C developer, I can say that I spend some (more than I wish) time in what I usually call the debugging loop: build binaries → flash → execute → measure some signal on my oscilloscope, rinse and repeat. Unlike high-level software development, it is often not simple to extract the information we need while debugging. One of the most common techniques is to wire up a simple UART serial port communication between a microcontroller and a PC and log some messages while the firmware is running — such a fantastic tool: full-duplex, easy to configure, and reliable communication between two targets.

For over a year now, I’ve been delving into the world of networking, and once again I often find myself needing to take advantage of a channel for debugging — but this time, a different one: the TCP channel. As a Linux user, higher-level languages like Java or Python are quite handy for wiring up a simple TCP socket and flushing some bytes up and down. However, when it comes to browsers, things are not so simple. We need to follow a protocol supported by the browser, such as WebSockets, which are not as simple as they might appear.
A typical use case I am faced with is connecting a Linux-based embedded system — which typically has no visual output — to my development machine, which hosts a simple frontend application that allows me to debug and monitor multiple external systems.

What I did not expect is that one day I would be using C# as my main high-level programming language on Linux. Big props to Microsoft and the fantastic work done with .NET cross-platform. Programming languages are tools, and coming from C, C# offers great value when it comes to quickly deploying something — whether for debugging, a DevOps script, or a quick prototype — while still providing the option of manual memory control and surprisingly high performance, awkwardly close to C++ or Rust.

Enter GenHTTP.
A third-party C# library that quickly rose to my list of favorites. The sheer utility it provides for building a quick HTTP web server is unparalleled compared to everything I’ve used, from Python to Java to C#. Today, I’d love to present a small piece of code showing how to wire up a very simple WebSocket using this library.

For the more curious, here is the official documentation on how to build a WebSocket with GenHTTP:

Echo WebSocket Server

using GenHTTP.Engine.Internal;
using GenHTTP.Modules.Websockets;

var websocket = Websocket.Functional()
    .OnMessage(async (connection, message) =>
    {
        await connection.WriteAsync(message.Data);
    });

await Host.Create()
    .Handler(websocket)
    .RunAsync();

This tiny piece of code hosts a server at localhost:8080 and can be easily modified to fit your needs. There are multiple flavors available, but I prefer the functional one, as it keeps everything more compact for me.
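To poke at a server like this without a browser, a small client built on .NET's ClientWebSocket can help. The sketch below is only one way to do it: the EchoClient class and its method names are hypothetical, it assumes the handler is reachable at the root path of ws://localhost:8080/, and it assumes the reply arrives in a single frame of at most 4096 bytes.

```csharp
using System;
using System.Net.WebSockets;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical test client for an echo WebSocket server.
public static class EchoClient
{
    // Builds ws://host:port/ ; the root path is an assumption about
    // where the WebSocket handler is mounted.
    public static Uri EndpointFor(string host, int port)
        => new UriBuilder("ws", host, port).Uri;

    // Sends one text message and returns the first frame sent back.
    public static async Task<string> EchoOnceAsync(
        Uri endpoint, string text, CancellationToken ct = default)
    {
        using var socket = new ClientWebSocket();
        await socket.ConnectAsync(endpoint, ct);

        // Send the whole message as a single, final text frame.
        var payload = Encoding.UTF8.GetBytes(text);
        await socket.SendAsync(
            new ArraySegment<byte>(payload), WebSocketMessageType.Text, true, ct);

        // Read one frame; larger or fragmented replies would need a loop.
        var buffer = new byte[4096];
        var result = await socket.ReceiveAsync(new ArraySegment<byte>(buffer), ct);

        await socket.CloseAsync(WebSocketCloseStatus.NormalClosure, "done", ct);
        return Encoding.UTF8.GetString(buffer, 0, result.Count);
    }
}
```

With the echo server running, awaiting EchoClient.EchoOnceAsync(EchoClient.EndpointFor("localhost", 8080), "ping") should return the same text that was sent.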

There is, of course, a lot more you can do with this powerful library when it comes to WebSockets. Personally, I often find myself doing very basic things, and for that use case, I extract a lot of value from it.