CQ - Completion Queue
SQ - Submission Queue
CQE - Completion Queue Entry
SQE - Submission Queue Entry
This post is the first part of a deep dive series on io_uring. It walks through a basic example of bypassing every abstraction and using the kernel interface directly for the most efficient TCP networking possible in C# on Linux. The source code used in this post can be found at zerg, project Minima - a simplified, lightweight, single threaded, lower performance version of zerg built for learning purposes; do not use it for benchmarking. The second and third parts will dive into more complex high performance cases and into leveraging C# async I/O via IValueTaskSource.
io_uring is Linux's modern asynchronous I/O interface. The traditional path is epoll_wait (to find out which sockets are ready) plus a separate read/write/accept syscall to actually move the bytes. Each of those crosses from user to kernel mode and back, and that round trip is expensive. io_uring's novelty is to take syscalls off the hot path almost entirely. At startup we allocate two ring buffers in memory shared between our process and the kernel: a submission queue (SQ) where we write descriptions of the work we want done, and a completion queue (CQ) where the kernel posts the results. Simple enough.
The following snippets reference the Ring class, a minimum viable io_uring wrapper written in C#. Ring owns the references to the completion and submission queues plus the logic for pushing entries onto one and reading completions off the other.
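For orientation, here is a minimal sketch of the state Ring carries. The field names match the ones used in the snippets below; everything else about the real class in Minima may differ.

internal unsafe sealed class Ring {
    internal int _fd;             // ring fd returned by io_uring_setup
    // Submission side
    internal uint* _sqHead;       // consumer cursor, advanced by the kernel
    internal uint* _sqTail;       // kernel-visible producer cursor
    internal uint* _sqArray;      // indirection array: ring slot -> SQE index
    internal uint _sqMask;        // entries - 1 (ring sizes are powers of two)
    internal IoUringSqe* _sqes;   // the mmap'ed SQE array
    internal uint _sqeTail;       // our local, not-yet-published tail
    // Completion side
    internal uint* _cqHead;       // consumer cursor, advanced by us
    internal uint* _cqTail;       // producer cursor, advanced by the kernel
    internal IoUringCqe* _cqes;   // pre-allocated CQE slots
    internal uint _cqMask;
}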
The queues are allocated by the kernel but they live in memory shared with our process.
IoUringParams ioUringParams = default;
int fd = io_uring_setup(entries, &ioUringParams);
io_uring_setup is the syscall that creates both queues inside the kernel. The kernel decides their sizes (rounding entries up to a power of two), allocates the memory, and returns a file descriptor plus a struct (ioUringParams) telling us where inside that memory every field lives: head, tail, mask, the SQE array, the CQE array. At this point the queues exist but we can't touch them yet.
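For reference, a sketch of those structs as C# interop types, mirroring the kernel's struct io_uring_params and its two offset sub-structs. The C# type names are my own; the field names follow the C definitions.

using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct IoSqringOffsets {
    public uint head, tail, ring_mask, ring_entries, flags, dropped, array;
    public uint resv1; public ulong resv2;
}

[StructLayout(LayoutKind.Sequential)]
struct IoCqringOffsets {
    public uint head, tail, ring_mask, ring_entries, overflow, cqes, flags;
    public uint resv1; public ulong resv2;
}

[StructLayout(LayoutKind.Sequential)]
struct IoUringParams {
    public uint sq_entries, cq_entries, flags;
    public uint sq_thread_cpu, sq_thread_idle;
    public uint features, wq_fd;
    public uint resv0, resv1, resv2;
    public IoSqringOffsets sq_off; // byte offsets into the SQ ring mapping
    public IoCqringOffsets cq_off; // byte offsets into the CQ ring mapping
}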
Mapping them into our address space
void* ringMem = mmap(null, ringBytes, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQ_RING); // SQ + CQ metadata
void* sqeMem = mmap(null, sqeBytes, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQES); // SQE array
The first call maps the SQ ring metadata and the entire CQ; on modern kernels these typically share a single region (the kernel advertises this via the IORING_FEAT_SINGLE_MMAP feature flag). The second call maps the SQE array, which is a separate region. After these two calls the same physical memory is visible to both our process and the kernel. One subtle detail here is that CQEs are never "created"; they are written to slots that already exist, because the CQ is a ring buffer of pre-allocated empty CQE structs.
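The sizes passed to mmap are derived from the offsets the kernel returned. A sketch following the usual arithmetic, assuming the single-region layout described above and the struct names from earlier:

// The SQ ring ends at its indirection array, the CQ ring at its CQE slots;
// one mapping must cover whichever of the two extends further.
uint sqRingBytes = ioUringParams.sq_off.array + ioUringParams.sq_entries * sizeof(uint);
uint cqRingBytes = ioUringParams.cq_off.cqes + ioUringParams.cq_entries * (uint)sizeof(IoUringCqe);
nuint ringBytes = Math.Max(sqRingBytes, cqRingBytes);
// The SQE array is its own mapping: one 64-byte SQE per entry.
nuint sqeBytes = ioUringParams.sq_entries * (uint)sizeof(IoUringSqe);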
Then just cache the pointers to each specific field and our Ring is ready.
byte* ringPointer = (byte*)ringMem;
ring._sqHead = (uint*)(ringPointer + ioUringParams.sq_off.head);
ring._sqTail = (uint*)(ringPointer + ioUringParams.sq_off.tail);
ring._sqArray = (uint*)(ringPointer + ioUringParams.sq_off.array);
ring._sqMask = *(uint*)(ringPointer + ioUringParams.sq_off.ring_mask);
ring._cqHead = (uint*)(ringPointer + ioUringParams.cq_off.head);
ring._cqTail = (uint*)(ringPointer + ioUringParams.cq_off.tail);
ring._cqes = (IoUringCqe*)(ringPointer + ioUringParams.cq_off.cqes);
ring._cqMask = *(uint*)(ringPointer + ioUringParams.cq_off.ring_mask);
The Socket
The socket itself is not an io_uring concept; it is a plain Berkeley socket created through libc. io_uring does not reinvent socket/bind/listen, and there is no gain in doing so: these are one-time setup calls, and syscall round-trip costs only matter when something runs millions of times per second.
A TCP socket with a basic configuration, listening on all interfaces (the code binds to 0.0.0.0):
private static int OpenListener(ushort port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) throw new InvalidOperationException($"socket failed: {fd}");
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(int));
    sockaddr_in addr = default;
    addr.sin_family = AF_INET;
    addr.sin_port = Htons(port);
    addr.sin_addr.s_addr = 0; // 0.0.0.0
    if (bind(fd, &addr, (uint)sizeof(sockaddr_in)) < 0)
        throw new InvalidOperationException("bind failed");
    if (listen(fd, Backlog) < 0)
        throw new InvalidOperationException("listen failed");
    return fd;
}
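Htons here is not libc's htons but a tiny local helper; a one-line sketch of what it presumably does (ports travel big-endian on the wire):

// Host-to-network byte order for a 16-bit port (requires System.Buffers.Binary).
private static ushort Htons(ushort value) =>
    BitConverter.IsLittleEndian ? BinaryPrimitives.ReverseEndianness(value) : value;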
We have our socket and ring, now what?
Minima does not use io_uring's multishot operations. Multishot is a key feature that eliminates the need for a new submission per completion, which drastically improves overall performance, but for the sake of learning it is left out of this example.
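For contrast only, here is roughly what a multishot accept would look like, assuming a kernel of 5.19 or newer; again, Minima does not do this:

// One SQE stays armed and produces a CQE per accepted connection; the kernel
// sets IORING_CQE_F_MORE on each CQE for as long as the operation stays live.
sqe->opcode = IORING_OP_ACCEPT;
sqe->fd = listenFd;
sqe->ioprio = IORING_ACCEPT_MULTISHOT; // the multishot flag lives in the ioprio field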
Submitting an accept SQE
Instead of calling accept and blocking as we normally would with epoll, we describe the accept in an SQE and let the kernel fulfil it asynchronously.
private static void SubmitAccept(Ring ring, int listenFd) {
    IoUringSqe* sqe = ring.GetSqe();
    if (sqe == null) throw new InvalidOperationException("SQ full");
    Unsafe.InitBlockUnaligned(sqe, 0, 64); // zero the 64-byte SQE slot
    sqe->opcode = IORING_OP_ACCEPT;
    sqe->fd = listenFd;
    sqe->user_data = KindAccept | (uint)listenFd;
}
GetSqe() reserves a slot in the SQE array and returns a pointer to it. Here we fill the SQE; nothing has been sent to the kernel yet, only the local tail has been bumped.
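A sketch of what GetSqe can look like, given the Ring fields from earlier (the real version in Minima may differ):

public IoUringSqe* GetSqe() {
    uint head = Volatile.Read(ref *_sqHead);
    if (_sqeTail - head == _sqMask + 1) return null; // ring is full
    uint idx = _sqeTail & _sqMask;
    _sqArray[idx] = idx; // identity mapping: ring slot -> SQE index
    _sqeTail++;          // bump the local tail only, nothing kernel-visible yet
    return _sqes + idx;
}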
opcode = IORING_OP_ACCEPT; fd = listenFd means we are accepting on this listening fd.
user_data is opaque to the kernel, whatever we write here comes back unchanged on the matching CQE. This way we know the kind of operation and fd on the CQE.
The SQE is now sitting in the array but the kernel has not seen it yet; submission only happens when SubmitAndWait writes the kernel-visible tail. This means we can cheaply queue many SQEs and then flush them as a batch in one io_uring_enter call. This is half of the io_uring performance story.
The main loop
while (true) {
    int rc = ring.SubmitAndWait(1); // submit pending SQEs and block until 1+ CQE
    if (rc < 0 && rc != -4 /* EINTR */) {
        Console.Error.WriteLine($"[minima] io_uring_enter failed: {rc}");
        break;
    }
    while (ring.TryGetCqe(out IoUringCqe cqe)) {
        Dispatch(ring, listenFd, in cqe);
        ring.CqeSeen();
    }
}
Two loops: the single syscall per outer iteration publishes every queued SQE and blocks until the kernel posts at least one completion (CQE). Again, in a high performance scenario this would be done very differently, avoiding blocking on completions in the hot path entirely, as will be covered in part 2.
public int SubmitAndWait(uint waitFor) {
    uint published = *_sqTail;
    uint toSubmit = _sqeTail - published;
    if (toSubmit > 0)
        Volatile.Write(ref *_sqTail, _sqeTail);
    if (toSubmit == 0 && waitFor == 0) return 0;
    uint flags = waitFor > 0 ? IORING_ENTER_GETEVENTS : 0;
    return io_uring_enter(_fd, toSubmit, waitFor, flags);
}
_sqeTail is our local cursor that we bump inside GetSqe; *_sqTail is the kernel-visible cursor inside the mmap. The difference between them is how many SQEs we have filled but not yet "announced".
Volatile.Write is a release fence, it ensures that the kernel sees every byte written into those SQE slots before it sees the new tail. Without this ordering there is a chance that the kernel could read the bumped tail and process an SQE that still contains stale data from a previous op.
After the fence, io_uring_enter does the rest: it submits everything between the old and new tail and blocks until at least one CQE has been posted, both directions in a single call.
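Neither io_uring_setup nor io_uring_enter has a libc wrapper on most systems (in C they come from liburing), so both go through the raw syscall entry point. A sketch of the bindings, using the syscall numbers shared by x86-64 and arm64; the negative-errno conversion matches the rc checks in the main loop above:

[DllImport("libc", SetLastError = true)]
private static extern long syscall(long number, long a1, long a2, long a3, long a4, long a5, long a6);

private const long SYS_io_uring_setup = 425;
private const long SYS_io_uring_enter = 426;

private static int io_uring_setup(uint entries, IoUringParams* p) {
    long rc = syscall(SYS_io_uring_setup, entries, (long)p, 0, 0, 0, 0);
    return rc < 0 ? -Marshal.GetLastPInvokeError() : (int)rc;
}

private static int io_uring_enter(int fd, uint toSubmit, uint minComplete, uint flags) {
    // libc's syscall() reports failure as -1 with errno set; fold that into
    // the negative-errno convention (e.g. -4 for EINTR) the caller expects.
    long rc = syscall(SYS_io_uring_enter, fd, toSubmit, minComplete, flags, 0 /* sigset */, 0);
    return rc < 0 ? -Marshal.GetLastPInvokeError() : (int)rc;
}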
Reading completions
public bool TryGetCqe(out IoUringCqe cqe) {
    uint head = *_cqHead;
    uint tail = Volatile.Read(ref *_cqTail);
    if (head == tail) { cqe = default; return false; }
    cqe = _cqes[head & _cqMask];
    return true;
}
public void CqeSeen() => Volatile.Write(ref *_cqHead, *_cqHead + 1);
The fences pair up here as well: Volatile.Read on the tail is the acquire side matching the kernel's release write, so the moment we observe a new tail value, the CQE data the kernel wrote is guaranteed to be visible.
If head equals tail we fall back to the outer SubmitAndWait; otherwise we read the CQE at head & mask, which is very cheap because the ring is always power-of-two sized, hand it to Dispatch, and bump _cqHead. This bump is what tells the kernel that the slot is free to reuse for a future CQE.
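To make the cursor arithmetic concrete, a tiny worked example, assuming a ring of 8 entries:

// entries = 8, so mask = 7; cursors are free-running and never wrapped by hand.
uint head = 14, tail = 16;   // tail - head == 2 CQEs pending
uint slot = head & 7;        // 14 & 7 == 6: physical slot of the next CQE
// Unsigned overflow wraps naturally, so tail - head stays correct even after
// the 32-bit cursors roll over; the mask picks the slot, the cursors count.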
Dispatching
private static void Dispatch(Ring ring, int listenFd, in IoUringCqe cqe) {
    ulong kind = cqe.user_data & 0xffffffff_00000000UL;
    int fd = (int)(cqe.user_data & 0xffffffffUL);
    if (kind == KindAccept) {
        if (cqe.res >= 0) {
            int clientFd = cqe.res;
            var conn = new Conn { Buffer = (byte*)NativeMemory.Alloc((nuint)BufferSize) };
            s_conns[clientFd] = conn;
            SubmitRecv(ring, clientFd, conn.Buffer, BufferSize);
        }
        SubmitAccept(ring, listenFd); // re-arm
    } else if (kind == KindRecv) {
        if (!s_conns.TryGetValue(fd, out var conn)) return;
        if (cqe.res <= 0) { CloseConn(fd, conn); return; }
        SubmitSend(ring, fd, conn.Buffer, (uint)cqe.res);
    } else if (kind == KindSend) {
        if (!s_conns.TryGetValue(fd, out var conn)) return;
        if (cqe.res <= 0) { CloseConn(fd, conn); return; }
        SubmitRecv(ring, fd, conn.Buffer, BufferSize);
    }
}
Here we crack the opaque user_data apart, retrieving the kind (high 32 bits) and the fd (low 32 bits). Three branches, each moving the protocol forward by submitting the next SQE.
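The Kind constants are just tags parked in the high half of user_data; an assumed encoding (any distinct values confined to the high 32 bits work):

private const ulong KindAccept = 1UL << 32;
private const ulong KindRecv   = 2UL << 32;
private const ulong KindSend   = 3UL << 32;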
None of these SubmitX calls inside Dispatch enter the kernel; they just reserve a slot, fill it and bump the local tail, as we've seen before. The flush only happens on the next outer iteration when SubmitAndWait runs, which means a single io_uring_enter can absorb many completions and emit many submissions in one go. This is one of the core io_uring concepts that make it scalable.
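SubmitRecv and SubmitSend follow exactly the SubmitAccept pattern with a buffer attached; a sketch, assuming the addr/len field names from the kernel's SQE layout:

private static void SubmitRecv(Ring ring, int fd, byte* buffer, uint len) {
    IoUringSqe* sqe = ring.GetSqe();
    if (sqe == null) throw new InvalidOperationException("SQ full");
    Unsafe.InitBlockUnaligned(sqe, 0, 64);
    sqe->opcode = IORING_OP_RECV;
    sqe->fd = fd;
    sqe->addr = (ulong)buffer; // where the kernel writes the received bytes
    sqe->len = len;
    sqe->user_data = KindRecv | (uint)fd;
}

private static void SubmitSend(Ring ring, int fd, byte* buffer, uint len) {
    IoUringSqe* sqe = ring.GetSqe();
    if (sqe == null) throw new InvalidOperationException("SQ full");
    Unsafe.InitBlockUnaligned(sqe, 0, 64);
    sqe->opcode = IORING_OP_SEND;
    sqe->fd = fd;
    sqe->addr = (ulong)buffer; // bytes to transmit
    sqe->len = len;
    sqe->user_data = KindSend | (uint)fd;
}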
What each dispatch branch does is not very important for this part 1, but briefly:
- KindAccept - accept the new connection
- KindRecv - receive data
- KindSend - the kernel notifies us about data we tried to send
In this example we are not really doing anything with the received data. The following parts will cover how to avoid allocations by reading directly from kernel shared memory, and how to use modern io_uring features for zero allocation sends and incrementally consumed buffers that let small reads reuse ring buffers.