Sumant
Building an Async Rust Runtime on io_uring: 7.5ms vs Tokio's 14.9ms

You use async/await every day. But do you know what actually happens when your code "pauses"? I didn't, so I built something to find out.

The result is RingCore, a minimal async runtime in Rust, built directly on Linux's io_uring, with zero abstraction layers in the way. No Tokio. No hidden thread pools. Just Rust, a kernel interface, and a lot of curiosity.


The Question That Started Everything

If you've written async Rust, you've probably typed this:

let data = file.read().await;

And it just works. The program doesn't freeze. Other tasks keep running.

But I kept asking: what is actually happening when .await suspends a task? Where does execution go? Who wakes it back up? How does the OS fit into any of this?

Most tutorials stop at "the runtime handles it." That answer never satisfied me.


Why Does Async Exist at All?

Imagine you're a chef in a kitchen. You put a steak on the grill and just stand there watching it cook. You don't prep the salad. You don't plate the dessert. You just wait.

That's synchronous I/O. Your program calls read(), the OS fetches data from disk or the network, and your thread sits idle until it comes back. Wasteful.

Async I/O lets you be a smarter chef. You start the steak, set a timer, and go do other things. When the timer fires, you come back and finish.

In Rust, async/await is the language-level mechanism for writing this kind of code. But Rust itself doesn't define how the waiting works; that's the runtime's job. Most people reach for Tokio, which is fantastic and production-ready. But it's also a black box.

I wanted the white box.


Enter io_uring: The Kernel's Secret Weapon

Traditional async I/O on Linux is expensive. Every interaction with the kernel requires a context switch, which is a CPU jump from user mode (your program) into kernel mode (the OS) and back. Under heavy I/O load, these add up fast.

io_uring, introduced in Linux 5.1 by kernel developer Jens Axboe, takes a radically different approach. Instead of making individual system calls, your program and the kernel share two ring buffers in memory:

  • Submission Queue (SQ): You write your I/O requests here.
  • Completion Queue (CQ): The kernel writes results back here.

Think of it like a diner counter with a ticket window. Instead of running to the kitchen for every order, you slide all your tickets through the window at once and the kitchen slides the finished plates back. One trip. Maximum efficiency.

Multiple I/O operations can be batched into a single io_uring_enter system call. Context switches plummet. Performance soars.
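The index arithmetic those rings rely on is simple: capacities are powers of two, so a bitmask replaces the modulo. Here's a minimal single-threaded sketch of the head/tail scheme (illustrative types only, not io_uring's actual layout, and without the atomics and memory barriers the real shared rings need):

```rust
// Sketch of the head/tail index scheme io_uring-style rings use.
// Capacity is a power of two, so `index & mask` replaces `index % capacity`.
struct Ring<T> {
    entries: Vec<Option<T>>,
    mask: u32, // capacity - 1
    head: u32, // consumer position, only ever incremented
    tail: u32, // producer position, only ever incremented
}

impl<T> Ring<T> {
    fn new(capacity: u32) -> Self {
        assert!(capacity.is_power_of_two());
        Self {
            entries: (0..capacity).map(|_| None).collect(),
            mask: capacity - 1,
            head: 0,
            tail: 0,
        }
    }

    // Producer side: the app writing SQEs (or the kernel writing CQEs).
    fn push(&mut self, item: T) -> bool {
        if self.tail.wrapping_sub(self.head) > self.mask {
            return false; // ring is full
        }
        self.entries[(self.tail & self.mask) as usize] = Some(item);
        self.tail = self.tail.wrapping_add(1);
        true
    }

    // Consumer side: the kernel reading SQEs (or the app reaping CQEs).
    fn pop(&mut self) -> Option<T> {
        if self.head == self.tail {
            return None; // ring is empty
        }
        let item = self.entries[(self.head & self.mask) as usize].take();
        self.head = self.head.wrapping_add(1);
        item
    }
}

fn main() {
    let mut ring = Ring::new(4);
    for i in 0..4 {
        assert!(ring.push(i));
    }
    assert!(!ring.push(99)); // capacity 4, already holds 4 entries
    assert_eq!(ring.pop(), Some(0)); // FIFO order
}
```

Because head and tail only ever increment (wrapping on overflow), the producer and consumer never write the same index, which is what lets the real rings be shared safely between your process and the kernel.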


How RingCore Works: A Tour of the Four Layers

Layer 1: Talking to the Kernel (src/sys.rs, src/ring.rs)

The lowest layer handles raw kernel communication. No OS library wrappers. No abstraction. RingCore manually invokes SYS_IO_URING_SETUP and SYS_IO_URING_ENTER via libc, and uses mmap to map the kernel's SQ and CQ ring buffers directly into the process's address space.

// Manually invoke the io_uring_setup syscall
let ring_fd = unsafe {
    libc::syscall(
        libc::SYS_io_uring_setup,
        QUEUE_DEPTH as libc::c_long,
        &params as *const _ as libc::c_long,
    )
} as i32;

// Map the Submission Queue into our address space
let sq_ptr = unsafe {
    libc::mmap(
        std::ptr::null_mut(),
        sq_size,
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_SHARED | libc::MAP_POPULATE,
        ring_fd,
        libc::IORING_OFF_SQ_RING as libc::off_t,
    )
};

This is the part most async tutorials skip entirely. In RingCore, it's front and center.


Layer 2: Wrapping Operations in Futures (src/op.rs)

This is where things get interesting. Rust's Future trait is simple: poll it and get Poll::Ready(value) if the result is done, or Poll::Pending if not, in which case the future stashes the Waker it was given so someone can nudge it later.

In RingCore, every io_uring operation becomes a Future. Here's the key poll implementation:

impl Future for Op {
    type Output = i32;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // If this is the first poll, submit the SQE to the ring
        if !self.submitted {
            RING.with(|ring| {
                let mut ring = ring.borrow_mut();
                // Write the Submission Queue Entry to the shared kernel buffer
                ring.push_sqe(self.sqe);
            });

            // Store the Waker in a global map, keyed by our unique operation ID
            // The executor will retrieve this when the kernel signals completion
            WAKER_MAP.with(|map| {
                map.borrow_mut().insert(self.user_data, cx.waker().clone());
            });

            self.submitted = true;
            return Poll::Pending; // Go away, we'll call you when the kernel is done
        }

        // Check if our Completion Queue Entry has arrived
        match self.result.take() {
            Some(res) => Poll::Ready(res),
            None => {
                // Update the waker and keep waiting
                WAKER_MAP.with(|map| {
                    map.borrow_mut().insert(self.user_data, cx.waker().clone());
                });
                Poll::Pending
            }
        }
    }
}

The elegant part: when the kernel finishes and writes a CQE with a matching ID, the executor retrieves the stored Waker and calls it. No magic: it's just a map, an ID, and a callback.
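That map-and-wake pattern is easy to reproduce in miniature with std alone. A sketch (the FlagWaker type here is illustrative, not RingCore's code — a real executor's wake would push the task back onto the ready queue instead of flipping a flag):

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Wake, Waker};

// A toy Waker that flips a flag when woken, standing in for
// "reschedule this task on the executor's ready queue".
struct FlagWaker {
    woken: AtomicBool,
}

impl Wake for FlagWaker {
    fn wake(self: Arc<Self>) {
        self.woken.store(true, Ordering::SeqCst);
    }
}

fn main() {
    let mut waker_map: HashMap<u64, Waker> = HashMap::new();

    // The task submits an op with user_data = 42 and parks its waker.
    let flag = Arc::new(FlagWaker { woken: AtomicBool::new(false) });
    waker_map.insert(42, Waker::from(flag.clone()));

    // Later, a CQE arrives carrying user_data = 42: look up and wake.
    if let Some(waker) = waker_map.remove(&42) {
        waker.wake();
    }
    assert!(flag.woken.load(Ordering::SeqCst));
}
```

The `std::task::Wake` trait (stable since Rust 1.51) does the vtable plumbing that hand-rolled runtimes used to write with `RawWaker`.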


Layer 3: The Executor (src/executor.rs)

The executor is the brain that orchestrates everything. Its main loop is beautifully simple:

pub fn run(&mut self) {
    loop {
        // Step 1: Poll all tasks that have been woken up
        while let Some(task) = self.ready_queue.pop_front() {
            let waker = task.waker();
            let mut cx = Context::from_waker(&waker);

            match task.future.borrow_mut().as_mut().poll(&mut cx) {
                Poll::Ready(_) => { /* Task complete, drop it */ }
                Poll::Pending => { /* Task is waiting on I/O, leave it */ }
            }
        }

        // Step 2: Submit pending SQEs and harvest completed CQEs
        // min_complete=1 means: block until at least one operation finishes
        // This puts the thread to sleep until the kernel has work for us
        let completed = self.ring.submit_and_wait(1);

        // Step 3: For each completed operation, wake the waiting task
        for cqe in completed {
            WAKER_MAP.with(|map| {
                if let Some(waker) = map.borrow_mut().remove(&cqe.user_data) {
                    // Store the result, then wake the future
                    store_result(cqe.user_data, cqe.res);
                    waker.wake();
                }
            });
        }

        if self.all_tasks_complete() {
            break;
        }
    }
}

This is a classic event loop similar in spirit to Node.js, but with direct kernel access instead of libuv underneath.
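For intuition, the same loop shrinks to a single-future block_on if you swap submit_and_wait for thread parking. A minimal sketch (illustrative only, not RingCore's executor):

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

// Waking unparks the executor thread so it polls again.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            // Sleep until some waker calls unpark(). RingCore's executor
            // sleeps in submit_and_wait(1) at this point instead.
            Poll::Pending => thread::park(),
        }
    }
}

fn main() {
    // async blocks compile to state machines; this one is Ready on first poll.
    let n = block_on(async { 20 + 22 });
    assert_eq!(n, 42);
}
```

Everything else in a real executor — the ready queue, the waker map, the CQE harvest — exists to multiplex many such futures over that one loop.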


Layer 4: Friendly Wrappers (src/net.rs)

The top layer gives you TcpListener and TcpStream with clean async fn methods. They feel like normal Rust networking but under the hood, they're submitting SQEs to the ring.

impl TcpStream {
    pub async fn read(&self, buf: &mut [u8]) -> io::Result<usize> {
        // This creates an Op future that submits IORING_OP_READ
        // and suspends until the kernel completes it
        let result = Op::read(self.fd, buf).await;
        if result < 0 {
            Err(io::Error::from_raw_os_error(-result))
        } else {
            Ok(result as usize)
        }
    }
}

The whole stack: four files, clean separation, nothing hidden.
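One convention worth spelling out: io_uring reports failures as negative errno values in the CQE's res field, which is why the wrapper above negates before converting. A standalone sketch of that conversion (the helper name cvt is mine, not RingCore's API):

```rust
use std::io;

// io_uring CQEs carry a raw i32 result: >= 0 is success (e.g. bytes read),
// < 0 is a negated errno, mirroring the kernel's internal convention.
fn cvt(res: i32) -> io::Result<usize> {
    if res < 0 {
        Err(io::Error::from_raw_os_error(-res))
    } else {
        Ok(res as usize)
    }
}

fn main() {
    assert_eq!(cvt(128).unwrap(), 128);
    // -2 is -ENOENT: "No such file or directory".
    let err = cvt(-2).unwrap_err();
    assert_eq!(err.kind(), io::ErrorKind::NotFound);
}
```

This differs from classic syscalls, which return -1 and stash the error in the thread-local errno; with io_uring the error travels inside the completion itself.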


The Benchmarks

Tested on Debian 13, Kernel 6.12. Comparing RingCore against std and Tokio.

File I/O: reading a 100MB file

| Runtime | Real Time | System Time |
|---|---|---|
| std::fs (synchronous) | 0.057s | 0.016s |
| Tokio (epoll + thread pool) | 0.461s | 0.376s |
| RingCore (io_uring) | 0.088s | 0.036s |

Tokio is over 5× slower than RingCore here. Why? Tokio doesn't use io_uring for file I/O by default; it offloads blocking file reads to a thread pool, which adds significant overhead. RingCore uses true async kernel operations.

Networking: sequential and concurrent requests

| Test Case | std (threaded) | Tokio (epoll) | RingCore (io_uring) |
|---|---|---|---|
| 100 sequential requests | 12.8ms | 14.9ms | 7.5ms |
| 1,000 concurrent requests | 48.3ms | 1,080ms | 67.9ms |

The 1,000-request stress test is the eye-opener. Tokio takes over a second as its scheduling and wakeup overhead piles up at this scale. RingCore handles all of it on a single thread, with the kernel doing the heavy lifting.

Advanced: kernel-level task chaining

Using IOSQE_IO_LINK, RingCore chains dependent operations (like Read → Write) so the kernel executes them back-to-back without ever returning to userspace. One io_uring_enter call. Zero ping-pong.
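Mechanically, linking is just a flag bit set on every SQE in the chain except the last. A mock sketch of the idea (toy Sqe struct, not RingCore's actual types; the constant values come from the Linux uapi headers):

```rust
// From include/uapi/linux/io_uring.h: IOSQE_IO_LINK_BIT = 2.
const IOSQE_IO_LINK: u8 = 1 << 2;

// Illustrative stand-in for the real 64-byte io_uring_sqe struct.
#[derive(Default)]
struct Sqe {
    opcode: u8,
    flags: u8,
}

// Mark every SQE except the last so the kernel runs them in order,
// cancelling the rest of the chain if one of them fails.
fn link_chain(chain: &mut [Sqe]) {
    let n = chain.len();
    for sqe in chain.iter_mut().take(n.saturating_sub(1)) {
        sqe.flags |= IOSQE_IO_LINK;
    }
}

fn main() {
    // A Read → Write pair, as in RingCore's linked_cat example.
    // Opcode values are also from the uapi headers.
    let mut chain = [
        Sqe { opcode: 22, ..Default::default() }, // IORING_OP_READ
        Sqe { opcode: 23, ..Default::default() }, // IORING_OP_WRITE
    ];
    link_chain(&mut chain);
    assert_eq!(chain[0].flags & IOSQE_IO_LINK, IOSQE_IO_LINK);
    assert_eq!(chain[1].flags & IOSQE_IO_LINK, 0);
}
```

An SQE with the flag set is linked to the one submitted after it, which is why the last entry in the chain is left unflagged.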


The Mental Model That Changes Everything

Here's what building RingCore made concrete for me, the thing no tutorial made clear before:

When you .await something in Rust, you're saying:

"I'm not ready yet. Here's my callback (the Waker). Come get me when something changes."

The executor moves on to other tasks. The kernel works in the background. When the kernel is done, it writes a CQE. The executor reads it, finds the matching Waker in the map, and calls it. Your task wakes up and continues from where it left off.

That's the entire model. RingCore makes every step of it visible, and there's no layer you can't read.


What's in the Repo

Examples are organized into four tiers so you can explore progressively:

Tier 1: Proving the runtime

cargo run --example echo          # Chained Accept → Read → Write
cargo run --example cat -- <file> # File I/O in isolation
cargo run --example timer         # Task parking and waking without I/O

Tier 2: The async model

cargo run --example concurrent_downloads  # 100 SQEs submitted simultaneously
cargo run --example timeout_race          # Operation cancellation via IORING_OP_ASYNC_CANCEL

Tier 3: Real workloads

cargo run --example http_server   # High-concurrency "Hello World"
cargo run --example file_server   # Serving static files over TCP

Tier 4: Advanced features

sudo cargo run --example sqpoll         # Kernel-side SQ polling (needs CAP_SYS_ADMIN)
cargo run --example linked_cat -- <file> # Chained Read + Write at kernel level
cargo run --example multishot_accept    # One SQE → infinite connection CQEs

Start with echo, trace through the source, and you'll have a complete mental model of async I/O in about an afternoon.


Requirements

  • Linux 5.10+ for stable IORING_OP_ACCEPT support
  • x86_64 architecture
  • Dependencies: libc and std only
[dependencies]
ringcore = "0.1.0"

Why Build This Instead of Just Using Tokio?

Tokio is the right choice for production. I'm not suggesting you replace it.

But if you've ever stared at a select! macro, a JoinHandle, or a .await and wondered what is actually happening in the kernel right now, building something like RingCore is the answer.

I'm not intimidated by async Rust anymore. Not because it got simpler, but because I can now see every moving part. The abstraction didn't disappear; I just understand what it's abstracting.


This is Part of a Series

RingCore isn't the first time I've gone down this rabbit hole. A few weeks ago I also built a container engine in Rust that starts in 10ms, cracking open Linux namespaces, cgroups, and clone() syscalls along the way.

The two projects rhyme. With the container engine I asked: what actually happens when you run a container? With RingCore I asked: what actually happens when you .await?

Both answers live in the kernel. Both are learnable. The best way to demystify them is to build a tiny, intentionally incomplete version yourself.


Links

If this sparked any curiosity about systems programming, async I/O, or Rust internals, that was the whole point. Issues and PRs are very welcome.
