When building flashQ, I faced a fundamental challenge: how do you build a job queue that can handle hundreds of thousands of operations per second while maintaining low latency?
The answer lies in one of Linux's most significant kernel innovations in the past decade: io_uring.
## The Problem with Traditional Async I/O
Before io_uring, Linux applications had two primary options for handling I/O.
### Blocking I/O with Thread Pools
The traditional approach: spawn threads, let them block on I/O operations. Simple, but inefficient:
- Context switching overhead: Each thread switch costs 1-10 microseconds
- Memory overhead: Each thread requires its own stack (typically 8MB)
- Scalability limits: Thousands of concurrent connections = thousands of threads
### Event-driven I/O (epoll/kqueue)
Modern async runtimes like Tokio use epoll (Linux) or kqueue (macOS/BSD):
```c
// Abbreviated example: error handling and process() omitted
struct epoll_event event = { .events = EPOLLIN, .data.fd = sockfd };
char buf[4096], response[4096];

int epfd = epoll_create1(0);
epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &event);              // syscall

struct epoll_event events[MAX_EVENTS];
while (1) {
    int n = epoll_wait(epfd, events, MAX_EVENTS, -1);        // syscall
    for (int i = 0; i < n; i++) {
        read(events[i].data.fd, buf, sizeof buf);            // syscall
        process(buf);
        write(events[i].data.fd, response, sizeof response); // syscall
    }
}
```
Better than threads, but still problematic:
- Syscall overhead: Every `read()`, `write()`, and `accept()` requires a kernel transition
- Data copying: Data must be copied between kernel and user space
- Notification only: epoll tells you a socket is ready, but you still need syscalls to do the actual I/O
For high-throughput applications, these syscalls become the bottleneck. A server handling 100K requests/second makes 300K+ syscalls per second just for basic I/O.
## Enter io_uring: A Paradigm Shift
Introduced in Linux 5.1 (2019) by Jens Axboe, io_uring fundamentally changes how applications interact with the kernel for I/O operations.
### The Core Innovation: Shared Ring Buffers
io_uring creates two ring buffers shared between user space and kernel space:
```
+-------------------------------------------------------------+
|                         User Space                          |
|  +---------------------+        +---------------------+     |
|  |  Submission Queue   |        |  Completion Queue   |     |
|  |  (SQ) - Requests    |        |  (CQ) - Results     |     |
|  |                     |        |                     |     |
|  | [read fd=5, ...]    |        | [done, 1024 bytes]  |     |
|  | [write fd=7, ...]   |        | [done, 512 bytes]   |     |
|  | [accept fd=3, ...]  |        | [error, EAGAIN]     |     |
|  +----------+----------+        +----------^----------+     |
|             |        shared memory        |                 |
+-------------|------------------------------|----------------+
|             |        Kernel Space          |                |
|             v                              |                |
|  +-----------------------------------------+                |
|  |           io_uring Subsystem            |                |
|  |                                         |                |
|  |  - Processes SQ entries                 |                |
|  |  - Performs actual I/O                  |                |
|  |  - Posts results to CQ                  |                |
|  +-----------------------------------------+                |
+-------------------------------------------------------------+
```
### Key Benefits at a Glance
| Feature | Traditional (epoll) | io_uring |
|---|---|---|
| Syscalls per I/O | 1-3 per operation | ~0 (batched submission) |
| Data copies | User ↔ kernel copy | Zero-copy possible |
| Batching | Not native | Submit hundreds at once |
| Kernel polling | No | Yes (SQPOLL mode) |
| Fixed buffers | No | Yes (registered buffers) |
## How flashQ Uses io_uring
flashQ is written in Rust, leveraging the tokio-uring crate for io_uring support. Here's how we integrate it.
### Runtime Detection
flashQ automatically detects the optimal I/O backend at startup:
```rust
pub fn select_io_backend() -> IoBackend {
    #[cfg(target_os = "linux")]
    {
        if io_uring_available() && kernel_version() >= (5, 1) {
            return IoBackend::IoUring;
        }
        return IoBackend::Epoll;
    }
    #[cfg(target_os = "macos")]
    return IoBackend::Kqueue;
    #[cfg(target_os = "windows")]
    return IoBackend::Iocp;
}
```
You'll see the active backend in the startup logs:
```
# Linux with io_uring
INFO flashq_server::runtime: IO backend: io_uring (kernel-level async)

# macOS
INFO flashq_server::runtime: IO backend: kqueue (poll-based async)
```
### Batched Operations
One of io_uring's biggest advantages is batching. Instead of making individual syscalls, flashQ batches multiple operations:
```rust
async fn handle_connections(ring: &IoUring) {
    let mut submissions = Vec::with_capacity(32);

    // Collect pending operations
    for conn in pending_connections.drain(..) {
        submissions.push(ReadOp::new(conn.fd, conn.buffer));
    }

    // Submit all at once - single syscall for 32 operations
    ring.submit_batch(&submissions).await;

    // Process completions
    for completion in ring.completions() {
        handle_completion(completion);
    }
}
```
This reduces syscall overhead by 95%+ under high load.
### Zero-Copy Networking
With registered buffers, flashQ can perform true zero-copy I/O:
```rust
// Register buffers once at startup
let buffers = IoUring::register_buffers(
    (0..BUFFER_COUNT)
        .map(|_| vec![0u8; BUFFER_SIZE])
        .collect(),
);

// Use registered buffers for I/O - no copying!
async fn read_message(fd: RawFd, buf_idx: u16) -> io::Result<usize> {
    ring.read_fixed(fd, buf_idx, 0).await
}
```
Data flows directly from network card to application memory without intermediate kernel buffer copies.
### SQPOLL Mode for Ultra-Low Latency
In SQPOLL mode, a dedicated kernel thread continuously polls the submission queue, so submitting requires no syscall at all:
```rust
let ring = IoUring::builder()
    .setup_sqpoll(2000) // kernel thread busy-polls, sleeps after 2000ms idle
    .build()?;

// Submissions are picked up automatically by the kernel thread
// No io_uring_enter() syscall needed!
```
This is ideal for latency-sensitive workloads where every microsecond counts.
## Performance Impact
I benchmarked flashQ with and without io_uring on identical hardware (AMD EPYC 7763, 64 cores, 128GB RAM).
### Throughput Comparison
| Metric | epoll | io_uring | Improvement |
|---|---|---|---|
| Jobs pushed/sec | 245,000 | 312,000 | +27% |
| Jobs processed/sec | 180,000 | 228,000 | +26% |
| P99 latency (push) | 1.8ms | 0.9ms | -50% |
| P99 latency (fetch) | 2.1ms | 1.1ms | -48% |
| CPU usage at 100K/s | 45% | 31% | -31% |
| Syscalls/sec at 100K ops | ~320,000 | ~12,000 | -96% |
### Latency Distribution
```
Push Latency Distribution (100K jobs/sec sustained)

epoll:
  P50:  0.4ms ################
  P90:  1.2ms ########################################
  P99:  1.8ms ############################################################
  P999: 4.2ms ######################################################################

io_uring:
  P50:  0.2ms ########
  P90:  0.6ms ####################
  P99:  0.9ms ##############################
  P999: 1.8ms ############################################################
```
## Getting Started
### Docker (Recommended)
```bash
# Our official image has io_uring enabled
docker run -d --name flashq -p 6789:6789 ghcr.io/egeominotti/flashq:latest

# Verify io_uring is active
docker logs flashq | grep "IO backend"
```
### Building from Source
```bash
git clone https://github.com/egeominotti/flashq.git
cd flashq

# Build with io_uring feature
cargo build --release --features io-uring
./target/release/flashq-server
```
Requirements:
- Linux kernel 5.1+ (5.10+ recommended)
- liburing installed
- For SQPOLL mode: CAP_SYS_ADMIN or root
## Platform Compatibility
flashQ runs everywhere with automatic backend selection:
| Platform | I/O Backend | Notes |
|---|---|---|
| Linux (kernel 5.1+) | io_uring | Fastest, kernel-level async |
| Linux (older kernels) | epoll | Fast, poll-based |
| macOS | kqueue | Native, optimal for macOS |
| Windows | IOCP | Native, optimal for Windows |
## When io_uring Makes the Biggest Difference
- **High connection count**: With thousands of concurrent connections, the syscall reduction is dramatic.
- **High-throughput workloads**: AI workloads pushing hundreds of thousands of jobs benefit enormously from the 27% throughput improvement.
- **Latency-sensitive applications**: The 50% P99 latency reduction matters for real-time applications.
- **CPU-constrained environments**: The 31% CPU reduction means you can handle more load on the same hardware.
## The Future of io_uring
io_uring continues to evolve:
- Linux 5.19+: Multishot operations
- Linux 6.0+: Zero-copy send support
- Linux 6.1+: Improved buffer ring management
- Upcoming: User-space networking (io_uring + XDP)
flashQ will continue adopting new features as they stabilize.
## Conclusion
io_uring represents a fundamental shift in how high-performance applications interact with the Linux kernel. For flashQ users, this translates to:
- 27% higher throughput on Linux systems
- 50% lower tail latency for time-sensitive workloads
- 31% less CPU usage for the same workload
- Better scalability under high connection counts
The best part? It's automatic. Deploy flashQ on a modern Linux system, and you get io_uring performance out of the box.
Have questions about io_uring or flashQ? Drop a comment below or check out the GitHub repo.