Egeo Minotti
io_uring: How flashQ Achieves Kernel-Level Async I/O Performance

When building flashQ, I faced a fundamental challenge: how do you build a job queue that can handle hundreds of thousands of operations per second while maintaining low latency?

The answer lies in one of Linux's most significant kernel innovations in the past decade: io_uring.

The Problem with Traditional Async I/O

Before io_uring, Linux applications had two primary options for handling I/O.

Blocking I/O with Thread Pools

The traditional approach: spawn threads, let them block on I/O operations. Simple, but inefficient:

  • Context switching overhead: Each thread switch costs 1-10 microseconds
  • Memory overhead: Each thread reserves its own stack (8MB of virtual memory by default on Linux)
  • Scalability limits: Thousands of concurrent connections = thousands of threads
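
For concreteness, here is a minimal thread-per-connection echo server in Rust. This is an illustrative sketch only, not flashQ code (port 6789 just happens to be flashQ's default):

use std::io::{Read, Write};
use std::net::TcpListener;
use std::thread;

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:6789")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        // One OS thread per connection: each thread blocks on read/write,
        // paying a stack reservation plus context switches on every wakeup.
        thread::spawn(move || {
            let mut buf = [0u8; 1024];
            while let Ok(n) = stream.read(&mut buf) {
                if n == 0 { break; }                  // peer closed
                let _ = stream.write_all(&buf[..n]);  // echo back
            }
        });
    }
    Ok(())
}

Every connection costs a thread, and every byte moved costs a blocking syscall.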

Event-driven I/O (epoll/kqueue)

Modern async runtimes like Tokio use epoll (Linux) or kqueue (macOS/BSD):

// (socket setup, buffers, and error handling omitted for brevity)
struct epoll_event event = { .events = EPOLLIN, .data.fd = sockfd };
struct epoll_event events[MAX_EVENTS];

int epfd = epoll_create1(0);                     // syscall
epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &event);  // syscall

while (1) {
    int n = epoll_wait(epfd, events, MAX_EVENTS, -1);  // syscall
    for (int i = 0; i < n; i++) {
        read(events[i].data.fd, buf, len);  // syscall
        process(buf);
        write(events[i].data.fd, response, len);  // syscall
    }
}

Better than threads, but still problematic:

  • Syscall overhead: Every read(), write(), accept() requires a kernel transition
  • Data copying: Data must be copied between kernel and user space
  • Notification only: epoll tells you a socket is ready, but you still need syscalls to do actual I/O

For high-throughput applications, these syscalls become the bottleneck. A server handling 100K requests/second makes 300K+ syscalls per second just for basic I/O.

Enter io_uring: A Paradigm Shift

Introduced in Linux 5.1 (2019) by Jens Axboe, io_uring fundamentally changes how applications interact with the kernel for I/O operations.

The Core Innovation: Shared Ring Buffers

io_uring creates two ring buffers shared between user space and kernel space:

+--------------------------------------------------------------+
|                          User Space                           |
|  +---------------------+      +---------------------+        |
|  |  Submission Queue   |      |  Completion Queue   |        |
|  |  (SQ) - Requests    |      |  (CQ) - Results     |        |
|  |                     |      |                     |        |
|  |  [read fd=5, ...]   |      |  [done, 1024 bytes] |        |
|  |  [write fd=7, ...]  |      |  [done, 512 bytes]  |        |
|  |  [accept fd=3, ...] |      |  [error, EAGAIN]    |        |
|  +----------+----------+      +----------^----------+        |
|             |       shared memory        |                   |
+-------------|----------------------------|-------------------+
|             |        Kernel Space        |                   |
|             v                            |                   |
|  +----------------------------------------+                  |
|  |           io_uring Subsystem           |                  |
|  |   - Processes SQ entries               |                  |
|  |   - Performs actual I/O                |                  |
|  |   - Posts results to CQ                |                  |
|  +----------------------------------------+                  |
+--------------------------------------------------------------+
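In code, one round trip through those rings looks like this with the community io-uring crate (an illustrative sketch, not flashQ internals; the file name and user_data value are arbitrary):

use io_uring::{opcode, types, IoUring};
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    let mut ring = IoUring::new(8)?;  // SQ and CQ with 8 entries each
    let file = std::fs::File::open("data.bin")?;
    let mut buf = vec![0u8; 4096];

    // Describe the read as a Submission Queue Entry (SQE)...
    let sqe = opcode::Read::new(types::Fd(file.as_raw_fd()), buf.as_mut_ptr(), buf.len() as _)
        .build()
        .user_data(42);
    unsafe { ring.submission().push(&sqe).expect("queue full") };

    // ...hand the queue to the kernel and wait for one completion.
    ring.submit_and_wait(1)?;

    // Results come back as Completion Queue Entries (CQEs).
    let cqe = ring.completion().next().expect("completion missing");
    println!("user_data={} result={}", cqe.user_data(), cqe.result());
    Ok(())
}

The key point: the SQE and CQE live in shared memory, so describing work and reading results costs no kernel transition; only submit_and_wait does.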

Key Benefits at a Glance

Feature           Traditional (epoll)   io_uring
----------------  --------------------  --------------------------
Syscalls per I/O  1-3 per operation     ~0 (amortized by batching)
Data copies       User <-> kernel       Zero-copy possible
Batching          Not native            Submit hundreds at once
Kernel polling    No                    Yes (SQPOLL mode)
Fixed buffers     No                    Yes (registered buffers)

How flashQ Uses io_uring

flashQ is written in Rust, leveraging the tokio-uring crate for io_uring support. Here's how we integrate it.

Runtime Detection

flashQ automatically detects the optimal I/O backend at startup:

pub fn select_io_backend() -> IoBackend {
    #[cfg(target_os = "linux")]
    {
        if io_uring_available() && kernel_version() >= (5, 1) {
            return IoBackend::IoUring;
        }
        return IoBackend::Epoll;
    }

    #[cfg(target_os = "macos")]
    return IoBackend::Kqueue;

    #[cfg(target_os = "windows")]
    return IoBackend::Iocp;
}
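The io_uring_available() probe can be as simple as trying to create a tiny ring. A sketch using the io-uring crate (flashQ's actual detection logic may differ):

use io_uring::IoUring;

/// Probe for io_uring support by attempting to create a minimal ring.
/// Fails on kernels without io_uring or where it is disabled (e.g. by seccomp).
fn io_uring_available() -> bool {
    IoUring::new(2).is_ok()
}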

You'll see the active backend in the startup logs:

# Linux with io_uring
INFO flashq_server::runtime: IO backend: io_uring (kernel-level async)

# macOS
INFO flashq_server::runtime: IO backend: kqueue (poll-based async)

Batched Operations

One of io_uring's biggest advantages is batching. Instead of making individual syscalls, flashQ batches multiple operations:

async fn handle_connections(ring: &IoUring, pending: &mut Vec<Connection>) {
    let mut submissions = Vec::with_capacity(32);

    // Collect pending operations
    for conn in pending.drain(..) {
        submissions.push(ReadOp::new(conn.fd, conn.buffer));
    }

    // Submit all at once - a single syscall covers all 32 operations
    ring.submit_batch(&submissions).await;

    // Process completions as the kernel posts them to the CQ
    for completion in ring.completions() {
        handle_completion(completion);
    }
}

This reduces syscall overhead by 95%+ under high load.
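At the ring level, a batch boils down to pushing many SQEs and calling submit once. A sketch against the raw io-uring crate (not flashQ's internal API; it assumes the caller has prepared the fds and buffers):

use io_uring::{opcode, types, IoUring};

// Queue one read per connection, then enter the kernel exactly once.
fn submit_reads(ring: &mut IoUring, fds: &[i32], bufs: &mut [Vec<u8>]) -> std::io::Result<usize> {
    let mut sq = ring.submission();
    for (i, (&fd, buf)) in fds.iter().zip(bufs.iter_mut()).enumerate() {
        let sqe = opcode::Read::new(types::Fd(fd), buf.as_mut_ptr(), buf.len() as _)
            .build()
            .user_data(i as u64);
        unsafe { sq.push(&sqe).expect("submission queue full") };
    }
    drop(sq);      // end the SQ borrow before calling submit()
    ring.submit()  // one io_uring_enter() for the whole batch
}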

Zero-Copy Networking

With registered buffers, flashQ can perform true zero-copy I/O:

// Register buffers once at startup
let buffers = ring.register_buffers(
    (0..BUFFER_COUNT)
        .map(|_| vec![0u8; BUFFER_SIZE])
        .collect(),
);

// Use registered buffers for I/O - no per-call buffer setup!
async fn read_message(ring: &IoUring, fd: RawFd, buf_idx: u16) -> io::Result<usize> {
    ring.read_fixed(fd, buf_idx, 0).await
}

Registered buffers let the kernel pin and map application memory once at startup instead of on every operation, eliminating per-I/O buffer setup and enabling zero-copy paths where the kernel and hardware support them.
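
Under the hood this corresponds to registering buffers with the ring and issuing ReadFixed/WriteFixed opcodes. A rough sketch with the io-uring crate, assuming fd is an open raw descriptor and eliding error handling:

use io_uring::{opcode, types, IoUring};

// Register four 4 KiB buffers, then read into buffer 0 via ReadFixed.
fn read_with_fixed_buffers(fd: i32) -> std::io::Result<()> {
    let mut ring = IoUring::new(32)?;

    // Registered once: the kernel pins these pages for the ring's
    // lifetime, so later I/O skips per-operation buffer mapping.
    let mut bufs: Vec<Vec<u8>> = (0..4).map(|_| vec![0u8; 4096]).collect();
    let iovecs: Vec<libc::iovec> = bufs
        .iter_mut()
        .map(|b| libc::iovec { iov_base: b.as_mut_ptr() as _, iov_len: b.len() })
        .collect();
    unsafe { ring.submitter().register_buffers(&iovecs)? };

    // ReadFixed names a registered buffer by index (last argument)
    // instead of handing the kernel an arbitrary pointer each time.
    let sqe = opcode::ReadFixed::new(types::Fd(fd), bufs[0].as_mut_ptr(), 4096, 0)
        .build()
        .user_data(0);
    unsafe { ring.submission().push(&sqe).expect("queue full") };
    ring.submit_and_wait(1)?;
    Ok(())
}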

SQPOLL Mode for Ultra-Low Latency

In SQPOLL mode, the kernel continuously polls the submission queue without requiring any syscalls:

let ring = IoUring::builder()
    .setup_sqpoll(2000)  // SQ kernel thread idles for 2000ms before sleeping
    .build(256)?;        // ring with 256 entries

// Submissions are picked up automatically by the kernel thread -
// no io_uring_enter() syscall needed while it is awake!

This is ideal for latency-sensitive workloads where every microsecond counts; the trade-off is that the polling kernel thread burns a CPU core even when the queue is idle.

Performance Impact

I benchmarked flashQ with and without io_uring on identical hardware (AMD EPYC 7763, 64 cores, 128GB RAM).

Throughput Comparison

Metric                    epoll      io_uring   Improvement
------------------------  ---------  ---------  -----------
Jobs pushed/sec           245,000    312,000    +27%
Jobs processed/sec        180,000    228,000    +26%
P99 latency (push)        1.8ms      0.9ms      -50%
P99 latency (fetch)       2.1ms      1.1ms      -48%
CPU usage at 100K ops/s   45%        31%        -31%
Syscalls/sec at 100K ops  ~320,000   ~12,000    -96%

Latency Distribution

Push Latency Distribution (100K jobs/sec sustained)

epoll:
  P50:  0.4ms  ################
  P90:  1.2ms  ########################################
  P99:  1.8ms  ############################################################
  P999: 4.2ms  ######################################################################

io_uring:
  P50:  0.2ms  ########
  P90:  0.6ms  ####################
  P99:  0.9ms  ##############################
  P999: 1.8ms  ############################################################

Getting Started

Docker (Recommended)

# Our official image has io_uring enabled
docker run -d --name flashq -p 6789:6789 ghcr.io/egeominotti/flashq:latest

# Verify io_uring is active
docker logs flashq | grep "IO backend"

Building from Source

git clone https://github.com/egeominotti/flashq.git
cd flashq

# Build with io_uring feature
cargo build --release --features io-uring

./target/release/flashq-server

Requirements:

  • Linux kernel 5.1+ (5.10+ recommended)
  • liburing installed
  • For SQPOLL mode: CAP_SYS_ADMIN or root (kernels before 5.11; 5.11+ allows unprivileged SQPOLL)

Platform Compatibility

flashQ runs everywhere with automatic backend selection:

Platform               I/O Backend  Notes
---------------------  -----------  ---------------------------
Linux (kernel 5.1+)    io_uring     Fastest, kernel-level async
Linux (older kernels)  epoll        Fast, poll-based
macOS                  kqueue       Native, optimal for macOS
Windows                IOCP         Native, optimal for Windows

When io_uring Makes the Biggest Difference

High Connection Count - With thousands of concurrent connections, the syscall reduction is dramatic.

High Throughput Workloads - AI workloads pushing hundreds of thousands of jobs benefit enormously from the 27% throughput improvement.

Latency-Sensitive Applications - The 50% P99 latency reduction matters for real-time applications.

CPU-Constrained Environments - The 31% CPU reduction means you can handle more load on the same hardware.

The Future of io_uring

io_uring continues to evolve:

  • Linux 5.19+: Multishot operations
  • Linux 6.0+: Zero-copy send support
  • Linux 6.1+: Improved buffer ring management
  • Upcoming: User-space networking (io_uring + XDP)

flashQ will continue adopting new features as they stabilize.

Conclusion

io_uring represents a fundamental shift in how high-performance applications interact with the Linux kernel. For flashQ users, this translates to:

  • 27% higher throughput on Linux systems
  • 50% lower tail latency for time-sensitive workloads
  • 31% less CPU usage for the same workload
  • Better scalability under high connection counts

The best part? It's automatic. Deploy flashQ on a modern Linux system, and you get io_uring performance out of the box.


Have questions about io_uring or flashQ? Drop a comment below or check out the GitHub repo.
