When building flashQ, I faced a fundamental challenge: how do you build a job queue that can handle hundreds of thousands of operations per second while maintaining low latency?
The answer lies in one of Linux's most significant kernel innovations in the past decade: io_uring.
## The Problem with Traditional Async I/O
Before io_uring, Linux applications had two primary options for handling I/O.
### Blocking I/O with Thread Pools
The traditional approach: spawn threads, let them block on I/O operations. Simple, but inefficient:
- Context switching overhead: Each thread switch costs 1-10 microseconds
- Memory overhead: Each thread requires its own stack (typically 8MB)
- Scalability limits: Thousands of concurrent connections = thousands of threads
### Event-driven I/O (epoll/kqueue)
Modern async runtimes like Tokio use epoll (Linux) or kqueue (macOS/BSD):
```c
// Abbreviated example: error handling and process() omitted
struct epoll_event event = { .events = EPOLLIN, .data.fd = sockfd };
char buf[4096], response[4096];

int epfd = epoll_create1(0);
epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &event);              // syscall

struct epoll_event events[MAX_EVENTS];
while (1) {
    int n = epoll_wait(epfd, events, MAX_EVENTS, -1);        // syscall
    for (int i = 0; i < n; i++) {
        read(events[i].data.fd, buf, sizeof buf);            // syscall
        process(buf);
        write(events[i].data.fd, response, sizeof response); // syscall
    }
}
```
Better than threads, but still problematic:
- Syscall overhead: Every `read()`, `write()`, and `accept()` requires a kernel transition
- Data copying: Data must be copied between kernel and user space
- Notification only: epoll tells you a socket is ready, but you still need syscalls to do the actual I/O
For high-throughput applications, these syscalls become the bottleneck. A server handling 100K requests/second makes 300K+ syscalls per second just for basic I/O.
## Enter io_uring: A Paradigm Shift
Introduced in Linux 5.1 (2019) by Jens Axboe, io_uring fundamentally changes how applications interact with the kernel for I/O operations.
### The Core Innovation: Shared Ring Buffers
io_uring creates two ring buffers shared between user space and kernel space:
```
+-------------------------------------------------------------+
|                         User Space                          |
|  +---------------------+        +---------------------+     |
|  |  Submission Queue   |        |  Completion Queue   |     |
|  |  (SQ) - Requests    |        |  (CQ) - Results     |     |
|  |                     |        |                     |     |
|  | [read fd=5, ...]    |        | [done, 1024 bytes]  |     |
|  | [write fd=7, ...]   |        | [done, 512 bytes]   |     |
|  | [accept fd=3, ...]  |        | [error, EAGAIN]     |     |
|  +----------+----------+        +----------^----------+     |
|             |        shared memory        |                 |
+-------------|------------------------------|----------------+
|             |        Kernel Space          |                |
|             v                              |                |
|  +-----------------------------------------+                |
|  |           io_uring Subsystem            |                |
|  |                                         |                |
|  |  - Processes SQ entries                 |                |
|  |  - Performs actual I/O                  |                |
|  |  - Posts results to CQ                  |                |
|  +-----------------------------------------+                |
+-------------------------------------------------------------+
```
### Key Benefits at a Glance
| Feature | Traditional (epoll) | io_uring |
|---|---|---|
| Syscalls per I/O | 1-3 per operation | ~0 (batched submission) |
| Data copies | User ↔ kernel copy | Zero-copy possible |
| Batching | Not native | Submit hundreds at once |
| Kernel polling | No | Yes (SQPOLL mode) |
| Fixed buffers | No | Yes (registered buffers) |
## How flashQ Uses io_uring
flashQ is written in Rust, leveraging the tokio-uring crate for io_uring support. Here's how we integrate it.
### Runtime Detection
flashQ automatically detects the optimal I/O backend at startup:
```rust
pub fn select_io_backend() -> IoBackend {
    #[cfg(target_os = "linux")]
    {
        if io_uring_available() && kernel_version() >= (5, 1) {
            return IoBackend::IoUring;
        }
        return IoBackend::Epoll;
    }
    #[cfg(target_os = "macos")]
    return IoBackend::Kqueue;
    #[cfg(target_os = "windows")]
    return IoBackend::Iocp;
}
```
You'll see the active backend in the startup logs:
```
# Linux with io_uring
INFO flashq_server::runtime: IO backend: io_uring (kernel-level async)

# macOS
INFO flashq_server::runtime: IO backend: kqueue (poll-based async)
```
### Batched Operations
One of io_uring's biggest advantages is batching. Instead of making individual syscalls, flashQ batches multiple operations:
```rust
async fn handle_connections(ring: &IoUring) {
    let mut submissions = Vec::with_capacity(32);

    // Collect pending operations
    for conn in pending_connections.drain(..) {
        submissions.push(ReadOp::new(conn.fd, conn.buffer));
    }

    // Submit all at once - single syscall for 32 operations
    ring.submit_batch(&submissions).await;

    // Process completions
    for completion in ring.completions() {
        handle_completion(completion);
    }
}
```
This reduces syscall overhead by 95%+ under high load.
### Zero-Copy Networking
With registered buffers, flashQ can perform true zero-copy I/O:
```rust
// Register buffers once at startup
let buffers = IoUring::register_buffers(
    (0..BUFFER_COUNT)
        .map(|_| vec![0u8; BUFFER_SIZE])
        .collect(),
);

// Use registered buffers for I/O - no copying!
async fn read_message(fd: RawFd, buf_idx: u16) -> io::Result<usize> {
    ring.read_fixed(fd, buf_idx, 0).await
}
```
Data flows directly from network card to application memory without intermediate kernel buffer copies.
### SQPOLL Mode for Ultra-Low Latency
In SQPOLL mode, a dedicated kernel thread continuously polls the submission queue, so submitting requires no syscall at all:
```rust
let ring = IoUring::builder()
    .setup_sqpoll(2000) // kernel thread busy-polls, sleeps after 2000ms idle
    .build()?;

// Submissions are picked up automatically by the kernel thread
// No io_uring_enter() syscall needed!
```
This is ideal for latency-sensitive workloads where every microsecond counts.
## Performance Impact
I benchmarked flashQ with and without io_uring on identical hardware (AMD EPYC 7763, 64 cores, 128GB RAM).
### Throughput Comparison
| Metric | epoll | io_uring | Improvement |
|---|---|---|---|
| Jobs pushed/sec | 245,000 | 312,000 | +27% |
| Jobs processed/sec | 180,000 | 228,000 | +26% |
| P99 latency (push) | 1.8ms | 0.9ms | -50% |
| P99 latency (fetch) | 2.1ms | 1.1ms | -48% |
| CPU usage at 100K/s | 45% | 31% | -31% |
| Syscalls/sec at 100K ops | ~320,000 | ~12,000 | -96% |
### Latency Distribution
```
Push Latency Distribution (100K jobs/sec sustained)

epoll:
  P50:  0.4ms ################
  P90:  1.2ms ########################################
  P99:  1.8ms ############################################################
  P999: 4.2ms ######################################################################

io_uring:
  P50:  0.2ms ########
  P90:  0.6ms ####################
  P99:  0.9ms ##############################
  P999: 1.8ms ############################################################
```
## Getting Started
### Docker (Recommended)
```bash
# Our official image has io_uring enabled
docker run -d --name flashq -p 6789:6789 ghcr.io/egeominotti/flashq:latest

# Verify io_uring is active
docker logs flashq | grep "IO backend"
```
### Building from Source
```bash
git clone https://github.com/egeominotti/flashq.git
cd flashq

# Build with io_uring feature
cargo build --release --features io-uring
./target/release/flashq-server
```
Requirements:
- Linux kernel 5.1+ (5.10+ recommended)
- liburing installed
- For SQPOLL mode: CAP_SYS_ADMIN or root
## Platform Compatibility
flashQ runs everywhere with automatic backend selection:
| Platform | I/O Backend | Notes |
|---|---|---|
| Linux (kernel 5.1+) | io_uring | Fastest, kernel-level async |
| Linux (older kernels) | epoll | Fast, poll-based |
| macOS | kqueue | Native, optimal for macOS |
| Windows | IOCP | Native, optimal for Windows |
## When io_uring Makes the Biggest Difference
- **High connection count**: With thousands of concurrent connections, the syscall reduction is dramatic.
- **High-throughput workloads**: AI workloads pushing hundreds of thousands of jobs benefit enormously from the 27% throughput improvement.
- **Latency-sensitive applications**: The 50% P99 latency reduction matters for real-time applications.
- **CPU-constrained environments**: The 31% CPU reduction means you can handle more load on the same hardware.
## The Future of io_uring
io_uring continues to evolve:
- Linux 5.19+: Multishot operations
- Linux 6.0+: Zero-copy send support
- Linux 6.1+: Improved buffer ring management
- Upcoming: User-space networking (io_uring + XDP)
flashQ will continue adopting new features as they stabilize.
## Conclusion
io_uring represents a fundamental shift in how high-performance applications interact with the Linux kernel. For flashQ users, this translates to:
- 27% higher throughput on Linux systems
- 50% lower tail latency for time-sensitive workloads
- 31% less CPU usage for the same workload
- Better scalability under high connection counts
The best part? It's automatic. Deploy flashQ on a modern Linux system, and you get io_uring performance out of the box.
Have questions about io_uring or flashQ? Drop a comment below or check out the GitHub repo.