In this post, you will learn a fundamental programming pattern in low-latency, high-performance engineering known as the Single Producer Single Consumer Atomic Ring Buffer.
But first, why do we even need one? The "obvious" way to share data between two threads in Rust is with a Mutex.
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// One shared queue; every send and every recv must take this lock first.
let queue = Arc::new(Mutex::new(VecDeque::new()));
The problem is contention. Every time the producer wants to send, it has to lock the mutex. If the consumer is currently holding that lock, the producer has to wait. This is a "stop-the-world" event that kills low-latency performance.
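To make that contention concrete, here is a toy producer/consumer pair built on the Mutex approach (a sketch of my own, not code from the repo); every push_back and pop_front has to win the lock before it can touch the queue:

use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let queue = Arc::new(Mutex::new(VecDeque::new()));
    let producer_q = Arc::clone(&queue);

    let producer = thread::spawn(move || {
        for i in 0..1_000 {
            // Blocks whenever the consumer is holding the lock.
            producer_q.lock().unwrap().push_back(i);
        }
    });

    let mut received = 0;
    while received < 1_000 {
        // Blocks whenever the producer is holding the lock.
        if let Some(_item) = queue.lock().unwrap().pop_front() {
            received += 1;
        }
    }
    producer.join().unwrap();
}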
We can do better. We can build a "lock-free" queue. To do this, we'll build it from scratch using atomics.
The Blueprint: The SPSC Ring Buffer.
Our data structure looks like this:
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::AtomicUsize;
// `CachePadded` is e.g. crossbeam_utils::CachePadded, or any #[repr(align(64))] wrapper (see below).

pub struct RingBuffer<T> {
    buffer: Box<[UnsafeCell<MaybeUninit<T>>]>, // the slots: written by the producer, read by the consumer
    cap: usize,                                // capacity, forced to a power of two
    head: CachePadded<AtomicUsize>,            // write counter: only the producer stores to this
    tail: CachePadded<AtomicUsize>,            // read counter: only the consumer stores to this
}
The core idea is a circular buffer (a ring) with two "pointers":
head: The "write" pointer. Only the producer can touch this.
tail: The "read" pointer. Only the consumer can touch this.
This SPSC model is the key: by giving head and tail exclusive "owners", we've already eliminated the main source of contention!
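To make the ownership split concrete, here is a rough sketch of how the two counters get compared. It assumes head and tail are ever-increasing counters that later get masked into the ring, which is how the send/recv code further down behaves; the helper names are mine, not from the repo:

fn is_empty(head: usize, tail: usize) -> bool {
    head == tail // the consumer has caught up with the producer
}

fn is_full(head: usize, tail: usize, cap: usize) -> bool {
    head.wrapping_sub(tail) == cap // the producer is a full lap ahead
}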
But the design creates two new and very subtle problems.
Dragon #1: The "False Sharing" Performance Disaster
You probably noticed the CachePadded type. This is the single most important performance optimisation in the whole struct.
What is "False Sharing"?
A CPU doesn't read from RAM byte-by-byte. It reads in 64-byte (or 128-byte) chunks called cache lines.
Analogy: Your CPU is a worker at a workbench (L1 Cache). RAM is a giant warehouse. The worker doesn't walk to the warehouse to get one screw (head). It grabs the entire toolbox drawer (the 64-byte cache line) that the screw is in.
The disaster:
Your head (8 bytes) and tail (8 bytes) are small. Without padding, the compiler will lay them out right next to each other in memory, almost certainly on the same cache line.
Now watch what happens:
Producer (Core 1) wants to write to head. It grabs the whole cache line into its L1 cache.
Consumer (Core 2) wants to write to tail. It grabs the same whole cache line into its L1 cache.
Producer writes to head. This "dirties" its cache line and invalidates the cache line on Core 2.
Consumer now has to stall, wait, and re-fetch the entire cache line from main memory, just to write to tail.
This "cache-line ping-pong" can make your "lock-free" queue slower than a Mutex.
The solution:
CachePadded pads each counter out to a full cache line (roughly 56 bytes of padding after each 8-byte counter), forcing head and tail to live on different cache lines.
Cache Line 1: [ head (8b) | ... 56 bytes of padding ... ]
Cache Line 2: [ tail (8b) | ... 56 bytes of padding ... ]
Now, the producer and consumer can work on head and tail in parallel with zero hardware-level contention. We've solved the "false sharing" problem.
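If you'd rather not reach for a crate, a minimal hand-rolled stand-in for CachePadded is just an alignment attribute. This toy program (the CacheAligned name is mine) checks that the padding really happens on a 64-bit target:

use std::sync::atomic::AtomicUsize;

#[repr(align(64))] // force the wrapped value onto its own 64-byte cache line
pub struct CacheAligned<T>(pub T);

fn main() {
    // The bare counter is 8 bytes; aligned, it occupies a full cache line.
    assert_eq!(std::mem::size_of::<AtomicUsize>(), 8);
    assert_eq!(std::mem::size_of::<CacheAligned<AtomicUsize>>(), 64);
    println!("head and tail would now live on different cache lines");
}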
Dragon #2: The "Lying" CPU (Memory Reordering)
The second, and more dangerous, challenge is memory ordering.
The compiler and the CPU are both allowed to re-order your instructions to make them run faster. This is usually fine, but in concurrent code, it's a nightmare.
Consider our send code. Logically, it must do this:
buffer[idx] = data; // Write the data
self.head.store(head + 1); // "Publish" that data is ready
The Nightmare Reordering:
What if the CPU reorders these steps?
self.head.store(head + 1); // Publish!
Consumer (Core 2) wakes up, sees the new head!
Consumer reads from buffer[idx]... and gets garbage data!
buffer[idx] = data; // ...Producer writes the real data, too late.
The solution: The Acquire/Release Handshake
We fix this by telling the CPU "NO, you cannot reorder these things." We use std::sync::atomic::Ordering.
Release (A command): "Do not reorder any writes from before this point to after this point."
Acquire (A promise): "If I see a new value, I am guaranteed to also see all memory writes that came before the Release that published it."
We create a "synchronises with" relationship.
Producer's send:
- Write the data:
unsafe { (*slot_ptr).write(item); }
- Publish the write using Release:
self.head.store(head.wrapping_add(1), Ordering::Release);
Consumer's recv:
- Check for new data using Acquire:
let head = self.head.load(Ordering::Acquire);
- If head > tail, we are guaranteed to see the data:
let item = unsafe { (*slot_ptr).assume_init_read() };
This Release-Acquire pair creates a "one-way wall," ensuring the consumer never reads data before the producer has written it.
(We do the exact same thing in reverse for the tail pointer to tell the producer that a slot has been freed!)
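Putting the two halves together, here is a sketch of what complete send and recv methods might look like. To be clear, this is my reconstruction from the fragments above (assuming the struct shown earlier, a power-of-two cap, and crossbeam_utils::CachePadded), not the repo's exact code:

use std::sync::atomic::Ordering;

// The producer and consumer live on different threads, so the buffer must be Sync.
// This is sound only because each counter has exactly one writer (the SPSC contract).
unsafe impl<T: Send> Sync for RingBuffer<T> {}

impl<T> RingBuffer<T> {
    pub fn send(&self, item: T) -> Result<(), T> {
        let head = self.head.load(Ordering::Relaxed); // only the producer writes head
        let tail = self.tail.load(Ordering::Acquire); // see slots the consumer has freed
        if head.wrapping_sub(tail) == self.cap {
            return Err(item);                         // ring is full: hand the item back
        }
        let slot = self.buffer[head & (self.cap - 1)].get();
        unsafe { (*slot).write(item) };               // 1) write the payload...
        self.head.store(head.wrapping_add(1), Ordering::Release); // 2) ...then publish it
        Ok(())
    }

    pub fn recv(&self) -> Option<T> {
        let tail = self.tail.load(Ordering::Relaxed); // only the consumer writes tail
        let head = self.head.load(Ordering::Acquire); // see data the producer has published
        if head == tail {
            return None;                              // ring is empty
        }
        let slot = self.buffer[tail & (self.cap - 1)].get();
        let item = unsafe { (*slot).assume_init_read() }; // 1) copy the payload out...
        self.tail.store(tail.wrapping_add(1), Ordering::Release); // 2) ...then free the slot
        Some(item)
    }
}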
With the two dragons slain, the rest is just "good engineering."
The & Trick (Power-of-2):
We force the buffer's capacity to be a power of 2 (e.g., 1024). Why?
Slow: head % 1024 (Modulo/division is very slow)
Fast: head & (1024 - 1) (Bitwise-AND is lightning fast)
Rounding the requested capacity up with cap.next_power_of_two() is what makes that mask valid, letting us replace a slow arithmetic operation with a single, fast bitwise one.
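For completeness, a constructor that applies this rounding might look roughly like this (again a sketch under the same assumptions as the send/recv above; the repo's new() may differ):

impl<T> RingBuffer<T> {
    pub fn new(requested_cap: usize) -> Self {
        let cap = requested_cap.next_power_of_two(); // round up so `& (cap - 1)` wraps correctly
        let buffer = (0..cap)
            .map(|_| UnsafeCell::new(MaybeUninit::uninit())) // every slot starts uninitialized
            .collect::<Vec<_>>()
            .into_boxed_slice();
        Self {
            buffer,
            cap,
            head: CachePadded::new(AtomicUsize::new(0)),
            tail: CachePadded::new(AtomicUsize::new(0)),
        }
    }
}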
UnsafeCell & MaybeUninit:
UnsafeCell: This is Rust's "interior mutability" escape hatch. It's what lets us write to the buffer from an immutable &self reference.
MaybeUninit: We use this to avoid the overhead of Option. It allows us to treat the buffer as raw, uninitialized memory and safely write Ts into it and assume_init_read() them back out.
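A standalone toy example of that slot type, outside the ring buffer, shows how the pair works together (illustrative only):

use std::cell::UnsafeCell;
use std::mem::MaybeUninit;

fn main() {
    // One "slot", exactly the type stored inside the buffer.
    let slot: UnsafeCell<MaybeUninit<String>> = UnsafeCell::new(MaybeUninit::uninit());
    // UnsafeCell::get hands back a *mut even though `slot` is not declared mut.
    unsafe { (*slot.get()).write(String::from("hello")) };
    // Sound only because we know a value was written above.
    let s = unsafe { (*slot.get()).assume_init_read() };
    assert_eq!(s, "hello");
}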
The Drop Trait:
We must implement Drop to manually "drop" any Ts left in the buffer when the RingBuffer goes out of scope. This prevents memory leaks.
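One minimal way to meet that requirement, reusing the recv sketched earlier, is to drain the ring in Drop (the real implementation may walk the occupied slots directly instead):

impl<T> Drop for RingBuffer<T> {
    fn drop(&mut self) {
        // Pop and drop every item the consumer never received.
        while self.recv().is_some() {}
    }
}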
The Final Code & Proof
Here is the link to the full, commented source code on Gist/GitHub.
https://github.com/aodr3w/llt-rs/blob/main/src/ring_buffer
And does it work? Yes. We can run cargo test and see our multi-threaded test pass, transferring 1,000,000 items between two threads with zero locks, zero data-races, and maximum performance.
$ cargo test
...
running 4 tests
test tests::test_full_and_empty ... ok
test tests::test_single_thread_send_recv ... ok
test tests::test_drop_cleanup ... ok
test tests::test_multi_thread_spsc ... ok
test result: ok. 4 passed; 0 failed; ...
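If you want to exercise the sketches from this post rather than the repo, a hypothetical end-to-end check in the same spirit as test_multi_thread_spsc might look like this (the main thread plays the consumer, the spawned thread the producer):

use std::sync::Arc;
use std::thread;

fn main() {
    let rb = Arc::new(RingBuffer::<u64>::new(1024));
    let (producer_rb, consumer_rb) = (Arc::clone(&rb), rb);

    let producer = thread::spawn(move || {
        for i in 0..1_000_000u64 {
            let mut item = i;
            // Spin until a slot frees up; the ring only holds 1024 items at a time.
            while let Err(rejected) = producer_rb.send(item) {
                item = rejected;
            }
        }
    });

    let mut expected = 0u64;
    while expected < 1_000_000 {
        if let Some(v) = consumer_rb.recv() {
            assert_eq!(v, expected); // items must arrive in FIFO order
            expected += 1;
        }
    }
    producer.join().unwrap();
    println!("transferred 1,000,000 items with no locks");
}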
What's Next?
We've built a blazing-fast, non-blocking queue. But what if the queue is empty? Our recv just returns None, forcing the consumer to spin in a hot loop.
In the next post, we'll use this RingBuffer primitive to build a higher-level, blocking SPSC Channel that can efficiently put a thread to sleep with Condvar when there's no work to do.
See you then :-)