The Origin Story: It Started With a "New Toy"
About six months ago, I was sitting at my desk with a serious case of coder’s block. I didn’t want to grind through existing bugs; I needed a win. A dopamine hit.
I also happened to be staring at a new (to me) monster of a workstation: 4.3GHz i7, 64GB RAM, Quadro GPU. It was a beast, and I realized I had never actually pushed it. I wanted to see the metal glow. I wanted to make the fans spin.
So I set a challenge: Write the absolute fastest, most CPU-punishing code possible in C++.
Math is heavy, so I landed on prime generation. I found a paper on optimizing the Sieve of Eratosthenes, dug into SIMD intrinsics (AVX2/AVX-512), and spent a frantic weekend obsessing over cache lines and cycle counts. I managed to beat the standard library implementations, watched my CPU hit 100% usage... and then, just like that, the dopamine faded.
"Cool. Next."
I shelved the code and forgot about it.
The Flashback
Fast forward to 3 days ago. I was compiling a Rust project, watching the dependencies scroll by.
Compiling rand...
Compiling primes...
Wait.
It hit me like a brick. I had an entire codebase of highly optimized, battle-tested prime generation logic sitting in a dusty C++ repo. I had already solved the hard parts—the bit-twiddling, the cache-locality, the memory layout.
Why wasn't I using it in Rust?
The Frenzy: Porting to Rust
The fear of losing the idea before I could implement it kicked in. I launched into a 12-hour coding fugue state.
The goal wasn't just a port; it was a reimagining. In C++, I was raw-dogging pointers. In Rust, I wanted that same performance but with safety and better ergonomics.
I ended up with Primer—a crate that is:
Bit-Packed: Uses 1 bit per odd number (effectively 0.5 bits per number).
Odd-Only: We hardcode 2 and ignore even numbers entirely.
Intrinsics-Powered: Uses trailing_zeros (compiles to tzcnt on x86) to scan the sieve instantly.
Kernighan-Iterated: Clears the lowest set bit with w & (w - 1), so iteration jumps straight from prime to prime and never touches runs of zeros.
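To make those four bullets concrete, here is a minimal, single-threaded sketch of the same tricks (bit-packing, odd-only storage, trailing_zeros scanning, Kernighan iteration). This is my illustration, not the Primer crate's actual code, and it omits the segmentation and SIMD work:

```rust
/// Bit-packed, odd-only sieve: bit i in `words` stands for the odd number 2*i + 3.
/// (2 is handled separately; even numbers are never stored.)
fn sieve_odd(limit: usize) -> Vec<u64> {
    let n_bits = if limit >= 3 { (limit - 1) / 2 } else { 0 };
    let mut words = vec![u64::MAX; (n_bits + 63) / 64];
    // Clear the padding bits in the last word so they never read as "prime".
    if n_bits % 64 != 0 {
        if let Some(last) = words.last_mut() {
            *last &= (1u64 << (n_bits % 64)) - 1;
        }
    }
    let mut i = 0;
    while (2 * i + 3) * (2 * i + 3) <= limit {
        if words[i / 64] >> (i % 64) & 1 == 1 {
            let p = 2 * i + 3;
            // The first composite worth crossing off is p*p; its bit index is (p*p - 3) / 2.
            let mut j = (p * p - 3) / 2;
            while j < n_bits {
                words[j / 64] &= !(1u64 << (j % 64));
                j += p; // next odd multiple of p: 2p away in value, p in bit index
            }
        }
        i += 1;
    }
    words
}

/// Collect primes by scanning set bits with trailing_zeros (tzcnt on x86)
/// and dropping the lowest set bit with Kernighan's w & (w - 1) trick.
fn primes_up_to(limit: usize) -> Vec<usize> {
    let mut out = Vec::new();
    if limit >= 2 {
        out.push(2);
    }
    for (wi, &word) in sieve_odd(limit).iter().enumerate() {
        let mut w = word;
        while w != 0 {
            let bit = w.trailing_zeros() as usize; // jump straight to the next prime
            out.push(2 * (wi * 64 + bit) + 3);
            w &= w - 1; // Kernighan: clear the lowest set bit, skipping zero runs
        }
    }
    out
}

fn main() {
    println!("{:?}", primes_up_to(50));
}
```

The payoff of the Kernighan loop is that sparse words cost almost nothing: a word with two primes in it takes two iterations, not sixty-four.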
The Result: 64x Faster & 95x Smaller
I didn't expect the Rust compiler to play this nice, but the results blew me away.
Benchmark (n = 50,000,000):
Standard Vec<bool> implementation: ~47 MB RAM
Primer (Segmented Buffer): ~32 KB RAM
Speed: Up to 64x faster than the primes crate in bulk generation.
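The memory numbers fall out of simple arithmetic. Here is a rough back-of-the-envelope sketch (the 32 KB segment size is the figure from above; the rest follows from the layout):

```rust
fn main() {
    let n: usize = 50_000_000;

    // Vec<bool> in Rust stores one full byte per element:
    let vec_bool_bytes = n;
    println!("Vec<bool>: {:.1} MB", vec_bool_bytes as f64 / (1024.0 * 1024.0)); // ~47.7 MB

    // Bit-packed and odd-only: 1 bit per odd number = 0.5 bits per number.
    let flat_bitpacked_bytes = n / 2 / 8;
    println!("flat bit-packed: {:.1} MB", flat_bitpacked_bytes as f64 / (1024.0 * 1024.0)); // ~3.0 MB

    // A segmented sieve never holds the whole range at once: a single 32 KB
    // window holds 32 * 1024 * 8 = 262_144 bits, i.e. 524_288 integers per pass.
    let segment_bytes = 32 * 1024;
    let numbers_per_segment = segment_bytes * 8 * 2;
    println!("one 32 KB segment covers {} numbers", numbers_per_segment);
}
```

So bit-packing alone buys roughly 16x over Vec<bool>, and segmentation buys the rest by reusing one cache-sized buffer instead of allocating for the whole range.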
Why This Matters (Beyond the Speed)
This project wasn't born from a need for primes. It was born from boredom and a desire to see a computer work.

Sometimes the best code doesn't come from a ticket or a spec sheet. It comes from staring at a blinking cursor at 2 AM, wondering, "I wonder if I can make this go faster..."
And then accidentally building the fastest thing in the room.
Check out the code (and the benchmarks) on GitHub:
👉 https://github.com/whisprer/primer
(Also, if you are an embedded Rust dev—ESP32/Raspberry Pi—this 32KB footprint is specifically for you. Go wild.)