The Origin Story: It Started With a "New Toy"
About six months ago, I was sitting at my desk with a serious case of coder’s block. I didn’t want to grind through existing bugs; I needed a win. A dopamine hit.
I also happened to be staring at a new (to me) monster of a workstation: 4.3GHz i7, 64GB RAM, Quadro GPU. It was a beast, and I realized I had never actually pushed it. I wanted to see the metal glow. I wanted to make the fans spin.
So I set a challenge: Write the absolute fastest, most CPU-punishing code possible in C++.
Math is heavy, so I landed on prime generation. I found a paper on optimizing the Sieve of Eratosthenes, dug into SIMD intrinsics (AVX2/AVX-512), and spent a frantic weekend obsessing over cache lines and cycle counts. I managed to beat the standard library implementations, watched my CPU hit 100% usage... and then, just like that, the dopamine faded.
"Cool. Next."
I shelved the code and forgot about it.
The Flashback
Fast forward to 3 days ago. I was compiling a Rust project, watching the dependencies scroll by.
Compiling rand...
Compiling primes...
Wait.
It hit me like a brick. I had an entire codebase of highly optimized, battle-tested prime generation logic sitting in a dusty C++ repo. I had already solved the hard parts—the bit-twiddling, the cache-locality, the memory layout.
Why wasn't I using it in Rust?
The Frenzy: Porting to Rust
The fear of losing the idea before I could implement it kicked in. I launched into a 12-hour coding fugue state.
The goal wasn't just a port; it was a reimagining. In C++, I was raw-dogging pointers. In Rust, I wanted that same performance but with safety and better ergonomics.
I ended up with Primer—a crate that is:
Bit-Packed: Uses 1 bit per odd number (effectively 0.5 bits per number).
Odd-Only: We hardcode 2 and ignore even numbers entirely.
Intrinsics-Powered: Uses trailing_zeros (compiles to tzcnt on x86) to scan the sieve instantly.
Kernighan-Iterated: Clears the lowest set bit with w & (w - 1), so iteration jumps straight from prime to prime and never touches runs of zeros.
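To make those four bullets concrete, here is a minimal, single-threaded sketch of the same tricks (bit-packing, odd-only storage, trailing_zeros scanning, Kernighan iteration). This is my illustration, not the Primer crate's actual code, and it omits the segmentation and SIMD work:

```rust
/// Bit-packed, odd-only sieve: bit i in `words` stands for the odd number 2*i + 3.
/// (2 is handled separately; even numbers are never stored.)
fn sieve_odd(limit: usize) -> Vec<u64> {
    let n_bits = if limit >= 3 { (limit - 1) / 2 } else { 0 };
    let mut words = vec![u64::MAX; (n_bits + 63) / 64];
    // Clear the padding bits in the last word so they never read as "prime".
    if n_bits % 64 != 0 {
        if let Some(last) = words.last_mut() {
            *last &= (1u64 << (n_bits % 64)) - 1;
        }
    }
    let mut i = 0;
    while (2 * i + 3) * (2 * i + 3) <= limit {
        if words[i / 64] >> (i % 64) & 1 == 1 {
            let p = 2 * i + 3;
            // The first composite worth crossing off is p*p; its bit index is (p*p - 3) / 2.
            let mut j = (p * p - 3) / 2;
            while j < n_bits {
                words[j / 64] &= !(1u64 << (j % 64));
                j += p; // next odd multiple of p: 2p away in value, p in bit index
            }
        }
        i += 1;
    }
    words
}

/// Collect primes by scanning set bits with trailing_zeros (tzcnt on x86)
/// and dropping the lowest set bit with Kernighan's w & (w - 1) trick.
fn primes_up_to(limit: usize) -> Vec<usize> {
    let mut out = Vec::new();
    if limit >= 2 {
        out.push(2);
    }
    for (wi, &word) in sieve_odd(limit).iter().enumerate() {
        let mut w = word;
        while w != 0 {
            let bit = w.trailing_zeros() as usize; // jump straight to the next prime
            out.push(2 * (wi * 64 + bit) + 3);
            w &= w - 1; // Kernighan: clear the lowest set bit, skipping zero runs
        }
    }
    out
}

fn main() {
    println!("{:?}", primes_up_to(50));
}
```

The payoff of the Kernighan loop is that sparse words cost almost nothing: a word with two primes in it takes two iterations, not sixty-four.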
The Result: 64x Faster & 95x Smaller
I didn't expect the Rust compiler to play this nice, but the results blew me away.
Benchmark (n = 50,000,000):
Standard Vec<bool> implementation: ~47 MB RAM
Primer (Segmented Buffer): ~32 KB RAM
Speed: Up to 64x faster than the primes crate in bulk generation.
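The memory numbers fall out of simple arithmetic. Here is a rough back-of-the-envelope sketch (the 32 KB segment size is the figure from above; the rest follows from the layout):

```rust
fn main() {
    let n: usize = 50_000_000;

    // Vec<bool> in Rust stores one full byte per element:
    let vec_bool_bytes = n;
    println!("Vec<bool>: {:.1} MB", vec_bool_bytes as f64 / (1024.0 * 1024.0)); // ~47.7 MB

    // Bit-packed and odd-only: 1 bit per odd number = 0.5 bits per number.
    let flat_bitpacked_bytes = n / 2 / 8;
    println!("flat bit-packed: {:.1} MB", flat_bitpacked_bytes as f64 / (1024.0 * 1024.0)); // ~3.0 MB

    // A segmented sieve never holds the whole range at once: a single 32 KB
    // window holds 32 * 1024 * 8 = 262_144 bits, i.e. 524_288 integers per pass.
    let segment_bytes = 32 * 1024;
    let numbers_per_segment = segment_bytes * 8 * 2;
    println!("one 32 KB segment covers {} numbers", numbers_per_segment);
}
```

So bit-packing alone buys roughly 16x over Vec<bool>, and segmentation buys the rest by reusing one cache-sized buffer instead of allocating for the whole range.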
Why This Matters (Beyond the Speed)
This project wasn't born from a need for primes. It was born from boredom and a desire to see a computer work.

Sometimes the best code doesn't come from a ticket or a spec sheet. It comes from staring at a blinking cursor at 2 AM, wondering, "I wonder if I can make this go faster..."
And then accidentally building the fastest thing in the room.
Check out the code (and the benchmarks) on GitHub:
👉 https://github.com/whisprer/primer
(Also, if you are an embedded Rust dev—ESP32/Raspberry Pi—this 32KB footprint is specifically for you. Go wild.)