How a high school project became the most dominant Pi-computing benchmark in the world, and what every software engineer can learn from it.
If someone told you a single program could stress-test your CPU, RAM, and storage simultaneously, recover from hardware failures mid-computation, run for 110 days straight, and spit out 314 trillion digits of Pi at the end, you'd probably assume it was built by a team of PhDs at a national lab.
It was built by one person. It started as a high school project. And it's been setting world records since 2009.
This is y-cruncher. Let's talk about it.
What Even Is y-cruncher?
y-cruncher is a multi-threaded, SIMD-vectorized program that computes mathematical constants (Pi, e, square roots, and more) to trillions of decimal digits. It's the tool of choice for:
- World record Pi computations (every record since 2009 has used it)
- CPU stress testing and overclocking validation
- Memory subsystem benchmarking
- Hardware stability detection (it'll find flaws that Prime95 and AIDA64 miss)
As of November 2025, the current world record stands at 314 trillion digits, computed in a single uninterrupted 110-day run on a 384-core AMD EPYC server. The verification took just 4.37 hours.
Why Should a Software Engineer Care?
Fair question. You're probably not computing Pi for a living. But y-cruncher is a goldmine of fascinating engineering decisions:
- It exploits SIMD instruction sets (SSE, AVX, AVX-512) at a level most production software never touches
- Its checkpoint-restart system is a masterclass in fault-tolerant long-running computation
- It implements custom memory allocators that outperform the OS for specific access patterns
- It demonstrates how multi-socket NUMA topology wreaks havoc on parallel performance, and how to fight back
- Its benchmark results expose the memory bandwidth ceiling that most workloads never hit but y-cruncher constantly runs into
In short: reading about y-cruncher will make you a better systems programmer, even if you never run it.
Getting Started: Installation in Under 2 Minutes
Windows
- Download y-cruncher v0.8.7.9547b.zip from the official site
- Extract and run y-cruncher.exe
- You may need the MSVC redistributable if you see DLL errors
Note: Antivirus false positives are common due to the low-level SIMD code. The binary is safe, and the statically linked version was reworked specifically to reduce false positives.
Linux
Choose between two variants based on your needs:
# Static: most portable, works on nearly any distro, no TBB/NUMA binding
wget http://www.numberworld.org/y-cruncher/y-cruncher\ v0.8.7.9547-static.tar.xz
tar -xf "y-cruncher v0.8.7.9547-static.tar.xz"
cd "y-cruncher v0.8.7.9547-static"
./y-cruncher
# Dynamic: full features (NUMA binding, TBB) but requires Ubuntu 24.04+ or compatible
wget http://www.numberworld.org/y-cruncher/y-cruncher\ v0.8.7.9547-dynamic.tar.xz
tar -xf "y-cruncher v0.8.7.9547-dynamic.tar.xz"
cd "y-cruncher v0.8.7.9547-dynamic"
./y-cruncher
System requirements:
- 64-bit x86/x64 processor
- Windows 8+ or any 64-bit Linux distro
- RAM: as much as you can get; more is almost always better
Running Your First Benchmark
When you launch y-cruncher, you'll get a console menu. For benchmarking:
- Select "Benchmark" from the main menu
- Choose a size (start with 250 million or 1 billion digits, both comfortable for most modern desktops)
- Watch it go
What you'll see reported:
Computation mode : Ram Only
Decimal Digits : 1,000,000,000
Hexadecimal Digits: 830,482,023
Start Date : ...
End Date : ...
Total Computation Time : 14.670 seconds
Total Verification Time: 10.421 seconds
Total Time : 25.091 seconds
Tip: "Total Computation Time" is the relevant benchmark number. "Total Time" includes verification, which is a separate algorithmic pass.
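The "Hexadecimal Digits" line in that output is just the decimal digit count converted to base 16. A minimal sketch of the conversion follows; the function name is made up for illustration, and rounding down is an assumption that happens to match the figure shown above.

```python
import math


def hex_digits_for(decimal_digits: int) -> int:
    """Hexadecimal digits that carry the same precision as `decimal_digits` decimal digits.

    One hex digit holds log10(16) ~= 1.204 decimal digits. Rounding down is an
    assumption made for this sketch.
    """
    return math.floor(decimal_digits / math.log10(16))


print(f"{hex_digits_for(1_000_000_000):,}")  # 830,482,023
```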
What are "good" numbers?
Here's a quick reference for 1 billion digits on common hardware (lower = better):
| Hardware | Time (seconds) |
|---|---|
| Ryzen 9 9950X (16C, DDR5-6000) | ~14.7s |
| Intel Core i9-13900KS | ~15.9s |
| Ryzen 9 7950X (16C, DDR5-5200) | ~16.8s |
| Ryzen 9 3950X (16C, DDR4-3200) | ~29.5s |
| Core i7-11800H (laptop, 60W) | ~32.3s |
If your number is significantly higher than expected for your hardware, it's usually a memory configuration issue (see below).
The Memory Bandwidth Trap: Why Your Expensive CPU Might Be Underperforming
This is one of the most practically useful things y-cruncher teaches.
y-cruncher is memory-bound. Almost completely. On every high-end desktop since ~2012, the CPU sits and waits for data. This means:
- GHz doesn't matter as much as memory bandwidth
- Unpopulated DIMM slots hurt you more than you think
- Memory frequency matters enormously
Real benchmark showing this on a Core i9-7940X:
| Memory speed | 10 billion digit time |
|---|---|
| DDR4-2666 | 365 seconds |
| DDR4-3466 | 322 seconds |
That's roughly 12% less run time from faster RAM alone, on the same CPU.
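To see where the percentage comes from, here is the arithmetic on the two timings in the table, next to the increase in memory transfer rate. Nothing here is measured by the script; the inputs are simply the table's numbers.

```python
t_slow, t_fast = 365.0, 322.0      # seconds, from the table above
mts_slow, mts_fast = 2666, 3466    # DDR4 transfer rates in MT/s

time_saved = (t_slow - t_fast) / t_slow     # fraction of run time removed
throughput_gain = t_slow / t_fast - 1       # fraction more work per second
memory_gain = mts_fast / mts_slow - 1       # fraction more transfers per second

print(f"run time reduced by {time_saved:.1%}")       # 11.8%
print(f"throughput up by    {throughput_gain:.1%}")  # 13.4%
print(f"memory clock up by  {memory_gain:.1%}")      # 30.0%
```

A 30% faster memory clock bought about 12 to 13%, so on this machine memory speed is a large factor but not the only one.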
How to maximize memory bandwidth for y-cruncher
# Checklist:
# 1. Populate ALL memory channels
# (4-channel platform = use 4 DIMMs, not 2)
# 2. Enable XMP/EXPO in BIOS
# Most DDR4/DDR5 kits ship at JEDEC defaults (3200/4800)
# XMP can push to 6000+ MT/s on DDR5
# 3. On Skylake-X specifically: also overclock L3 cache
# (L3 bandwidth is an additional bottleneck on that architecture)
# 4. Enable Large Pages (Windows)
# Run y-cruncher as Administrator for this to work
# Post-Spectre/Meltdown mitigations cause up to 5% overhead
# Large pages bypass the problematic page table walk
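A rough way to see why the first two checklist items matter is to compute theoretical peak DRAM bandwidth. The sketch below assumes the usual 64-bit (8-byte) data path per channel and ignores real-world efficiency losses, so treat the results as upper bounds rather than measurements. The function name is made up for illustration.

```python
def peak_bandwidth_gb_s(channels: int, mega_transfers: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes).

    Assumes each channel moves `bytes_per_transfer` bytes per transfer
    (8 bytes for a 64-bit channel).
    """
    return channels * mega_transfers * bytes_per_transfer / 1000


configs = [
    ("4 channels, DDR4-2666", 4, 2666),
    ("4 channels, DDR4-3466", 4, 3466),
    ("2 channels, DDR4-3466 (half the slots empty)", 2, 3466),
]
for label, channels, speed in configs:
    print(f"{label}: {peak_bandwidth_gb_s(channels, speed):.1f} GB/s")
```

On paper, leaving half the channels empty costs far more bandwidth than the gap between the two memory speeds.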
The Engineering Marvel: Checkpoint-Restart
Here's what separates y-cruncher from just being a fast calculator.
Computing 300 trillion digits of Pi takes months. On commodity hardware. With power outages, kernel panics, memory errors, and cosmic ray bit flips all waiting to destroy your work.
y-cruncher's solution is a robust checkpoint-restart system that:
- Periodically snapshots computation state to disk
- Verifies checkpoints with redundancy checks (catching hardware bit errors before they propagate)
- Resumes from the last valid checkpoint after any interruption, even one where a bug in y-cruncher itself had to be fixed before the computation was restarted
Several world record computations have survived bugs in y-cruncher itself because the checkpoint infrastructure caught the error, allowed a fix, and resumed from the last valid state. That's some serious fault tolerance engineering.
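The general pattern (snapshot, verify, resume) can be sketched in a few lines. This is not y-cruncher's implementation or file format. It is a toy example under stated assumptions: state is a small JSON dictionary, integrity is a SHA-256 digest, and all function names are hypothetical.

```python
import hashlib
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state plus a SHA-256 digest, then atomically swap the file into place."""
    payload = json.dumps(state, sort_keys=True)
    record = {"sha256": hashlib.sha256(payload.encode()).hexdigest(), "payload": payload}
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(record, f)
        f.flush()
        os.fsync(f.fileno())    # push the bytes to disk before the rename
    os.replace(tmp_path, path)  # atomic: the old checkpoint stays valid until this point


def load_checkpoint(path):
    """Return the saved state, or None if the file is missing or fails its checksum."""
    try:
        with open(path) as f:
            record = json.load(f)
        payload = record["payload"]
        if hashlib.sha256(payload.encode()).hexdigest() != record["sha256"]:
            return None
        return json.loads(payload)
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        return None


def sum_of_squares(path, total_steps, checkpoint_every=1000, crash_after=None):
    """Toy long-running job that resumes from `path` when a valid checkpoint exists."""
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    executed = 0
    while state["step"] < total_steps:
        if crash_after is not None and executed == crash_after:
            return None  # simulated crash: progress since the last checkpoint is lost
        state["acc"] += state["step"] ** 2
        state["step"] += 1
        executed += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "state.json")
    sum_of_squares(ckpt, 5000, crash_after=2500)   # dies halfway through
    resumed_from = load_checkpoint(ckpt)["step"]   # last snapshot that reached disk
    result = sum_of_squares(ckpt, 5000)            # second run picks up from there

print(resumed_from, result == sum(i * i for i in range(5000)))  # 2000 True
```

Writing to a temporary file and renaming it means a crash during the save can never leave a half-written checkpoint as the only copy.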
The 314 trillion digit record in November 2025 is remarkable specifically because it was the first recent record achieved without checkpointing: a single uninterrupted 110-day run. This is described as Storage Review's third attempt; the previous two were stopped by hardware/software issues.
For the software engineers in the room: this is the kind of fault-tolerance problem usually associated with distributed systems, solved elegantly in a desktop application.
Advanced: The NUMA Problem (And Why Multi-Socket Is Hard)
If you're running y-cruncher on a workstation with two CPUs, or benchmarking cloud instances, pay attention.
On multi-socket (NUMA) systems, memory access latency and bandwidth are asymmetric. Core 0 on Socket 0 reaches Socket 0's RAM in ~80ns. It reaches Socket 1's RAM in ~140ns. If your threads are spread across both sockets but chasing the same data, you'll hit contention that tanks throughput.
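As a back-of-the-envelope illustration, the average latency a thread sees depends on how often it has to cross sockets. The sketch uses the two example latencies quoted above and a simple weighted average. It ignores bandwidth contention and caching, so it is a lower bound on the damage, not a model of any specific machine.

```python
LOCAL_NS = 80.0    # same-socket access latency, from the example above
REMOTE_NS = 140.0  # cross-socket access latency, from the example above


def average_latency_ns(remote_fraction: float) -> float:
    """Weighted average memory latency when `remote_fraction` of accesses cross sockets."""
    return (1.0 - remote_fraction) * LOCAL_NS + remote_fraction * REMOTE_NS


for fraction in (0.0, 0.25, 0.5, 1.0):
    print(f"{fraction:.0%} remote: {average_latency_ns(fraction):.0f} ns")
```

With threads spread evenly over two sockets and no locality, about half of all accesses are remote, which is the 110 ns row.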
y-cruncher's documentation is explicit about this:
- Benchmark numbers on multi-socket machines "may not be entirely representative of what the hardware is capable of"
- The Push Pool vs Cilk Plus scheduler choice matters for machines with >64 cores
- Node-interleaving in the BIOS should be disabled on Windows systems with >64 logical cores (otherwise you get imbalanced processor groups)
Load imbalance is most likely when:
- Core count is not a power of two
- Cores are heterogeneous (hybrid architectures like Intel's P+E core designs)
- Background processes stealing cycles from any single thread
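One simplified way to picture the power-of-two effect: assume the work is split recursively in half, producing a power-of-two number of equal tasks, and that every core takes one task per round. That splitting model is an assumption made for illustration, not a description of y-cruncher's scheduler.

```python
import math


def parallel_efficiency(tasks: int, cores: int) -> float:
    """Fraction of core-time spent on useful work when equal tasks run in lockstep rounds."""
    rounds = math.ceil(tasks / cores)  # the slowest core decides when a round ends
    return tasks / (rounds * cores)


for cores in (8, 12, 16, 24):
    print(f"16 tasks on {cores} cores: {parallel_efficiency(16, cores):.0%} efficient")
```

In this model 12 cores finish 16 tasks no faster than 8 cores would, because a second round is needed either way.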
Using y-cruncher for Stress Testing (The Real Killer App)
Many overclockers and hardware enthusiasts use y-cruncher specifically because it's uniquely good at exposing instability:
It simultaneously maxes out CPU computation AND the entire memory subsystem. Most stress tests only hit one or the other. y-cruncher hits both at the same time.
The telltale error you'll see on unstable hardware:
Redundancy Check Failed: Coefficient is too large
This means an internal consistency check on an intermediate result failed, so either the CPU computed something wrong or memory delivered corrupted data.
If you see this on a machine you thought was stable: don't ignore it. This is y-cruncher doing exactly what it's designed to do.
Common causes:
- RAM running at XMP/EXPO speeds without sufficient voltage
- CPU overclocked too aggressively (especially with AVX offsets not set)
- Thermal throttling corrupting in-flight computation
- Subtimings too tight on the memory controller
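To make the idea of a redundancy check concrete, here is one classic technique for big-number arithmetic: verify a large product against a cheap modular identity. This is a generic sketch, not y-cruncher's actual check. The modulus, the exception class, and the deliberately faulty multiplier are all invented for the example.

```python
CHECK_MODULUS = (1 << 61) - 1  # a Mersenne prime, chosen here purely for illustration


class RedundancyCheckError(ArithmeticError):
    """Raised when a product disagrees with its modular cross-check."""


def checked_multiply(a: int, b: int, multiply) -> int:
    """Multiply with `multiply`, then confirm (a*b) mod m == ((a mod m)*(b mod m)) mod m."""
    product = multiply(a, b)
    expected = (a % CHECK_MODULUS) * (b % CHECK_MODULUS) % CHECK_MODULUS
    if product % CHECK_MODULUS != expected:
        raise RedundancyCheckError("product fails modular cross-check")
    return product


def healthy_multiply(a: int, b: int) -> int:
    return a * b


def faulty_multiply(a: int, b: int) -> int:
    # Simulates unstable hardware by flipping one bit of the result.
    return (a * b) ^ (1 << 100)


a, b = 3**200, 7**150
print(checked_multiply(a, b, healthy_multiply) == a * b)
try:
    checked_multiply(a, b, faulty_multiply)
except RedundancyCheckError as err:
    print("caught:", err)
```

The check costs a few small modular operations next to a huge multiplication, which is why this style of verification can run constantly during a long computation.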
Understanding the Algorithms (For the Curious)
y-cruncher uses two independent algorithms for most constants:
Computation pass: the Chudnovsky algorithm for Pi, a rapidly converging series in which each term adds about 14.18 decimal digits. The challenge is evaluating it in arbitrary precision, which requires implementing arithmetic on numbers with trillions of digits.
Verification pass: A different formula (e.g., Ramanujan's formula, or Bailey-Borwein-Plouffe variants) to independently confirm the result. If both passes agree to the last digit, the result is almost certainly correct.
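The Chudnovsky series from the computation pass fits in a short script when Python's built-in big integers do the heavy lifting. This is a plain term-by-term summation in fixed-point arithmetic, fine for thousands of digits. It is nowhere near how a record-scale program organizes the work, and the function name and guard-digit count are choices made for this sketch.

```python
import math


def chudnovsky_pi(digits: int, guard: int = 10) -> int:
    """Return floor(pi * 10**digits) using the Chudnovsky series in integer fixed point."""
    one = 10 ** (digits + guard)          # fixed-point representation of 1.0
    c3_over_24 = 640320**3 // 24          # exact: 640320**3 is divisible by 24
    magnitude = one                       # |a_k| for k = 0
    a_sum = one                           # sum of a_k
    b_sum = 0                             # sum of k * a_k
    k = 0
    while magnitude:
        k += 1
        # Ratio of consecutive terms: (6k-5)(2k-1)(6k-1) / (k^3 * 640320^3 / 24)
        magnitude = magnitude * ((6 * k - 5) * (2 * k - 1) * (6 * k - 1)) // (k**3 * c3_over_24)
        term = -magnitude if k % 2 else magnitude   # signs alternate
        a_sum += term
        b_sum += k * term
    total = 13591409 * a_sum + 545140134 * b_sum
    sqrt_10005 = math.isqrt(10005 * one * one)      # sqrt(10005) in fixed point
    pi_fixed = 426880 * sqrt_10005 * one // total
    return pi_fixed // 10**guard                    # drop the guard digits


# Each term shrinks the remainder by a factor of about 53360**3.
digits_per_term = 3 * math.log10(53360)
print(f"{digits_per_term:.2f} digits per term")     # 14.18
print(chudnovsky_pi(30))                            # 3141592653589793238462643383279
```

The guard digits absorb the rounding error from the integer divisions, so the returned digits are truncated rather than rounded.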
The interesting engineering is in the big number arithmetic:
- Uses Number Theoretic Transforms (NTTs) β the modular arithmetic equivalent of FFTs β for multiplication
- Exploits SIMD vector units (AVX-512 on modern hardware) to parallelize the transform butterflies
- Implements cache-oblivious algorithms for better memory access patterns during the transform
The result: multiplication of two N-digit numbers in O(N log N) time instead of O(N²). At a trillion digits, this difference is the gap between "computationally feasible" and "computationally impossible."
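A textbook NTT-based multiplication shows the idea at toy scale. The sketch below uses one well-known NTT-friendly prime (998244353, with primitive root 3) and one decimal digit per coefficient. Those are simplifying assumptions that cap the input size, and they say nothing about the primes, word sizes, or memory layout y-cruncher actually uses.

```python
MOD = 998244353   # 119 * 2**23 + 1, a common NTT-friendly prime
ROOT = 3          # primitive root modulo MOD


def ntt(a: list, invert: bool = False) -> None:
    """In-place iterative number-theoretic transform; len(a) must be a power of two."""
    n = len(a)
    j = 0
    for i in range(1, n):                 # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w_len = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):   # the butterflies
                u = a[k]
                v = a[k + half] * w % MOD
                a[k] = (u + v) % MOD
                a[k + half] = (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        for i in range(n):
            a[i] = a[i] * n_inv % MOD


def ntt_multiply(x: int, y: int) -> int:
    """Multiply two non-negative integers by convolving their decimal digits with an NTT."""
    a = [int(d) for d in reversed(str(x))]
    b = [int(d) for d in reversed(str(y))]
    size = 1
    while size < len(a) + len(b):
        size <<= 1
    a += [0] * (size - len(a))
    b += [0] * (size - len(b))
    ntt(a)
    ntt(b)
    c = [p * q % MOD for p, q in zip(a, b)]   # pointwise product = convolution
    ntt(c, invert=True)
    result, carry = [], 0
    for coeff in c:                           # propagate carries back to base 10
        carry += coeff
        result.append(carry % 10)
        carry //= 10
    while carry:
        result.append(carry % 10)
        carry //= 10
    return int("".join(map(str, reversed(result))))


x = 31415926535897932384626433832795
y = 27182818284590452353602874713527
print(ntt_multiply(x, y) == x * y)
```

With one digit per coefficient, every convolution sum stays below the prime as long as the inputs have fewer than roughly 12 million digits combined. Production code packs far more into each word and combines several primes.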
Practical Takeaways for Software Engineers
Whether you run y-cruncher or never touch it, here's what to take away:
1. Memory bandwidth is often your real bottleneck. Profile for it. Don't just assume your algorithm is CPU-bound.
2. Fault tolerance is a first-class feature. Several world records wouldn't exist without checkpoint-restart. Think about what your long-running jobs do when a node dies.
3. NUMA topology changes everything in parallel code. Thread affinity and memory locality matter more than raw core count on multi-socket systems.
4. SIMD is still a performance multiplier worth understanding. A 4× speedup from AVX2 or 8× from AVX-512 is not unusual for data-parallel numerical code. Compilers help, but hand-tuning helps more.
5. Stress test with something that actually stresses. y-cruncher's simultaneous CPU + memory load finds problems that single-threaded or DRAM-only tests miss entirely.
Where to Go From Here
- Download y-cruncher: numberworld.org/y-cruncher
- GitHub mirror: Available for HTTPS downloads (linked from the main site)
- Benchmarks leaderboard: Rankings from 25 million to 1 trillion digits are published on the site
- Advanced docs: The site has deep documentation on multi-threading internals, memory allocation strategies, and swap mode configuration
- Mersenneforum subforum: For discussion with the community doing record-level runs
The fact that a program that started as a high school project now drives the world's most demanding computational records, and teaches systems programmers lessons about memory bandwidth, fault tolerance, NUMA, and SIMD, is genuinely inspiring.
Go run it. Break your overclock. Then fix it and run it again.