Gouranga Das Samrat

Posted on Jun 11

🥧 314 Trillion Digits of Pi: The Software Engineering Secrets Behind y-cruncher

#cpp #algorithms #computerscience #performance

How a high school project became the most dominant Pi-computing benchmark in the world — and what every software engineer can learn from it.

If someone told you a single program could stress-test your CPU, RAM, and storage simultaneously, recover from hardware failures mid-computation, run for 110 days straight, and spit out 314 trillion digits of Pi at the end — you'd probably assume it was built by a team of PhDs at a national lab.

It was built by one person. It started as a high school project. And it's been setting world records since 2009.

This is y-cruncher. Let's talk about it.

What Even Is y-cruncher?

y-cruncher is a multi-threaded, SIMD-vectorized program that computes mathematical constants — Pi, e, square roots, and more — to trillions of decimal digits. It's the tool of choice for:

World record Pi computations (every record since 2010 has used it)
CPU stress testing and overclocking validation
Memory subsystem benchmarking
Hardware stability detection (it can expose instabilities that Prime95 and AIDA64 miss)

As of November 2025, the current world record stands at 314 trillion digits, computed in a single uninterrupted 110-day run on a 384-core AMD EPYC server. The verification took just 4.37 hours.

Why Should a Software Engineer Care?

Fair question. You're probably not computing Pi for a living. But y-cruncher is a goldmine of fascinating engineering decisions:

It exploits SIMD instruction sets (SSE, AVX, AVX-512) at a level most production software never touches
Its checkpoint-restart system is a masterclass in fault-tolerant, long-running computation
It implements custom memory allocators that outperform the OS for specific access patterns
It demonstrates how multi-socket NUMA topology wreaks havoc on parallel performance — and how to fight back
Its benchmark results expose the memory bandwidth ceiling that most workloads never hit but y-cruncher constantly runs into

In short: reading about y-cruncher will make you a better systems programmer, even if you never run it.

Getting Started: Installation in Under 2 Minutes

Windows

Download y-cruncher v0.8.7.9547b.zip from the official site
Extract and run y-cruncher.exe
You may need the MSVC redistributable if you see DLL errors

Note: Antivirus false positives are common due to the low-level SIMD code. The binary is safe, and the statically linked version was reworked specifically to reduce false positives.

Linux

Choose between two variants based on your needs:

# Static: most portable, works on nearly any distro, no TBB/NUMA binding
wget http://www.numberworld.org/y-cruncher/y-cruncher\ v0.8.7.9547-static.tar.xz
tar -xf "y-cruncher v0.8.7.9547-static.tar.xz"
cd "y-cruncher v0.8.7.9547-static"    # adjust if the extracted directory name differs
./y-cruncher

# Dynamic: full features (NUMA binding, TBB) but requires Ubuntu 24.04+ or compatible
wget http://www.numberworld.org/y-cruncher/y-cruncher\ v0.8.7.9547-dynamic.tar.xz
tar -xf "y-cruncher v0.8.7.9547-dynamic.tar.xz"
cd "y-cruncher v0.8.7.9547-dynamic"   # adjust if the extracted directory name differs
./y-cruncher

System requirements:

64-bit x86/x64 processor
Windows 8+ or any 64-bit Linux distro
RAM: as much as you can get — more is almost always better

Running Your First Benchmark

When you launch y-cruncher, you'll get a console menu. For benchmarking:

Select "Benchmark" from the main menu
Choose a size (start with 250 million or 1 billion digits — comfortable for most modern desktops)
Watch it go

What you'll see reported:

Computation mode : Ram Only
Decimal Digits   : 1,000,000,000
Hexadecimal Digits: 830,482,023

Start Date       : ...
End Date         : ...

Total Computation Time : 14.670 seconds
Total Verification Time: 10.421 seconds
Total Time             : 25.091 seconds

Tip: "Total Computation Time" is the relevant benchmark number. "Total Time" includes verification, which is a separate algorithmic pass.
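The decimal and hexadecimal digit counts in that report describe the same precision in two bases. Here is a minimal sketch of the conversion, assuming the reported count is truncated rather than rounded. The function name is made up for illustration and is not part of y-cruncher.

```python
import math

def hex_digits_for(decimal_digits: int) -> int:
    """Hex digits carrying the same precision as `decimal_digits` decimal digits.

    One decimal digit holds log(10)/log(16) ~= 0.83 hexadecimal digits.
    Assumption: the result is truncated, which matches the sample report above.
    """
    return math.floor(decimal_digits * math.log(10) / math.log(16))

print(f"{hex_digits_for(1_000_000_000):,}")
```

For one billion decimal digits this gives the 830,482,023 shown in the sample output.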

What are "good" numbers?

Here's a quick reference for 1 billion digits on common hardware (lower = better):

Hardware	Time (seconds)
Ryzen 9 9950X (16C, DDR5-6000)	~14.7s
Intel Core i9-13900KS	~15.9s
Ryzen 9 7950X (16C, DDR5-5200)	~16.8s
Ryzen 9 3950X (16C, DDR4-3200)	~29.5s
Core i7-11800H (laptop, 60W)	~32.3s

If your number is significantly higher than expected for your hardware, it's usually a memory configuration issue (see below).

The Memory Bandwidth Trap: Why Your Expensive CPU Might Be Underperforming

This is one of the most practically useful things y-cruncher teaches.

y-cruncher is memory-bound. Almost completely. On every high-end desktop since ~2012, the CPU sits and waits for data. This means:

GHz doesn't matter as much as memory bandwidth
Unpopulated DIMM slots hurt you more than you think
Memory frequency matters enormously

Real benchmark showing this on a Core i9-7940X:

Memory speed	10 billion digit time
DDR4-2666	365 seconds
DDR4-3466	322 seconds

That's about 12% less run time (roughly a 1.13× speedup) just from faster RAM on the same CPU.
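To make the arithmetic behind that comparison explicit, here is a small sketch using only the two rows of the table above. The helper name is hypothetical.

```python
def compare_runs(slow_time_s: float, fast_time_s: float,
                 slow_rate_mts: float, fast_rate_mts: float) -> dict:
    """Summarize how a change in memory transfer rate affected run time."""
    return {
        # Fraction of wall time saved by the faster configuration.
        "time_saved": (slow_time_s - fast_time_s) / slow_time_s,
        # Classic speedup ratio: old time divided by new time.
        "speedup": slow_time_s / fast_time_s,
        # How much faster the memory itself was rated.
        "rate_ratio": fast_rate_mts / slow_rate_mts,
    }

result = compare_runs(365, 322, 2666, 3466)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

The memory transfer rate went up by about 30%, while the run got about 13% faster. Memory speed clearly matters, but the gain is smaller than the raw rate increase. That fits the checklist's note below that L3 cache bandwidth is an additional bottleneck on Skylake-X, the family this CPU belongs to.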

How to maximize memory bandwidth for y-cruncher

# Checklist:
# 1. Populate ALL memory channels
#    (4-channel platform = use 4 DIMMs, not 2)

# 2. Enable XMP/EXPO in BIOS
#    Until a profile is enabled, kits run at JEDEC default speeds (3200/4800 MT/s or lower)
#    XMP/EXPO profiles can reach 6000+ MT/s on DDR5

# 3. On Skylake-X specifically: also overclock L3 cache
#    (L3 bandwidth is an additional bottleneck on that architecture)

# 4. Enable Large Pages (Windows)
#    Run y-cruncher as Administrator for this to work
#    Spectre/Meltdown mitigations can add up to 5% overhead
#    Large pages reduce TLB misses, so far fewer page table walks are needed
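To see why items 1 and 2 matter so much, it helps to estimate theoretical peak bandwidth. This is a rough sketch that assumes a standard 64-bit (8-byte) data path per memory channel and ignores real-world efficiency losses. The function name is illustrative.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int,
                       bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s.

    Assumption: each channel moves 8 bytes (64 bits) per transfer.
    Sustained bandwidth in practice is lower than this ceiling.
    """
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9

configs = [
    ("4-channel platform, only 2 DIMMs, DDR4-2666", 2, 2666),
    ("4-channel platform, all 4 DIMMs, DDR4-2666", 4, 2666),
    ("4-channel platform, all 4 DIMMs, DDR4-3466", 4, 3466),
    ("2-channel desktop, DDR5-4800 (JEDEC)", 2, 4800),
    ("2-channel desktop, DDR5-6000 (XMP/EXPO)", 2, 6000),
]
for label, channels, rate in configs:
    print(f"{label}: {peak_bandwidth_gbs(channels, rate):.1f} GB/s")
```

Leaving half the channels empty halves the ceiling, which is a far bigger hit than any frequency change on the list.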

The Engineering Marvel: Checkpoint-Restart

Here's what separates y-cruncher from just being a fast calculator.

Computing 300 trillion digits of Pi takes months. On commodity hardware. With power outages, kernel panics, memory errors, and cosmic ray bit flips all waiting to destroy your work.

y-cruncher's solution is a robust checkpoint-restart system that:

Periodically snapshots computation state to disk
Verifies checkpoints with redundancy checks (catching hardware bit errors before they propagate)
Resumes automatically after any interruption — even after software bugs in y-cruncher itself have been fixed and the computation re-started

Several world record computations have survived bugs in y-cruncher itself because the checkpoint infrastructure caught the error, allowed a fix, and resumed from the last valid state. That's some serious fault tolerance engineering.
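The general pattern is easy to sketch, even though y-cruncher's real implementation is far more involved. The toy example below is not y-cruncher's checkpoint format. It assumes a JSON state file, a SHA-256 checksum to detect corruption, and an atomic rename so a crash never leaves a half-written checkpoint. All names are hypothetical.

```python
import hashlib
import json
import os

def save_checkpoint(path: str, state: dict) -> None:
    """Write state plus a checksum, atomically replacing any older checkpoint."""
    payload = json.dumps(state, sort_keys=True)
    record = {"payload": payload,
              "sha256": hashlib.sha256(payload.encode()).hexdigest()}
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(record, f)
        f.flush()
        os.fsync(f.fileno())      # make sure the bytes reach the disk
    os.replace(tmp_path, path)    # atomic: readers see the old or the new file

def load_checkpoint(path: str):
    """Return the saved state, or None if the file is missing or corrupted."""
    try:
        with open(path) as f:
            record = json.load(f)
        payload = record["payload"]
        expected = record["sha256"]
    except (OSError, ValueError, KeyError, TypeError):
        return None
    if hashlib.sha256(payload.encode()).hexdigest() != expected:
        return None               # checksum mismatch: do not trust this state
    return json.loads(payload)

def sum_of_squares(n: int, path: str, every: int = 1000, crash_at=None) -> int:
    """Toy long-running job: sums i*i for i < n, resuming from a checkpoint."""
    state = load_checkpoint(path) or {"next_i": 0, "total": 0}
    i, total = state["next_i"], state["total"]
    while i < n:
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated power loss")
        total += i * i
        i += 1
        if i % every == 0:
            save_checkpoint(path, {"next_i": i, "total": total})
    return total
```

If a run crashes partway, calling `sum_of_squares` again with the same path picks up from the last valid checkpoint instead of starting over. Only the work since that checkpoint is lost.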

The 314 trillion digit record in November 2025 is remarkable specifically because it was the first recent record achieved without checkpointing: a single uninterrupted 110-day run. It is described as StorageReview's third attempt; the previous two were stopped by hardware/software issues.

For the software engineers in the room: this is the kind of fault-tolerance problem usually associated with distributed systems, solved elegantly in a single-machine application.

Advanced: The NUMA Problem (And Why Multi-Socket Is Hard)

If you're running y-cruncher on a workstation with two CPUs — or benchmarking cloud instances — pay attention.

On multi-socket (NUMA) systems, memory access latency and bandwidth are asymmetric. Core 0 on Socket 0 reaches Socket 0's RAM in ~80ns. It reaches Socket 1's RAM in ~140ns. If your threads are spread across both sockets but chasing the same data, you'll hit contention that tanks throughput.
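A back-of-the-envelope model shows how quickly remote accesses add up. This sketch uses the approximate 80 ns and 140 ns figures above and assumes every access is either fully local or fully remote. It is an illustration, not a measurement.

```python
def mean_latency_ns(remote_fraction: float,
                    local_ns: float = 80.0,
                    remote_ns: float = 140.0) -> float:
    """Average memory latency when a given fraction of accesses cross sockets."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns

for fraction in (0.0, 0.25, 0.5, 1.0):
    print(f"{fraction:.0%} remote -> {mean_latency_ns(fraction):.0f} ns average")
```

With data spread evenly across two sockets and no thread pinning, about half of all accesses are remote, so the average lands well above the local figure. Latency is only part of the story, since the inter-socket link also caps bandwidth.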

y-cruncher's documentation is explicit about this:

Benchmark numbers on multi-socket machines "may not be entirely representative of what the hardware is capable of"
The Push Pool vs Cilk Plus scheduler choice matters for machines with >64 cores
Node-interleaving in the BIOS should be disabled on Windows systems with >64 logical cores (otherwise you get imbalanced processor groups)

Load imbalance is most likely when:

Core count is not a power of two
Cores are heterogeneous (hybrid architectures like Intel's P+E core designs)
Background processes stealing cycles from any single thread
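A simplified model shows why each of these hurts. It assumes work is split by repeated halving into a power-of-two number of equal tasks and that the run finishes only when the slowest core does. That is a deliberate simplification, not a description of y-cruncher's actual schedulers.

```python
import math

def parallel_efficiency(tasks: int, cores: int) -> float:
    """Fraction of core time doing useful work when equal tasks run in rounds."""
    rounds = math.ceil(tasks / cores)   # the last round may leave cores idle
    return tasks / (rounds * cores)

def static_split_time(total_work: float, core_speeds: list) -> float:
    """Finish time when work is divided equally: the slowest core decides."""
    share = total_work / len(core_speeds)
    return share / min(core_speeds)

def ideal_time(total_work: float, core_speeds: list) -> float:
    """Finish time if work were perfectly balanced by core speed."""
    return total_work / sum(core_speeds)

print(parallel_efficiency(16, 16))   # power-of-two core count
print(parallel_efficiency(16, 12))   # 12 cores: second round uses only 4
print(static_split_time(16, [2, 2, 1, 1]), ideal_time(16, [2, 2, 1, 1]))
```

In this model, 16 equal tasks on 12 cores waste a third of the machine, and a mix of fast and slow cores runs at the pace of the slow ones unless the scheduler rebalances.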

Using y-cruncher for Stress Testing (The Real Killer App)

Many overclockers and hardware enthusiasts use y-cruncher specifically because it's uniquely good at exposing instability:

It simultaneously maxes out CPU computation AND the entire memory subsystem. Most stress tests only hit one or the other. y-cruncher hits both at the same time.

The telltale error you'll see on unstable hardware:

Redundancy Check Failed: Coefficient is too large

This means one of y-cruncher's internal redundancy checks caught an intermediate result that cannot be correct. The cause is either the CPU computing something wrong or memory delivering corrupted data.

If you see this on a machine you thought was stable: don't ignore it. This is y-cruncher doing exactly what it's designed to do.
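The idea behind a redundancy check is that verifying a result can be far cheaper than computing it. The sketch below shows one classic technique, checking a big multiplication modulo a prime. The choice of prime is an assumption made for illustration, and this is not a claim about which specific checks y-cruncher runs.

```python
CHECK_PRIME = 2**61 - 1  # a Mersenne prime, chosen here only for illustration

def product_passes_check(a: int, b: int, claimed_product: int,
                         p: int = CHECK_PRIME) -> bool:
    """Cheap sanity check: a*b and the claimed product must agree modulo p.

    A correct product always passes. A corrupted one slips through only if
    the error happens to be a multiple of p.
    """
    return (a % p) * (b % p) % p == claimed_product % p

a, b = 3**200, 7**150
good = a * b
bad = good ^ (1 << 97)   # simulate a single flipped bit in the result

print(product_passes_check(a, b, good))
print(product_passes_check(a, b, bad))
```

A single flipped bit changes the product by a power of two, which is never a multiple of an odd prime, so this check always catches it. That is the same principle at work when unstable hardware trips an error: the check is cheap, runs often, and fails loudly.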

Common causes:

RAM running at XMP/EXPO speeds without sufficient voltage
CPU overclocked too aggressively (especially with AVX offsets not set)
Insufficient cooling (errors can appear at high temperatures, even before throttling kicks in)
Subtimings too tight on the memory controller

Understanding the Algorithms (For the Curious)

y-cruncher uses two independent algorithms for most constants:

Computation pass: Chudnovsky algorithm for Pi (exponentially converging series). Each term adds ~14.18 digits of Pi. The challenge is computing this in arbitrary precision — which requires implementing arithmetic on numbers with trillions of digits.

Verification pass: A different formula (e.g., Ramanujan's formula, or Bailey-Borwein-Plouffe variants) to independently confirm the result. If both passes agree to the last digit, the result is almost certainly correct.
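Both ideas fit in a few lines of Python at toy scale. The sketch below is not y-cruncher's code. It sums the Chudnovsky series with plain integers, then spot-checks hexadecimal digits with the classic BBP formula, which can compute digits at a given position without the ones before it. The function names, guard-digit counts, and the choice of the original BBP formula are assumptions for illustration.

```python
import math
from math import isqrt

# Each Chudnovsky term contributes log10(640320**3 / 1728) decimal digits.
DIGITS_PER_TERM = math.log10(640320**3 / 1728)

def chudnovsky_pi_scaled(one: int) -> int:
    """Approximate floor(pi * one) with integer-only Chudnovsky summation."""
    c3_over_24 = 640320**3 // 24
    k = 1
    a_k = one
    a_sum = one
    b_sum = 0
    while a_k != 0:
        a_k *= -(6 * k - 5) * (2 * k - 1) * (6 * k - 1)
        a_k //= k * k * k * c3_over_24
        a_sum += a_k
        b_sum += k * a_k
        k += 1
    total = 13591409 * a_sum + 545140134 * b_sum
    return 426880 * isqrt(10005 * one * one) * one // total

def pi_hex_fraction(digits: int, guard: int = 8) -> str:
    """First `digits` hex digits of pi after the point, via Chudnovsky."""
    scaled = chudnovsky_pi_scaled(16 ** (digits + guard)) >> (4 * guard)
    return format(scaled, "x")[1:]          # drop the leading "3"

def _bbp_series(j: int, n: int) -> float:
    """Fractional part of sum over k of 16**(n-k) / (8k + j)."""
    s = 0.0
    for k in range(n + 1):                  # terms with non-negative exponent
        s = (s + pow(16, n - k, 8 * k + j) / (8 * k + j)) % 1.0
    k = n + 1
    while True:                             # rapidly shrinking tail
        term = 16.0 ** (n - k) / (8 * k + j)
        if term < 1e-17:
            break
        s += term
        k += 1
    return s % 1.0

def bbp_hex_digits(n: int, count: int = 8) -> str:
    """`count` hex digits of pi starting at position n+1 after the point."""
    x = (4 * _bbp_series(1, n) - 2 * _bbp_series(4, n)
         - _bbp_series(5, n) - _bbp_series(6, n)) % 1.0
    digits = ""
    for _ in range(count):
        x *= 16
        d = int(x)
        digits += "0123456789abcdef"[d]
        x -= d
    return digits

print(chudnovsky_pi_scaled(10**60) // 10**10)    # pi to 50 decimals, as an integer
print(pi_hex_fraction(32)[20:28], bbp_hex_digits(20, 8))
```

The last line prints the same eight hex digits twice, once from each method. Because this BBP version uses double-precision floats, it is only reliable for a handful of digits per call.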

The interesting engineering is in the big number arithmetic:

Uses Number Theoretic Transforms (NTTs) — the modular arithmetic equivalent of FFTs — for multiplication
Exploits SIMD vector units (AVX-512 on modern hardware) to parallelize the transform butterflies
Implements cache-oblivious algorithms for better memory access patterns during the transform

The result: multiplication of two N-digit numbers in O(N log N) time instead of O(N²). At a trillion digits, this difference is the gap between "computationally feasible" and "computationally impossible."
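Here is what NTT-based multiplication looks like in miniature. This is a textbook sketch, not y-cruncher's implementation. It assumes a single NTT-friendly prime (998244353 with primitive root 3, a common teaching choice), one decimal digit per coefficient, and non-negative inputs small enough that no coefficient overflows the modulus.

```python
MOD = 998244353   # 119 * 2**23 + 1, so power-of-two transform sizes up to 2**23 exist
ROOT = 3          # a primitive root modulo MOD

def ntt(a: list, invert: bool = False) -> None:
    """In-place iterative number theoretic transform; len(a) must be a power of two."""
    n = len(a)
    j = 0
    for i in range(1, n):                   # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w_len = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):    # the butterflies
                u = a[k]
                v = a[k + half] * w % MOD
                a[k] = (u + v) % MOD
                a[k + half] = (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        for i in range(n):
            a[i] = a[i] * n_inv % MOD

def multiply(x: int, y: int) -> int:
    """Multiply two non-negative integers via NTT convolution of their digits."""
    a = [int(d) for d in reversed(str(x))]
    b = [int(d) for d in reversed(str(y))]
    size = 1
    while size < len(a) + len(b):
        size <<= 1
    a += [0] * (size - len(a))
    b += [0] * (size - len(b))
    ntt(a)
    ntt(b)
    c = [p * q % MOD for p, q in zip(a, b)]   # pointwise product
    ntt(c, invert=True)
    # Recombine coefficients; Python's big integers absorb the carries.
    return sum(coef * 10**i for i, coef in enumerate(c))

print(multiply(123456789, 987654321) == 123456789 * 987654321)
```

The inner butterfly loop is the part that SIMD units chew through in a real implementation, and the strided access pattern across `a` is why memory layout matters so much at scale.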

Practical Takeaways for Software Engineers

Whether you run y-cruncher or never touch it, here's what to take away:

1. Memory bandwidth is often your real bottleneck. Profile for it. Don't just assume your algorithm is CPU-bound.

2. Fault tolerance is a first-class feature. Most record runs would not have finished without checkpoint-restart. Think about what your long-running jobs do when a node dies.

3. NUMA topology changes everything in parallel code. Thread affinity and memory locality matter more than raw core count on multi-socket systems.

4. SIMD is still a performance multiplier worth understanding. A 4× speedup from AVX2 or 8× from AVX-512 over scalar code is not unusual for data-parallel numerical code, since those registers hold four and eight 64-bit values respectively. Compilers help, but hand-tuning helps more.

5. Stress test with something that actually stresses. y-cruncher's simultaneous CPU + memory load finds problems that single-threaded or DRAM-only tests miss entirely.

Where to Go From Here

Download y-cruncher: numberworld.org/y-cruncher
GitHub mirror: Available for HTTPS downloads (linked from the main site)
Benchmarks leaderboard: Rankings from 25 million to 1 trillion digits are published on the site
Advanced docs: The site has deep documentation on multi-threading internals, memory allocation strategies, and swap mode configuration
Mersenneforum subforum: For discussion with the community doing record-level runs

The fact that a program that started as a high school project now drives the world's most demanding computational records, and teaches systems programmers lessons about memory bandwidth, fault tolerance, NUMA, and SIMD, is genuinely inspiring.

Go run it. Break your overclock. Then fix it and run it again.

DEV Community