Five Languages, One Loop: A Cross-Language Concurrency Benchmark - the Beginning

#performance #development #software #python

Every concurrency tutorial shows you how to go parallel. Almost none of them show you what it costs to go parallel.

That's not an oversight - it's a measurement problem. Reach for the usual benchmark and you'll pick a workload heavy enough to be worth parallelizing: parse the big file, crunch the large dataset, render the frame. And the moment you do, the interesting part vanishes. When each task takes a long time, the cost of starting that task - spawning the thread, queuing the job, waking a worker - rounds to nothing. You end up measuring the work, not the machinery. The scheduler does its job invisibly, and you walk away believing all roads to parallelism lead to roughly the same place.

They don't.

To see the machinery, you have to starve it. Shrink each task until the work it does is smaller than the work it takes to dispatch it. Then run a flood of those tiny tasks through every concurrency primitive the language offers - raw OS threads, a work-stealing pool, a classic thread pool, an async runtime - and time them not once, but enough times to see the shape of each one: the typical run, the bad run, the worst run that only shows up once in a thousand. At that scale, the overhead stops hiding. It becomes the entire signal.

So that's what I did - starting with Rust, because it's the clean room. No garbage collector pausing mid-measurement, no interpreter between you and the metal: just the runtime, the scheduler, and the bare cost of asking a core to do something. It's the right place to establish a baseline before the later experiments bring GC and interpreted runtimes into the picture, where this same test gets a lot more interesting.

I expected the pooled approaches to win and the naive ones to lose. That part held. What I didn't expect was that one of the parallel strategies would lose to not being parallel at all - that for small enough work, the fastest way to run a hundred tasks is to refuse to spread them out. There's a crossover point hidden in here, a line where "just add threads" flips from speedup to self-sabotage, and most code I've seen in the wild sits closer to that line than its authors think.

Here's where it is, and here's the runtime model that explains why.

The Workload, and What It Actually Measures

The task is one fixed quantum of arithmetic: ten thousand terms of the Leibniz series for $π$, a tight floating-point loop with a loop-carried dependency so nothing in it can be vectorized away or reordered into nothing. It runs in a few microseconds. That number is the whole point. It's small enough that the cost of handing the loop to a thread is on the same order as running the loop - which is exactly the regime where dispatch becomes visible.
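A minimal sketch of such a task in Rust might look like the following. The function name `leibniz_pi` and the one-off timing in `main` are illustrative assumptions, not taken from the benchmark's own source; `std::hint::black_box` is used only to keep the compiler from folding the call away.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Sum the first `terms` terms of the Leibniz series
/// pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...
/// Every iteration adds into `sum`, so each step depends on the previous one.
fn leibniz_pi(terms: u32) -> f64 {
    let mut sum = 0.0_f64;
    let mut sign = 1.0_f64;
    for k in 0..terms {
        sum += sign / (2.0 * f64::from(k) + 1.0);
        sign = -sign;
    }
    4.0 * sum
}

fn main() {
    // Time a single task in isolation (hypothetical harness, one sample only).
    let start = Instant::now();
    let pi = black_box(leibniz_pi(black_box(10_000)));
    let elapsed = start.elapsed();
    println!("pi ~= {pi:.6}, one task took {elapsed:?}");
}
```

A single timing like this is noisy; it is only meant to show the size of the unit of work, not to stand in for the measurements.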

So read this for what it is: not a CPU benchmark. The $π$ loop is a costume. Nothing here is trying to find out which runtime divides floats fastest - at this size the divider is barely warm. What's being measured is everything around the loop: the cost of spawning a thread, handing a job to a queue, stealing work off another core's deque, waking a parked worker, hopping a result across an isolate boundary. The arithmetic is just a known, fixed amount of work that forces the machinery to bring back a result.

This only means anything in the regime where dispatch costs more than the work. Scale the loop up - a hundred thousand terms, a million - and every gap in this whole series closes. The pooled variants, the naive variants, the serial baseline all converge, because once each task is expensive enough, who cares what it cost to start. That convergence isn't a flaw in the test; it's the boundary of where the test has anything to say. Everything below is true inside that boundary and false outside it, and that's stated up front so no one has to discover it the hard way.
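One rough way to picture this boundary, as a back-of-the-envelope model rather than anything measured: let $d$ be the per-task dispatch cost and $w$ the per-task work (both symbols are introduced here purely for illustration). The share of each task's cost that is dispatch is then

$$
\frac{d}{d + w} \;\longrightarrow\; 0 \quad \text{as } w \to \infty \text{ with } d \text{ fixed.}
$$

With $w$ comparable to $d$, dispatch is around half of what gets timed; with $w$ a hundred times larger than $d$, it drops to about one percent, and the variants become hard to tell apart.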

Each run triggers a hundred of these tasks through one concurrency primitive and waits for all hundred to finish. That full fan-out is one sample. Every variant gets one warm-up fan-out - to prime instruction caches, branch predictors, and, where it applies, to force a thread pool to actually build itself before it's on the clock - and then a thousand timed samples.
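The sampling procedure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `time_fan_outs`, `tiny_task`, and the constants are hypothetical names, and the fan-out in `main` is a serial placeholder standing in for whichever concurrency primitive is under test.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

const TASKS: usize = 100; // tasks per fan-out
const SAMPLES: usize = 1_000; // timed fan-outs per variant

/// Stand-in for the tiny task: a short floating-point loop.
fn tiny_task() -> f64 {
    let terms = black_box(10_000u32);
    let mut sum = 0.0_f64;
    let mut sign = 1.0_f64;
    for k in 0..terms {
        sum += sign / (2.0 * f64::from(k) + 1.0);
        sign = -sign;
    }
    4.0 * sum
}

/// Run one untimed warm-up fan-out, then time `samples` fan-outs.
/// `fan_out` must dispatch every task and return only once all have finished.
fn time_fan_outs(samples: usize, mut fan_out: impl FnMut()) -> Vec<Duration> {
    fan_out(); // warm-up: not recorded
    let mut timings = Vec::with_capacity(samples);
    for _ in 0..samples {
        let start = Instant::now();
        fan_out();
        timings.push(start.elapsed());
    }
    timings
}

fn main() {
    // Placeholder variant: run all tasks serially on the calling thread.
    let mut checksum = 0u64;
    let timings = time_fan_outs(SAMPLES, || {
        for _ in 0..TASKS {
            checksum ^= tiny_task().to_bits();
        }
    });
    println!(
        "{} timed samples, first = {:?}, checksum = {:#x}",
        timings.len(),
        timings[0],
        checksum
    );
}
```

Passing the fan-out as a closure keeps the timing loop identical across variants, so only the dispatch mechanism changes between runs.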

Reading the Numbers

A mean would be the wrong summary here. Dispatch cost isn't a single value; it's a distribution with a tail, and the tail is where the interesting failures live - the run that hit a scheduler hiccup, the one that caught a collector mid-sweep, the one that got bounced onto a slow core. So every variant is reported as three points on its distribution: p50 (the median - the typical run), p99 (the bad-but-not-rare run), and p999 (the worst run in roughly a thousand). With a thousand samples, p999 lands on rank ~998 of the sorted timings - it is a genuine one-in-a-thousand observation, not a synonym for "the slowest sample." When p999 pulls far away from p50, something intermittent is reaching into the measurement, and naming which something - a GC pause, a clock transition, a cold pool - is most of the work of the later articles.
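As a sketch of how these three points could be pulled out of a set of timings, the following uses the nearest-rank method on sorted samples. The choice of nearest-rank and the helper name `percentile` are assumptions made for illustration; the original does not say which percentile definition it uses.

```rust
use std::time::Duration;

/// Nearest-rank percentile on an ascending-sorted slice.
/// `per_mille` is the percentile in thousandths: 500 = p50, 990 = p99, 999 = p999.
/// Integer arithmetic avoids floating-point rounding when computing the rank.
fn percentile(sorted: &[Duration], per_mille: usize) -> Duration {
    assert!(!sorted.is_empty(), "need at least one sample");
    // 1-based rank = ceil(per_mille * n / 1000), clamped into the valid range.
    let rank = (per_mille * sorted.len() + 999) / 1000;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

fn main() {
    // Synthetic timings of 1 us, 2 us, ..., 1000 us, for demonstration only.
    let mut samples: Vec<Duration> = (1..=1000u64).map(Duration::from_micros).collect();
    samples.sort();

    println!("p50  = {:?}", percentile(&samples, 500));
    println!("p99  = {:?}", percentile(&samples, 990));
    println!("p999 = {:?}", percentile(&samples, 999));
}
```

Under this convention, p999 of a thousand samples is the element at zero-based index 998, which is the second-slowest sample rather than the maximum.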

One methodological note that matters more than it looks. A loop whose result is never used is a loop a good optimizer is entitled to delete, and in a `--release` build the Rust compiler will absolutely delete it - leaving you timing an empty fan-out and reporting glorious nonsense. Every variant here folds each task's result into a single global accumulator (an XOR of the raw float bits) that's printed once at the very end. The result escapes; the loop has to run. That same accumulator does a second job: it keeps every print out of the timed region. A per-task log line would serialize the parallel workers on the output lock and quietly convert the thing you're measuring into a benchmark of stdout. The work goes into the accumulator during timing; the accumulator goes to the terminal after.
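One way such an accumulator could be written is sketched below. It assumes an `AtomicU64` updated with `fetch_xor` and a fan-out over scoped OS threads; the names `ACC`, `task`, and `fan_out` are hypothetical, and the actual variants in the series may be structured differently.

```rust
use std::hint::black_box;
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Global accumulator: every task XORs the raw bits of its result into it.
static ACC: AtomicU64 = AtomicU64::new(0);

fn leibniz_pi(terms: u32) -> f64 {
    let mut sum = 0.0_f64;
    let mut sign = 1.0_f64;
    for k in 0..terms {
        sum += sign / (2.0 * f64::from(k) + 1.0);
        sign = -sign;
    }
    4.0 * sum
}

/// One task: compute, then fold the result into `acc`. No printing here.
fn task(acc: &AtomicU64) {
    let pi = leibniz_pi(black_box(10_000));
    acc.fetch_xor(pi.to_bits(), Ordering::Relaxed);
}

/// Fan out `tasks` tasks onto scoped OS threads; the scope joins them all
/// before returning.
fn fan_out(tasks: usize, acc: &AtomicU64) {
    thread::scope(|s| {
        for _ in 0..tasks {
            s.spawn(move || task(acc));
        }
    });
}

fn main() {
    fan_out(100, &ACC);
    // The only output happens after all the work is done.
    println!("accumulator: {:#018x}", ACC.load(Ordering::Relaxed));
}
```

Because every task in this sketch computes the same value, an even number of XORs cancels to zero. The printed number is not meaningful in itself; it only has to depend on every task's result so the compiler cannot discard the loop.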

What I Can't Control For

These numbers were taken on an Apple M2 Max, and Apple silicon gives you no CPU governor to hide behind. You can't pin the clocks, you can't disable boost, and you can't stop the scheduler from migrating a task between the eight performance cores and the four efficiency cores. A task that lands on an E-core, or a performance cluster that quietly down-clocks partway through a run, shows up as a p99 or p999 tail that came from the machine, not from the runtime.

On a Linux box it's possible to set the performance governor and switch off SMT to suppress exactly this; on this hardware you can't, so the honest move is to name it rather than pretend it's absent. It's another reason the full distribution earns its place over a single number: when the tail jumps and the median doesn't, that's usually the silicon moving, not the scheduler.

Test Environment

All measurements were taken on a single machine, cold-start: one warm-up run, then 1000 timed fan-outs per variant. No warm/cold matrix, no cache manipulation.

Component	Details
Machine	MacBook Pro (Mac14,6), Apple M2 Max
Cores	12 (8 performance + 4 efficiency)
Memory	32 GB
OS	macOS Tahoe 26.5.1 (Darwin kernel 25.5.0), arm64

Runtimes (exact, pinned)

Language	Version	Build / flags
Rust	rustc 1.92.0 (ded5c06cf 2025-12-08)	`--release`
Go	go1.26.4 darwin/arm64	default toolchain
Node.js	v22.17.1	worker_threads
Python	3.14t free-threaded, PGO+LTO	no-GIL build
Java	Oracle JDK 25.0.2+10-LTS, HotSpot	G1 GC, `-Xms2g -Xmx2g`

Next Step

We start with Rust because it strips the question down to its bones. Ahead-of-time compiled, no garbage collector, no interpreter loop - when a Rust number moves, nothing is mediating between the code and the scheduler, so the number is the dispatch cost and nothing else. That makes it the control for everything that follows: once you know what a hundred tiny tasks cost through raw threads, through rayon, through a thread pool, and through an async runtime on a machine with nothing in the way, you have the yardstick to read Go's scheduler, Java's JIT and its GC tail, Python's interpreter wall, and Node's isolate boundary against.

And it's where the surprise starts. Two of these strategies lose to a plain serial loop - and they lose for two completely different reasons. That's the next article.