The interesting part of Box2D v3 isn't the cache. It's the graph coloring.

#programming #opensource

Erin Catto's engines have always had a reputation for being readable. Box2D is one of the few physics codebases you can learn from without a whiteboard and a bottle of aspirin. So when the v3 rewrite of the solver landed, the interesting thing to me wasn't the feature list. It was that Catto solved a problem most parallel-solver writeups quietly skip: how do you run a Gauss-Seidel iterative solver on many threads when the whole point of Gauss-Seidel is that each constraint reads the updated velocities the previous constraint just wrote?

The consensus take on Box2D these days is "cool, it got a big performance rewrite, and there's finally momentum around open-source physics." Fine. The second most common conversation — "data-oriented design, structure-of-arrays beats pointer chasing, the cache is the bottleneck" — is a Mike Acton talk from 2014 that's been re-blogged ten thousand times. Yes, Box2D v3 packs its solve-phase data into tight arrays and uses integer indices instead of pointers. If you haven't internalized that, go internalize it. It's table stakes now, not an insight.

The genuinely hard, genuinely non-obvious part is what happens after you've made your data cache-friendly and now want threads. And it's worth saying up front: this is a 2D engine. The technique below generalizes to 3D solvers — the constraint graph doesn't care about dimensionality — but I'm describing what's actually in the Box2D v3 source, not extrapolating about a port I haven't read.

The dependency you can't just ignore

A rigid-body solver does something like this per iteration:

for each contact constraint:
    read velocities of bodyA and bodyB
    compute impulse
    write updated velocities back to bodyA and bodyB

The catch is that a single body is usually in many contacts. A box resting on the ground with three neighbors leaning on it might appear in a dozen constraints. If you naively split the constraint array across four threads, two threads will happily read-modify-write the same body's velocity at the same time. Best case you get a torn update and jittery, non-deterministic simulation. Worst case it explodes.

The obvious fixes are all bad:

Locks per body. Now your inner loop — which is supposed to be a tight block of arithmetic — takes a mutex twice per constraint. The contention on hot bodies (the ground, a wall) serializes you back to single-threaded, plus overhead.
Atomics. Same story, and floating-point atomic add isn't free or even universally available in the form you want.
Jacobi instead of Gauss-Seidel. Read all velocities from the start-of-iteration snapshot, accumulate deltas, apply at the end. Embarrassingly parallel. Also converges noticeably slower, so you burn the thread savings on extra iterations and the stacks feel spongy.

So you're stuck: the algorithm that converges well is inherently sequential, and the parallel version of it converges badly.

Coloring your way out

Box2D v3's answer — and this is the part worth reading the source for — is graph coloring. Think of the constraint graph: bodies are nodes, each contact is an edge. Two constraints conflict only if they share a body. If they don't share a body, they touch completely disjoint velocity data, and you can solve them simultaneously with zero synchronization.

That's exactly a graph-coloring problem. Assign each constraint a color such that no two constraints of the same color share a body. Every constraint within one color is guaranteed conflict-free, so you can hand a whole color to a thread pool and let it rip — no locks, no atomics — then move to the next color.

for each color:            // sequential
    parallel-for each constraint in this color:
        solve it           // no two of these touch the same body

You still get true Gauss-Seidel behavior across colors — color N sees the velocity updates color N-1 produced. You get full parallelism within a color. You've turned "sequential because of data dependencies" into "sequential only at the granularity of colors," and colors are few.

The build uses a greedy coloring: walk the constraints, drop each into the lowest-numbered color whose bodies it doesn't already collide with. Greedy coloring is not optimal — finding the minimum number of colors is NP-hard — but you don't need optimal. You need few, and greedy on these sparse contact graphs gives you few. Box2D caps the color count and dumps any leftover constraints into an overflow bucket solved on a single thread. That overflow set is small enough in practice that the single-threaded tail doesn't dominate.

Two things about this that took me a second to appreciate:

Coloring is a per-step cost you pay anyway. Contacts change every step — bodies move, some pairs separate, new ones touch. So the coloring has to be rebuilt or incrementally maintained each step. That's overhead the single-threaded solver never pays. The design bet is that the coloring cost is cheaper than what you save by going wide. On a scene with enough bodies to matter, it is. On a scene with 30 bodies, coloring is pure loss — which is why this is a scaling technique, not a free win.

Coloring and cache locality reinforce each other. Here's the part that surprised me. Once you group constraints by color, you can lay out each color's constraint data contiguously and stream it. The parallelism partition also happens to be a locality-friendly ordering. This is the actual reason "readable" and "fast" aren't at war in this codebase: the same decomposition that unlocks lock-free threading also gives each thread a clean sequential run over its slice. One idea, two payoffs.

Why this beats the fix you'd reach for

If you've written a parallel solver before, your instinct was probably atomics or a Jacobi relaxation, because those are the tricks that show up first when you search for "parallel constraint solver." Coloring is less obvious because it front-loads the cleverness: you spend effort organizing the work so the hot loop can be dumb and fast, instead of making the hot loop smart enough to be safe.

That's a general shape worth stealing. When you have a loop that's hard to parallelize because iterations touch shared state, the reflex is to protect the shared state. The better question is often: can I partition the iterations so that within a partition, nothing is shared at all? If you can, the partition is free of synchronization by construction, and you only pay coordination cost at the seams between partitions.

I've used the same trick well outside physics:

Spatial systems. Bucket entities by grid cell; entities in non-adjacent cells never interact this frame, so you can update them in parallel with no locks. That's graph coloring on a spatial graph, whether or not you call it that.
Batched graph mutations. If you're applying a queue of edits to a shared structure, group the edits by which region they touch. Disjoint regions go concurrent.

The point isn't "use graph coloring." It's that the hardest parallelism problems are usually scheduling problems in disguise. You get further by proving two pieces of work are independent than by making dependent work thread-safe.

What to actually take away

Don't cargo-cult the coloring into a 40-body scene and wonder why it's slower — the setup cost only amortizes when you have enough constraints to fan out across cores, and only when a meaningful fraction of them land in the parallel colors rather than the overflow bucket. If you want to know whether it's paying off in your own code, the honest measurement isn't wall-clock alone; it's whether your threads are actually running lock-free work or stalling on a shared structure. perf stat won't show you a lock you didn't take, but it'll show you the IPC drop when threads fight over a cache line, and a contention profiler will show you the mutex you should have designed away.

And read the Box2D v3 solver source before you write your own — 2D or 3D, the shape of the problem is the same. The data layout is the part everyone copies. The coloring is the part everyone reinvents badly — usually by reaching for atomics first and discovering, three days in, that the real problem was never how to make the write safe. It was how to arrange the work so the write was never contended in the first place.