SEN LLC

Posted on Apr 20

I Put the Same Mandelbrot Kernel on the CPU and the GPU — and Watched float32 Crack

#webgl #javascript #graphics #performance

One rendering loop, two implementations: vanilla JavaScript on the main thread, and the same loop in a GLSL fragment shader. At shallow zoom the GPU is 30–100× faster. Zoom past a certain scale and the GPU renders a visibly different, and wrong, image. This post is about finding that scale.

📦 GitHub: https://github.com/sen-ltd/webgl-mandelbrot
🧮 Demo: https://sen.ltd/portfolio/webgl-mandelbrot/

That screenshot is the demo's float32 breaks preset: one viewport, rendered twice at 512×512 with 500 iterations. The left panel is the JavaScript CPU renderer, the right is the WebGL GPU renderer, and they are running the identical escape-time kernel. The panels look different because one is working and the other isn't.

The setup

The Mandelbrot set is defined by iterating z ← z² + c starting from z = 0. If |z| stays bounded forever, c is in the set. In practice you bail out when |z|² > 4 (once |z| exceeds 2 the orbit provably escapes) or when you've hit the iteration limit. The iteration count at escape is what you colour.

The inner loop is tiny:

// JavaScript, one pixel
export function escapeCount(cx, cy, maxIter) {
  let x = 0, y = 0;
  let x2 = 0, y2 = 0;
  for (let i = 0; i < maxIter; i++) {
    if (x2 + y2 > 4) return i;
    y = 2 * x * y + cy;
    x = x2 - y2 + cx;
    x2 = x * x;
    y2 = y * y;
  }
  return maxIter;
}

And the GLSL version is the same loop, word for word:

// Fragment shader, one pixel per invocation, in parallel
float x = 0.0, y = 0.0;
float x2 = 0.0, y2 = 0.0;
int iter = MAX_ITER;
for (int i = 0; i < MAX_ITER; i++) {
  if (x2 + y2 > 4.0) { iter = i; break; }
  y = 2.0 * x * y + c.y;
  x = x2 - y2 + c.x;
  x2 = x * x;
  y2 = y * y;
}

One gotcha: GLSL ES 1.00 requires for loops to have a constant upper bound, so MAX_ITER is baked into the shader as a #define at compile time. Raise the max-iter slider in the UI and the renderer recompiles the shader. That's the only structural difference between the two implementations.
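
A minimal sketch of how that recompile step could be organised. The helper names buildFragmentSource and createShaderCache, the uResolution uniform, and the greyscale colouring are assumptions made for illustration, not the demo's actual code. The compile step is passed in as a callback so the cache logic stays independent of WebGL.

```javascript
// Hypothetical helper: bakes maxIter into the shader source as a #define,
// because GLSL ES 1.00 needs a compile-time constant loop bound.
function buildFragmentSource(maxIter) {
  if (!Number.isInteger(maxIter) || maxIter < 1) {
    throw new RangeError("maxIter must be a positive integer");
  }
  return [
    "precision highp float;",
    `#define MAX_ITER ${maxIter}`,
    "uniform vec2 uCenter;",
    "uniform float uScale;",
    "uniform vec2 uResolution; // assumed uniform: canvas size in pixels",
    "void main() {",
    "  vec2 uv = gl_FragCoord.xy / uResolution - 0.5;",
    "  float aspect = uResolution.x / uResolution.y;",
    "  vec2 c = uCenter + vec2(uv.x * uScale * aspect, uv.y * uScale);",
    "  float x = 0.0, y = 0.0, x2 = 0.0, y2 = 0.0;",
    "  int iter = MAX_ITER;",
    "  for (int i = 0; i < MAX_ITER; i++) {",
    "    if (x2 + y2 > 4.0) { iter = i; break; }",
    "    y = 2.0 * x * y + c.y;",
    "    x = x2 - y2 + c.x;",
    "    x2 = x * x;",
    "    y2 = y * y;",
    "  }",
    "  float t = float(iter) / float(MAX_ITER);",
    "  gl_FragColor = vec4(vec3(t), 1.0); // placeholder greyscale colouring",
    "}",
  ].join("\n");
}

// Hypothetical cache: compiles once per distinct maxIter, so dragging the
// slider back to an earlier value does not trigger another compile.
function createShaderCache(compile) {
  const cache = new Map();
  return (maxIter) => {
    if (!cache.has(maxIter)) {
      cache.set(maxIter, compile(buildFragmentSource(maxIter)));
    }
    return cache.get(maxIter);
  };
}
```

In a real renderer the compile callback would wrap gl.createShader, gl.shaderSource, gl.compileShader and the program link.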

The speed gap

On my laptop, a 512×512 render at 500 iterations zoomed into the seahorse valley looks like this:

Renderer	Time (ms)	Relative speed
CPU	~230	1×
GPU	~2.5	~90×

Not surprising: the CPU is one thread doing 262 144 pixels × up to 500 iterations in series. The GPU is doing them all at once, 32 or 64 at a time depending on the warp/wavefront width, across however many compute units the driver decides to give the fragment shader.

One thing worth spelling out: measuring GPU time honestly in WebGL is awkward. gl.drawArrays is asynchronous; the call returning doesn't mean anything has rendered yet. gl.finish() is unreliable across drivers. What does work is issuing a gl.readPixels(0, 0, 1, 1, …) after the draw, which forces the pipeline to flush because the pixel can't come back until the frame is done. The demo does this ten times and divides by ten. The number you see is honest wall-clock time from "call render" to "pixel is ready to read".
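
The timing approach can be sketched roughly like this. The function name timeGpuRender and its parameters are hypothetical, and reading back after every draw (rather than once at the end) is one plausible reading of "does this ten times", not a confirmed detail of the demo.

```javascript
// Hypothetical helper. `gl` is a WebGLRenderingContext, `draw` issues the
// draw call (e.g. gl.drawArrays), `now` is injectable to keep it testable.
function timeGpuRender(gl, draw, frames = 10, now = () => performance.now()) {
  const pixel = new Uint8Array(4); // room for one RGBA pixel
  const start = now();
  for (let i = 0; i < frames; i++) {
    draw();
    // Reading one pixel back cannot complete until the frame is rendered,
    // so this call blocks until the GPU has finished the draw above.
    gl.readPixels(0, 0, 1, 1, gl.RGBA, gl.UNSIGNED_BYTE, pixel);
  }
  return (now() - start) / frames; // average milliseconds per frame
}
```

The result includes the cost of the one-pixel readback itself, which is small next to a full-frame Mandelbrot pass but not zero.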

Where it gets interesting

The Mandelbrot set is self-similar. You can zoom in forever, and at every scale you find new filaments, new spirals, new miniature copies of the full set. Zooming is just shrinking the viewport's scale parameter — the part that maps pixel coordinates into the complex plane:

vec2 c = uCenter + vec2(uv.x * uScale * aspect, uv.y * uScale);

Here uv.x is in [-0.5, 0.5], uScale is the complex-plane height of the view, and aspect corrects for a non-square canvas. To zoom in a thousand times, you reduce uScale a thousand times.
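
For comparison, here is a sketch of the same mapping on the JavaScript side. The function name pixelToComplex, the half-pixel centre offset, and the { re, im } shape are assumptions for illustration; the demo's CPU path may differ in those details.

```javascript
// Hypothetical CPU-side equivalent of the shader's pixel-to-plane mapping.
// center = { re, im } is the view centre; scale is the view's height in the
// complex plane, matching uScale in the shader.
function pixelToComplex(px, py, width, height, center, scale) {
  const aspect = width / height;
  const u = (px + 0.5) / width - 0.5;  // in [-0.5, 0.5], sampled at pixel centre
  const v = (py + 0.5) / height - 0.5;
  return {
    re: center.re + u * scale * aspect,
    im: center.im + v * scale,
  };
}

// Neighbouring pixels are scale / height apart, so zooming in 1000x shrinks
// the per-pixel step by the same factor.
const seahorse = { re: -0.744, im: 0.132 };
const stepWide = pixelToComplex(1, 0, 512, 512, seahorse, 3).re -
                 pixelToComplex(0, 0, 512, 512, seahorse, 3).re;
const stepDeep = pixelToComplex(1, 0, 512, 512, seahorse, 0.003).re -
                 pixelToComplex(0, 0, 512, 512, seahorse, 0.003).re;
```

In JavaScript all of this runs in float64, which is why the CPU panel keeps working long after the shader version has run out of bits.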

Each pixel's c is a value like uCenter + 0.49 × uScale × aspect. When uScale gets very small, that's a tiny offset being added to a comparatively large number. And c is a float. On desktop GPUs, highp float in GLSL ES 1.00 is in practice IEEE 754 single precision (the spec itself only guarantees a weaker minimum): 1 sign bit, 8 exponent bits, 23 mantissa bits. Relative precision is 2⁻²³ ≈ 1.19 × 10⁻⁷.

In the seahorse valley the centre is around (-0.744, 0.132), so the absolute precision of uCenter + … is about 0.75 × 1.19 × 10⁻⁷ ≈ 9 × 10⁻⁸. The moment the per-pixel distance uScale / canvasHeight falls below that, adjacent pixels round to the same complex number. Many neighbours — sometimes a dozen — compute identical iteration counts. The image stops being a fractal and becomes a block mosaic.

This is the screenshot above. At uScale = 5 × 10⁻⁶ on a 512-pixel canvas, pixel spacing is ~10⁻⁸, which is an order of magnitude below float32 resolution. The CPU does the same math with doubles (52 mantissa bits, ~2.2 × 10⁻¹⁶ relative precision) and doesn't notice. The GPU gives you a picture like it's been through a JPEG set to quality 0.
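
The collapse can be reproduced without a GPU. The sketch below uses Math.fround to round every intermediate to float32 and counts how many distinct real coordinates survive across one 512-pixel row. The function name distinctColumns is hypothetical, and fround is only an approximation of what a given GPU does (drivers may fuse or reorder operations).

```javascript
// Counts distinct c.x values across one row of pixels. `round` is applied to
// every intermediate: pass Math.fround to emulate float32, or the identity
// function to keep JavaScript's native float64.
function distinctColumns(centerX, scale, width, round) {
  const c0 = round(centerX);
  const s = round(scale);
  const seen = new Set();
  for (let px = 0; px < width; px++) {
    const u = round((px + 0.5) / width - 0.5);
    seen.add(round(c0 + round(u * s)));
  }
  return seen.size;
}

const asFloat64 = (v) => v;
const asFloat32 = Math.fround;

const shallow32 = distinctColumns(-0.744, 3, 512, asFloat32);
const deep64 = distinctColumns(-0.744, 5e-6, 512, asFloat64);
const deep32 = distinctColumns(-0.744, 5e-6, 512, asFloat32);
console.log({ shallow32, deep64, deep32 });
```

At uScale = 3 every column keeps its own float32 value. At uScale = 5 × 10⁻⁶ the float64 run still sees 512 distinct columns, while the float32 run should keep only about 5 × 10⁻⁶ / 6 × 10⁻⁸ ≈ 84 of them, so runs of neighbouring pixels share one c.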

"Just use more precision"

Not an option, at least not cheaply. GLSL ES 1.00 has no double. WebGL 2 / GLSL ES 3.00 still doesn't have one in the core spec; doubles are core only in desktop OpenGL 4.0+, or available on older desktop versions through the GL_ARB_gpu_shader_fp64 extension. For anything targeting a browser, you're stuck with float.

The two workarounds the deep-zoom Mandelbrot community actually uses:

Double-single / float-float arithmetic. Represent each coordinate as a pair of floats (hi, lo) that sum to the value you wanted. Addition and multiplication are expanded into Knuth-style compensated sequences. You pay ~2–4× the ALU cost and roughly double the bandwidth, but you buy ~14 decimal digits of precision. The shader becomes significantly more complex, and -ffast-math-style optimisations will happily break your compensated arithmetic if you let them.
Perturbation methods. Pick a reference pixel, compute its orbit in double once on the CPU. Every other pixel computes its orbit as a small float delta from the reference. Because the deltas are near zero, the relative precision of float now applies to distances of ~10⁻¹⁴ rather than to coordinates of ~1.0. There are known numerical landmines around glitch detection — the reference orbit can diverge from the neighbourhood and the deltas blow up — and patching them (Zhuoran's algorithm, BLA) is a paper's worth of work. Production fractal software like Kalles Fraktaler does this.
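
To make the first workaround concrete, here is a minimal float-float addition sketch in JavaScript, with Math.fround standing in for float32 hardware. The names twoSum, ffFromDouble, ffAdd and ffToDouble are illustrative. Only addition is shown; multiplication additionally needs a splitting step or an FMA and is omitted. This is not code from the demo.

```javascript
const f = Math.fround; // rounds a double to the nearest float32

// Knuth's TwoSum: s is the rounded float32 sum, e the exact rounding error,
// so that a + b === s + e holds exactly.
function twoSum(a, b) {
  const s = f(a + b);
  const bb = f(s - a);
  const e = f(f(a - f(s - bb)) + f(b - bb));
  return [s, e];
}

// Split a double into a (hi, lo) pair of float32 values.
function ffFromDouble(x) {
  const hi = f(x);
  return [hi, f(x - hi)];
}

// Float-float addition (the simple "sloppy" variant), then renormalise so
// that hi carries the leading bits and lo the remainder.
function ffAdd([ahi, alo], [bhi, blo]) {
  const [s, e] = twoSum(ahi, bhi);
  const t = f(e + f(alo + blo));
  const hi = f(s + t);
  const lo = f(t - f(hi - s));
  return [hi, lo];
}

const ffToDouble = ([hi, lo]) => hi + lo;

// Plain float32 loses a 1e-8 offset next to -0.744; the pair keeps it.
const plain = f(f(-0.744) + f(1e-8));
const paired = ffToDouble(ffAdd(ffFromDouble(-0.744), ffFromDouble(1e-8)));
```

In a shader the same sequences would be written with vec2 pairs, and they only survive if the compiler does not algebraically simplify expressions such as (s - a) - b away.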

Neither is going into this demo. The point of the demo is to see the wall, not to route around it.

The lesson that transfers

Mandelbrot is just the vehicle. The precision story is the lesson: if you compute uCenter + uv * uScale in any shader, and both uCenter and uScale * something_small are part of the value you care about, you have the same bug waiting for you at deep zoom. Tile rendering, Google Maps–style Mercator projection, deferred shading with cameras very far from the origin: anywhere the useful information is in the low bits of a difference between two larger numbers, single precision runs out in the same way.

The fix is usually the same shape: rearrange the math so you're adding small to small, not small to big. Translate the camera/coordinate origin into the neighbourhood first (in double on the CPU), then do everything else relative to that origin (in float on the GPU). That's the core idea behind perturbation Mandelbrot, and it's also why CAD software stores world coordinates in doubles and only sends float deltas to the vertex shader.
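
A small sketch of that rearrangement, again using Math.fround as a stand-in for the GPU's float32. The function names and the 1e7 world coordinate are made up for illustration.

```javascript
const f32 = Math.fround;

// Naive: send absolute coordinates to the "GPU" as float32, subtract there.
function naiveOffset(world, origin) {
  return f32(f32(world) - f32(origin));
}

// Rebased: subtract in float64 on the CPU, send only the small delta.
function rebasedOffset(world, origin) {
  return f32(world - origin);
}

// Near 1e7, adjacent float32 values are 1.0 apart, so a 0.25 offset vanishes
// unless the subtraction happens before the conversion to float32.
const origin = 1e7;
const point = 1e7 + 0.25;
console.log(naiveOffset(point, origin));   // 0
console.log(rebasedOffset(point, origin)); // 0.25
```

The same reasoning applies to the Mandelbrot view: the offset uv * uScale is perfectly representable in float32 on its own, and it only gets lost once it is added to uCenter.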

What else the demo does

Pan and zoom that stay locked between the CPU and GPU panels, so you can zoom and see exactly when they drift.
Six presets that walk through increasing zoom depth — the last is the one where the GPU cracks.
Max-iter slider that recompiles the shader on the fly (constant-bound loop, remember).
Honest GPU timing via readPixels flush, averaged over ten frames.

All in ~160 lines of JavaScript and ~45 lines of GLSL, with no bundler and no dependencies.

Try it

Click through the presets from left to right. Up through Spiral the two panels are indistinguishable. At Deep the GPU panel starts showing faint vertical banding. At float32 breaks the panel is a block mosaic. That is the moment you crossed float32's 24 significand bits (the 23 stored mantissa bits plus the implicit leading 1).

Part of the SEN portfolio — 200+ public builds.
