In the last post, I built a zero-copy GPU compute pipeline for live camera frames in React Native — IOSurface import, Dawn compute shaders, Skia Graphite rendering, 4K at 120fps. That post ended with buffer readback and histogram overlays working.
What I didn't cover was everything that broke between "it works" and "it actually works." The histogram flickered. The canvas overlay disappeared. A 33MB dead memcpy hiding in the camera delegate tanked performance by 5.5x. A confident theory about thread-local recorders turned out to be completely wrong.
This post covers the debugging journey from multi-pass rendering through to Apple Log — iOS 17's 10-bit log-encoded camera format, with LUT-based color grading running entirely on the GPU.
The 33MB/Frame Dead Memcpy
After multi-pass compute was working, I noticed something wrong: 22fps at 4K. The GPU profiler showed 2.88ms compute time — plenty of headroom at 120Hz. So where was the time going?
I added per-step timing instrumentation to the native pipeline:
#include <chrono>

auto t0 = std::chrono::high_resolution_clock::now();
// ... import IOSurface ...
auto t1 = std::chrono::high_resolution_clock::now();
// ... create bind groups ...
auto t2 = std::chrono::high_resolution_clock::now();
// ... dispatch compute ...

auto importMs = std::chrono::duration<double, std::milli>(t1 - t0).count();
auto bindMs   = std::chrono::duration<double, std::milli>(t2 - t1).count();
The instrumentation showed the problem immediately: the camera delegate's captureOutput:didOutputSampleBuffer:fromConnection: was copying the entire pixel buffer before passing it to the pipeline. At 4K BGRA, that's 3840 × 2160 × 4 = ~33MB of memcpy per frame. For no reason — the pipeline imports the IOSurface directly from the CVPixelBuffer. The copy was dead code left over from an earlier approach.
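The back-of-envelope arithmetic makes the cost concrete (my numbers, not the original profiler's):

```cpp
#include <cstdint>

// Cost of the dead copy: one 4K BGRA frame is width * height * 4 bytes,
// and at 120fps the memcpy alone pushes nearly 4 GB/s through the memory bus.
constexpr std::uint64_t frameBytes(std::uint64_t w, std::uint64_t h,
                                   std::uint64_t bytesPerPixel) {
    return w * h * bytesPerPixel;
}

constexpr std::uint64_t kFrame4K = frameBytes(3840, 2160, 4);  // 33,177,600 B ≈ 33 MB
constexpr std::uint64_t kPerSecondAt120 = kFrame4K * 120;      // ≈ 3.98 GB/s of pure memcpy
```

At that rate the copy is no longer noise: it is a meaningful fraction of the memory bandwidth the GPU needs for the actual compute work.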
Removing one line restored 120fps. The timing metrics became a permanent part of the pipeline, exposed to JS via JSI:
[FPS] 120fps | lock=0.02ms import=0.31ms bind=0.08ms compute=2.88ms buf=0.01ms skImg=0.12ms
Lesson: Before debugging the GPU, profile the CPU. The most expensive operation in the pipeline was happening before the GPU was even involved.
The Histogram Flicker
With performance fixed, the next problem was visual: the histogram overlay flickered. Every few frames, it would vanish for one frame and reappear.
The setup: a compute shader runs a histogram pass, writing both to the output texture (passthrough) and to a storage buffer (bin counts). A Skia canvas draws the histogram bars directly onto the compute output texture via WrapBackendTexture. The composited result — video frame with histogram overlay — becomes the final SkImage.
The flicker happened because the compute pipeline and Skia canvas were sharing the same GPU texture. Here's the race:
Frame N:
  Camera thread: processFrame() dispatches compute → writes to texA
  Camera thread: Skia canvas draws histogram bars on texA
  Camera thread: MakeImageFromTexture(texA) → outputImage
  UI thread: reads outputImage, renders to screen ✓

Frame N+1:
  Camera thread: processFrame() dispatches compute → writes to texB
  UI thread: still rendering frame N's outputImage (backed by texA) ✓
  Camera thread: next processFrame() dispatches compute → writes to texA ← BOOM
  UI thread: texA just got overwritten mid-render
The UI thread was reading a texture that the camera thread was simultaneously overwriting with the next frame's compute output. With ping-pong textures, the overwrite happens every other frame — hence the flicker, not a solid corruption.
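The hazard condition is worth writing down explicitly (an illustrative check, not pipeline code): the camera writes texture c % 2 for frame c, so the write lands on the texture the UI is still reading exactly when the two frame numbers share parity.

```cpp
// Ping-pong hazard check (illustrative): the camera is processing frame
// `cameraFrame` while the UI is still reading frame `displayedFrame`'s
// texture. With two textures, the write collides with the displayed
// texture whenever the camera has pulled an even number of frames ahead.
// With a typical one-frame publish lag, that is every other frame,
// which is why the overlay flickered rather than corrupting constantly.
constexpr bool overwritesDisplayedTexture(unsigned cameraFrame,
                                          unsigned displayedFrame) {
    return cameraFrame > displayedFrame &&
           (cameraFrame % 2) == (displayedFrame % 2);
}
```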
Fix: Separate Canvas Texture + Split Mutex
The fix had two parts:
1. Dedicated canvas texture. Instead of drawing Skia overlays directly onto the compute output texture, copy the compute result to a separate canvasTex first, then draw overlays on canvasTex. The compute pipeline never touches canvasTex, so the UI thread can safely read it.
// Copy compute output to canvas texture (GPU-side, fast)
encoder.CopyTextureToTexture(&src, &dst, &copySize);
// Now draw Skia overlays on the stable copy
auto surface = SkSurfaces::WrapBackendTexture(recorder, canvasBackendTex, ...);
// ... draw histogram bars ...
surface->flush();
2. Split the mutex. The original implementation held a single mutex for the entire processFrame() — import, bind group creation, compute dispatch, buffer readback, SkImage creation. That's ~3ms of lock hold time, during which the UI thread blocks on every nextImage() call.
The fix: run all GPU work unlocked, then take the lock only to publish results:
void processFrame(CVPixelBufferRef pixelBuffer) {
  // All GPU work runs WITHOUT holding the lock
  auto ioSurface = CVPixelBufferGetIOSurface(pixelBuffer);
  // ... import, bind groups, compute dispatch, buffer copy ...
  // ... canvas draw, MakeImageFromTexture ...

  // Brief lock to swap published state (~0.05ms)
  {
    std::lock_guard<std::mutex> lock(_mutex);
    _impl->outputImage = newImage;
    _impl->generation++;
  }
}
Lock hold time went from ~3ms to ~0.05ms. The UI thread no longer stalls waiting for compute to finish.
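Isolated from the pipeline, the shape of that fix looks like this (a sketch: `FramePublisher`, `SkImageStub`, and `latest()` are illustrative names, not the actual API):

```cpp
#include <cstdint>
#include <memory>
#include <mutex>
#include <utility>

struct SkImageStub {};  // stand-in for sk_sp<SkImage>

// All GPU work happens outside the lock; only the final pointer swap is
// guarded. The consumer copies the shared_ptr under the same brief lock,
// so the UI thread never waits on compute.
class FramePublisher {
 public:
  // Camera thread: called after all GPU work has produced `image`.
  void publish(std::shared_ptr<SkImageStub> image) {
    std::lock_guard<std::mutex> lock(mutex_);
    image_ = std::move(image);
    ++generation_;
  }

  // UI thread: brief lock, copy out the published state, return.
  std::pair<std::shared_ptr<SkImageStub>, std::uint64_t> latest() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return {image_, generation_};
  }

 private:
  mutable std::mutex mutex_;
  std::shared_ptr<SkImageStub> image_;
  std::uint64_t generation_ = 0;
};
```

The generation counter doubles as a cheap "is there a new frame?" check: the UI side can skip re-wrapping the image when the generation has not advanced.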
Lesson: GPU pipeline debugging is hard because the symptoms (visual flicker) don't obviously point to the cause (shared texture + lock contention). Timing instrumentation and understanding the frame lifecycle are essential.
The Theory That Was Wrong
During the multi-pass rewrite, rendering broke completely — solid nothing on screen. The SkImage existed (non-null), had the right dimensions, but rendered as empty.
I had a confident theory: Skia Graphite uses thread-local Recorder objects for GPU command recording. processFrame() runs on the camera thread. The Skia <Canvas> renders on the Reanimated UI thread. If the SkImage was created on the camera thread's recorder, maybe the UI thread's recorder couldn't use it.
This led to an elaborate deferred-creation system:
// processFrame() — camera thread
_impl->imageDirty = true;  // mark for lazy recreation

// getOutputSkImage() — UI thread
if (_impl->imageDirty) {
  _impl->outputImage = ctx.MakeImageFromTexture(
      *finalTex, width, height, format);
  _impl->imageDirty = false;
}
return _impl->outputImage;
The theory sounded right. Thread-local recorders are a real Graphite concept. The code was clean. It compiled. It produced frames.
The frames were empty.
I spent time adding synchronization, trying different creation points, ensuring the texture was valid. Nothing worked. Then I looked at the single-pass version — the one that was working perfectly — and realized it created the SkImage directly in processFrame(), on the camera thread. The exact thing my theory said shouldn't work.
I reverted to the simple approach: MakeImageFromTexture() in processFrame(), return the cached image in getOutputSkImage(). Rendering worked immediately.
The thread-local recorder theory was either wrong or not the actual problem. I documented this in the architecture decisions log:
AD-1: SkImage creation in processFrame (camera thread)
We initially believed Skia Graphite's thread-local Recorder would prevent an SkImage created on the camera thread from rendering on the UI thread. The deferred approach produced frames but they rendered as nothing. Reverting to create the SkImage directly in processFrame() — matching the working single-pass implementation — fixed rendering immediately.
Lesson: Confident theories about GPU threading are still just theories. When debugging, always have a known-working reference to diff against. I'd made two changes simultaneously (deferred SkImage creation + RGBA view format override), couldn't tell which one caused the problem, and the one I was most confident about wasn't the fix.
Obtaining a Graphite Recorder (The 1×1 Hack)
For the canvas overlay feature, I needed a skgpu::graphite::Recorder* to call WrapBackendTexture. But DawnContext::getRecorder() is private in react-native-skia. getWGPUDevice() and getWGPUInstance() are public — the recorder just wasn't exposed.
The workaround: create a 1×1 throwaway offscreen surface and extract its recorder:
auto tempSurface = ctx.MakeOffscreen(1, 1);
auto* recorder = tempSurface->recorder();
This works because the recorder is a thread-local singleton — all surfaces on the same thread share the same instance. The 1×1 surface is created once during setup().
I filed PR #3751 to make getRecorder() public, consistent with the other Dawn accessors. Once merged, the hack can be replaced with ctx.getRecorder().
Depth Estimation: ML on the Same GPU
Before diving into Apple Log, I took a detour: running a depth estimation model on camera frames using the same Dawn GPU device.
Transformers.js (Hugging Face's JS ML library) supports WebGPU as an execution backend. Since Skia Graphite already provides navigator.gpu via its Dawn JSI bridge, the model can run on the same GPU context — no separate GPU initialization, no data copies between contexts.
The demo runs Depth Anything V2 Small on camera frames. The interesting part isn't the ML itself — it's the plumbing. Transformers.js expects browser APIs (Canvas, Blob, Image) that don't exist in React Native. The workaround: take the model's raw output (Float32Array of depth values), normalize to grayscale bytes, and construct an SkImage directly:
// Raw depth values are arbitrary floats — find the range first
let min = Infinity, max = -Infinity;
for (let i = 0; i < pixelCount; i++) {
  if (depthData[i] < min) min = depthData[i];
  if (depthData[i] > max) max = depthData[i];
}
const scale = 255 / (max - min || 1);

const rgba = new Uint8Array(pixelCount * 4);
for (let i = 0; i < pixelCount; i++) {
  const v = Math.round((depthData[i] - min) * scale); // normalize to 0–255
  rgba[i * 4 + 0] = v;
  rgba[i * 4 + 1] = v;
  rgba[i * 4 + 2] = v;
  rgba[i * 4 + 3] = 255;
}
return Skia.Image.MakeImage(info, Skia.Data.fromBytes(rgba), width * 4);
This is the "don't do this in production" pixel copy approach from Part 1 — acceptable for a demo that updates a few times per second, but not for 120fps video. The real solution is ONNX Runtime's WebGPU execution provider, which accepts an external Dawn device via string-encoded pointers and keeps everything GPU-side. That's next.
Apple Log: 10-Bit HDR Camera Frames
Now the main event. Everything above was the prologue — fixing bugs, adding features, and building the infrastructure that makes Apple Log possible.
What is Apple Log?
Standard iPhone footage is "finished" by the ISP before it reaches your app — tone mapped, color graded, dynamic range compressed. Apple Log is the raw-ish alternative: a flat, log-encoded format with ~15 stops of dynamic range. Highlights that would be clipped are preserved. Shadows that would be crushed are recoverable. The footage looks awful until you apply a LUT (Look-Up Table) to map it to a displayable color space.
Professional video apps have supported this since iOS 17. I wanted it in the WebGPU compute pipeline — capture in Apple Log, convert YUV→RGB on the GPU, apply a .cube LUT, render through Skia Graphite. All zero-copy.
The Problem: 10-Bit Bi-Planar YUV
Standard frames come as kCVPixelFormatType_32BGRA — four bytes per pixel, one plane, simple. Apple Log frames come as kCVPixelFormatType_420YpCbCr10BiPlanarVideoRange:
- 420: Chroma subsampled — UV is half resolution in each dimension
- YpCbCr: Luminance (Y) and chrominance (CbCr) carried separately
- 10: 10 bits per component, padded into 16-bit containers
- BiPlanar: Y and UV live in two separate memory planes
- VideoRange: Values don't span the full 0–1023 (Y: 64–940, UV: 64–960)
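Those range bounds become the normalization constants the conversion shader uses later. A quick sanity check (this assumes 10-bit code values normalize as value / 1023; the actual R16Unorm plane views normalize by 65535 with the 10-bit value in the high bits, which lands within a fraction of a percent of these):

```cpp
#include <cmath>

// 10-bit video-range bounds, normalized as code / 1023.
constexpr double kYMin  = 64.0  / 1023.0;  // ≈ 0.06256
constexpr double kYMax  = 940.0 / 1023.0;  // ≈ 0.91887
constexpr double kUVMax = 960.0 / 1023.0;  // ≈ 0.93842

// Expand a video-range value to the full [0, 1] range.
constexpr double expandVideoRange(double v, double lo, double hi) {
    return (v - lo) / (hi - lo);
}
```

Black (code 64) expands to 0.0 and reference white (code 940) to 1.0; values above reference white survive as > 1.0, which matters later for HDR.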
Dawn Multi-Planar Textures
Dawn supports this natively, but needs two device features enabled:
if (adapter.HasFeature(wgpu::FeatureName::DawnMultiPlanarFormats)) {
  features.push_back(wgpu::FeatureName::DawnMultiPlanarFormats);
}
if (adapter.HasFeature(wgpu::FeatureName::MultiPlanarFormatExtendedUsages)) {
  features.push_back(wgpu::FeatureName::MultiPlanarFormatExtendedUsages);
}
DawnMultiPlanarFormats enables the bi-planar texture format. MultiPlanarFormatExtendedUsages allows using it as a TextureBinding — without this, you can import the texture but can't sample it in a shader.
Importing the IOSurface
The IOSurface is imported once as a multi-planar texture, then split into per-plane views:
inputTexDesc.format = wgpu::TextureFormat::R10X6BG10X6Biplanar420Unorm;
inputTexture = sharedMemory.CreateTexture(&inputTexDesc);
// Y plane: full resolution, single channel
wgpu::TextureViewDescriptor yViewDesc{};
yViewDesc.aspect = wgpu::TextureAspect::Plane0Only;
yViewDesc.format = wgpu::TextureFormat::R16Unorm;
yPlaneView = inputTexture.CreateView(&yViewDesc);
// UV plane: half resolution, two channels
wgpu::TextureViewDescriptor uvViewDesc{};
uvViewDesc.aspect = wgpu::TextureAspect::Plane1Only;
uvViewDesc.format = wgpu::TextureFormat::RG16Unorm;
uvPlaneView = inputTexture.CreateView(&uvViewDesc);
I initially spent time looking for a plane field on SharedTextureMemoryIOSurfaceDescriptor for per-plane import. It doesn't exist — the API handles planes at the view level, not the import level. One import, two aspect views. Dawn handles the half-resolution UV plane implicitly.
Finding all of this required reading Dawn's test files and WebGPU spec extension proposals. There's no "how to import a YUV IOSurface" tutorial anywhere.
The YUV→RGB Shader
The pipeline auto-inserts this as pass 0 when appleLog is true:
@group(0) @binding(0) var yPlaneTex: texture_2d<f32>;
@group(0) @binding(1) var uvPlaneTex: texture_2d<f32>;
@group(0) @binding(2) var outputTex: texture_storage_2d<rgba16float, write>;
@compute @workgroup_size(16, 16)
fn main(@builtin(global_invocation_id) id: vec3u) {
  let dims = textureDimensions(yPlaneTex);
  if (id.x >= dims.x || id.y >= dims.y) { return; }

  let yRaw = textureLoad(yPlaneTex, vec2i(id.xy), 0).r;
  let uvRaw = textureLoad(uvPlaneTex, vec2i(id.xy / 2u), 0).rg;

  // Video range expansion (10-bit: 64/1023, 940/1023, 960/1023)
  let y = (yRaw - 0.06256) / (0.91887 - 0.06256);
  let cb = (uvRaw.r - 0.06256) / (0.93842 - 0.06256) - 0.5;
  let cr = (uvRaw.g - 0.06256) / (0.93842 - 0.06256) - 0.5;

  // BT.2020 YCbCr → RGB
  let r = y + 1.4746 * cr;
  let g = y - 0.16455 * cb - 0.57135 * cr;
  let b = y + 1.8814 * cb;

  // Do NOT clamp — values outside [0,1] are valid HDR data
  textureStore(outputTex, vec2i(id.xy), vec4f(r, g, b, 1.0));
}
Three things matter: video range expansion (Apple Log uses 64–940, not 0–1023), BT.2020 matrix (not BT.709 — subtle color difference), and no clamp (the whole point of Apple Log is values > 1.0 for super-bright highlights).
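Those matrix constants aren't magic numbers: they fall out of BT.2020's luma coefficients (Kr = 0.2627, Kb = 0.0593) through the standard YCbCr relations. A quick derivation check:

```cpp
#include <cmath>

// BT.2020 luma coefficients (Rec. ITU-R BT.2020).
constexpr double Kr = 0.2627;
constexpr double Kb = 0.0593;
constexpr double Kg = 1.0 - Kr - Kb;  // 0.6780

// Standard YCbCr -> RGB terms, derived from Kr/Kb:
constexpr double crToR = 2.0 * (1.0 - Kr);            // ≈ 1.4746
constexpr double cbToB = 2.0 * (1.0 - Kb);            // ≈ 1.8814
constexpr double cbToG = 2.0 * Kb * (1.0 - Kb) / Kg;  // ≈ 0.16455
constexpr double crToG = 2.0 * Kr * (1.0 - Kr) / Kg;  // ≈ 0.57135
```

Swap in BT.709's Kr = 0.2126, Kb = 0.0722 and you get the familiar 1.5748 / 1.8556 pair instead, which is exactly the "subtle color difference" you'd see if the wrong matrix were used.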
RGBA16Float Everywhere
The original pipeline used RGBA8Unorm ping-pong textures — 8 bits per channel, [0, 1] range. Apple Log needs RGBA16Float to preserve HDR values > 1.0. But WebGPU requires the shader's texture_storage_2d<format> to exactly match the texture format. No implicit conversion.
Rather than maintaining two shader variants for every user shader, I switched the entire pipeline to RGBA16Float — both SDR and Apple Log modes. It costs 2x texture memory for SDR (~66MB vs ~33MB at 4K), but the GPU has 6-8 GB and the complexity savings are worth it. One format everywhere means every shader works in both modes without changes.
This decision touched every texture allocation, every MakeImageFromTexture call, every built-in shader, and every example shader. Lots of files, one conceptual change.
The LUT Shader
With Apple Log RGB in RGBA16Float, a .cube LUT maps it to a displayable color space. The shader uses the input RGB as 3D texture coordinates:
@group(0) @binding(3) var lutTex: texture_3d<f32>;
@group(0) @binding(4) var lutSampler: sampler;
let lutCoord = clamp(color.rgb, vec3f(0.0), vec3f(1.0));
let lutColor = textureSampleLevel(lutTex, lutSampler, lutCoord, 0.0);
textureSampleLevel, not textureSample — this tripped me up. textureSample requires implicit derivatives (fragment shader only). textureSampleLevel takes an explicit mip level and works in compute.
The clamp is intentional here: the LUT only covers [0, 1]. Values outside would sample garbage. The LUT's edge entries already handle extreme highlights and shadows.
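For context, the .cube format itself is simple enough that a minimal parser fits on a page. A sketch (illustrative, not the pipeline's actual loader; it skips TITLE and DOMAIN_MIN/MAX and assumes well-formed input, with the red index varying fastest per the usual Resolve convention):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Minimal .cube parser sketch. Data lines are "R G B" floats; red varies
// fastest, so the resulting vector maps directly onto a tightly packed
// size x size x size 3D texture upload.
struct CubeLut {
    int size = 0;
    std::vector<float> rgb;  // size^3 * 3 floats, red-fastest order
};

inline CubeLut parseCube(const std::string& text) {
    CubeLut lut;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;  // comment / blank
        std::istringstream ls(line);
        std::string keyword;
        ls >> keyword;
        if (keyword == "LUT_3D_SIZE") {
            ls >> lut.size;
            lut.rgb.reserve(static_cast<size_t>(lut.size) *
                            lut.size * lut.size * 3);
        } else if (keyword == "TITLE" || keyword == "DOMAIN_MIN" ||
                   keyword == "DOMAIN_MAX") {
            continue;  // ignored in this sketch
        } else {
            // Data line: `keyword` already holds the red component.
            float r = std::stof(keyword), g = 0.0f, b = 0.0f;
            ls >> g >> b;
            lut.rgb.insert(lut.rgb.end(), {r, g, b});
        }
    }
    return lut;
}
```

A 33-point LUT parses into 33³ = 35,937 entries, which is exactly the `width: 33, height: 33, depth: 33` texture the resource system below uploads.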
The Resource System
Getting LUT data to the shader required a new abstraction — typed GPU resources uploaded once at setup and bound to shader passes:
const { currentFrame } = useGPUFrameProcessor(camera, {
  resources: {
    lut: GPUResource.texture3D(lutData, {
      width: 33, height: 33, depth: 33,
      format: 'rgba32float',
    }),
  },
  pipeline: (frame, res) => {
    'worklet';
    frame.runShader(LUT_WGSL, { inputs: { lut: res.lut } });
  },
});
The capture proxy assigns handle IDs during pipeline setup and records which resources each pass needs. The native side uploads data once and creates bind group entries. The same system handles 3D textures, 2D textures, and storage buffers — general-purpose GPU resource input.
Two Ways to Use the Output
I mentioned in Part 1 that the pipeline has two consumption paths for buffer data. The same principle applies to the entire processed frame, and it matters more with Apple Log because you're making a choice about what gets recorded.
Burn-in via onFrame: Drawing directly onto the GPU texture. Anything drawn here — LUT-graded video, histogram overlays, waveform monitors, focus peaking — becomes part of the frame. It shows up in recordings, screenshots, and exports.
const { currentFrame } = useGPUFrameProcessor(camera, {
  resources: { lut: lutResource },
  pipeline: (frame, res) => {
    'worklet';
    frame.runShader(LUT_WGSL, { inputs: { lut: res.lut } });
    return {};
  },
  onFrame: (frame) => {
    'worklet';
    // Draw exposure guide on the graded frame — burned into video
    frame.canvas.drawRect(Skia.XYWHRect(x, y, w, h), paint);
  },
});
Display-only via React: Buffer data and the current frame are exposed as Reanimated shared values. You can draw overlays in React that appear on screen but aren't part of the recorded video:
const { currentFrame, buffers } = useGPUFrameProcessor(camera, {
  resources: { lut: lutResource },
  pipeline: (frame, res) => {
    'worklet';
    frame.runShader(LUT_WGSL, { inputs: { lut: res.lut } });
    const hist = frame.runShader(HISTOGRAM_WGSL, { output: Uint32Array, count: 256 });
    return { hist };
  },
});

// Histogram overlay in React — visible on screen, NOT in recorded video
const histPicture = useDerivedValue(() => {
  const hist = buffers.value.hist as Uint32Array | null;
  if (!hist) return emptyPicture;
  return createPicture((canvas) => { /* draw bars */ });
});
return (
  <Canvas>
    <Image image={currentFrame} />
    <Picture picture={histPicture} /> {/* display-only overlay */}
  </Canvas>
);
This distinction matters for professional video work. A colorist wants to see scopes and guides while grading, but doesn't want them burned into the export. A filmmaker wants focus peaking during capture but clean footage in the final file. Same pipeline, same data, two output paths — the user chooses per element.
The canvas flicker fix from earlier (separate canvasTex + split mutex) is what makes the burn-in path reliable. Without it, onFrame draws would intermittently vanish when the compute pipeline overwrote the shared texture.
The Full Apple Log Pipeline
Camera (10-bit YUV, Apple Log)
→ IOSurface import as R10X6BG10X6Biplanar420Unorm
→ Y plane view (R16Unorm) + UV plane view (RG16Unorm)
→ Pass 0: YUV→RGB (BT.2020, video range, no clamp)
→ RGBA16Float ping-pong texture
→ Pass 1: LUT application (3D texture sample)
→ RGBA16Float output
→ Skia Graphite MakeImageFromTexture
→ SkImage → Canvas
Every arrow is a GPU-side operation or a metadata bind. Zero pixel copies from camera to display. From JS, enabling Apple Log is one line:
const camera = useCamera({
  device: 'back',
  colorSpace: 'appleLog',
});
What I Learned
The debugging stories have a common thread: the bug is never where you think it is.
The 33MB memcpy wasn't a GPU problem. The histogram flicker wasn't a shader problem. The empty SkImage wasn't a threading problem. In every case, the first theory was wrong, and the fix was simpler than the theory.
A few specific takeaways:
Profile the CPU before debugging the GPU. GPU work was fast. The pipeline was slow because of a dead memcpy in the camera delegate.
Never make two changes at once in a GPU pipeline. GPU errors are silent — no exceptions, just wrong pixels. When I changed SkImage creation timing and added an RGBA view format override simultaneously, I couldn't tell which one broke rendering. Isolate variables.
Keep a working reference. The single-pass pipeline that worked perfectly was the key to debugging the multi-pass rewrite. Every time something broke, I could diff against it.
Timing instrumentation pays for itself immediately. Adding per-step microsecond timing to the native pipeline (lock → import → bind → compute → buffers → makeImage) found the 33MB memcpy on the first run and has been useful for every optimization since.
What's Next
- Performance profiling with Apple Log — how much does YUV→RGB + LUT add to GPU time?
- ONNX Runtime WebGPU EP — ML inference on the same Dawn device, using Apple Log's wider dynamic range for better feature extraction
- Android — AHardwareBuffer + SharedTextureMemory, same architecture, different platform primitives
- Recording — render the LUT-graded output to an AVAssetWriter surface
The pipeline code is at react-native-webgpu-camera. Still spike phase — the API will change — but the architecture is solid.
This is Part 2 of an ongoing series. Part 1 covers the initial spike: zero-copy IOSurface import, multi-pass compute shaders, buffer readback, and the Skia Graphite setup gauntlet.