Mathis

Posted on Jun 19

Building a GPU-accelerated screen recorder in Rust with wgpu

#rust #gpu #wgpu #showdev

I record a lot of product demos. Screen Studio on macOS made "smoothly zoom toward the cursor, then glide back" feel effortless. On Linux, where I live, nothing did it well. So I built the zoom engine myself in Rust, on top of wgpu. It now powers Screenix, a polished screen recorder for Linux (here is how it compares to Screen Studio).

This post is the full writeup of the real engine, not a toy example. If you have ever wanted to do real-time GPU image processing in Rust, or you care about how to keep a live preview and a final export pixel-matched, the plumbing alone should save you a weekend.

First, the workflow reframe

My first version was the obvious one: hold a hotkey during recording, the recorder zooms live, release and it glides back. I shipped it. I then threw it out.

Even Screen Studio does not do that anymore. The problem is that "live" means the zoom decision is locked in at capture time. If the zoom follows the cursor a little too aggressively, or transitions out 100ms too early, your only fix is to re-record. That is miserable for a three-minute demo with one bad moment.

The workflow that actually wins is non-destructive timeline editing:

Record raw. Capture the screen and a sidecar of cursor positions, clicks, and timestamps. No zoom baked in.
Edit. In the editor, drop zoom "clips" onto a timeline. Each clip says "from second 4 to second 9, zoom 2x and follow the cursor." Drag the edges, change the scale, undo, redo: none of it touches the recording.
Preview. The editor renders the zoom live as you scrub, so you see exactly what you will get.
Export. A GPU bake pass renders the final pixels from the same clip data.

This means the zoom engine has to live in three places at once, and they all have to agree:

a TypeScript engine that powers the live preview on a <canvas>
a WGSL fragment shader that bakes the final pixels on the GPU
the Rust glue that turns timeline clips into per-frame GPU parameters

The interesting engineering is keeping those three in parity. The interesting math is that they all share one core idea.

The one idea that makes everything work

Zoom is not a rescale. It is a crop-window sample.

When you zoom in on a frame, you are not generating pixels. You are picking a smaller rectangular region of the source and drawing it to the full output. The sampler interpolates between source texels for free, so "zoom" collapses to one line of math.

On the preview canvas it is literally one drawImage call with a computed source rect:

const srcW = sourceWidth / scale;
const srcH = sourceHeight / scale;
const sx = clamp(centerX * sourceWidth - srcW / 2, 0, sourceWidth - srcW);
const sy = clamp(centerY * sourceHeight - srcH / 2, 0, sourceHeight - srcH);
ctx.drawImage(video, sx, sy, srcW, srcH, 0, 0, sourceWidth, sourceHeight);

On the GPU it is a UV remap in the fragment shader:

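// Normalize the fragment position to [0, 1] output coordinates.
let uv = frag_pos.xy / resolution;
let zoom = vec2<f32>(zoom_level, zoom_level_y);
// Shrink the offset from the output center by the zoom factor...
let centered = uv - vec2<f32>(0.5);
let zoomed   = centered / zoom;
// ...and re-center it on bounded_center, the clamped zoom center derived in the next section.
let final_uv = bounded_center + zoomed;
let color = textureSample(screen_texture, screen_sampler, final_uv);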

Same transform, two languages. Pick a window centered on center, sample it into the output, let the sampler (or the canvas scaler) do the smoothing. No rescale kernel, no intermediate buffer.
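The same window math can be sketched in Rust to check the numbers by hand. This is a minimal illustration, not code from Screenix: `CropRect` and `crop_window` are hypothetical names, and the `max(1.0)` guard assumes zoom never goes below 1x.

```rust
/// Source-space crop rectangle, in pixels.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CropRect {
    pub x: f64,
    pub y: f64,
    pub w: f64,
    pub h: f64,
}

/// Window shown by a zoom of `scale` centered on the normalized point
/// (`center_x`, `center_y`), clamped so it stays inside the source.
pub fn crop_window(source_w: f64, source_h: f64, scale: f64, center_x: f64, center_y: f64) -> CropRect {
    let scale = scale.max(1.0); // assumption: zoom below 1x is not allowed
    let w = source_w / scale;
    let h = source_h / scale;
    // Top-left corner: center minus half the window, kept inside the source.
    let x = (center_x * source_w - w / 2.0).clamp(0.0, source_w - w);
    let y = (center_y * source_h - h / 2.0).clamp(0.0, source_h - h);
    CropRect { x, y, w, h }
}
```

For a 1920x1080 source at 2x centered on (0.5, 0.5), this gives a 960x540 window whose top-left corner is at (480, 270).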

The trick that took the longest: bounded center

This is the part I am most happy with, and it caused the ugliest bugs.

When you zoom in and pan toward the edge, the crop window can run past the edge of the source. The obvious fix is to clamp the final UV to [0, 1] so you never sample outside the texture. That keeps the GPU happy but creates a horrible artifact: the moment the zoom window hits the edge, every out-of-range UV collapses onto the border texel, so the border pixels stretch across the whole side of the output. The recording looks like it is melting.

The fix is subtle. Do not clamp the UV. Clamp the zoom center instead.

At zoom level z, the visible window is 1/z wide. Half of that is 0.5/z. So the center can roam freely inside [0.5/z, 1 - 0.5/z] and the window never leaves the source. If you clamp the center to that range, the final UV is mathematically guaranteed to stay in bounds, and there is nothing left to stretch.

At 2x zoom the window is half the frame, so the center is locked to [0.25, 0.75]. The view follows the cursor anywhere in the middle, but it physically cannot be panned far enough to expose out-of-bounds pixels.
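To make the guarantee concrete, here is a one-axis sketch in Rust. The function names are illustrative, and the `max(1.0)` guard is an assumption that zoom never drops below 1x.

```rust
/// Clamps a normalized zoom center so a window of width 1/zoom never leaves [0, 1].
pub fn bounded_center(center: f32, zoom: f32) -> f32 {
    let half_window = 0.5 / zoom.max(1.0);
    center.clamp(half_window, 1.0 - half_window)
}

/// Maps an output coordinate in [0, 1] to a source coordinate,
/// using the same remap as the fragment shader (one axis only).
pub fn remap_uv(uv: f32, center: f32, zoom: f32) -> f32 {
    bounded_center(center, zoom) + (uv - 0.5) / zoom.max(1.0)
}
```

With a requested center of 0.95 at 2x, the center is pulled back to 0.75, and the right edge of the output (uv = 1.0) samples exactly 1.0 in the source. No UV clamp is needed.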

Here is the reason this section matters for the whole architecture: this same invariant shows up in three files, in three languages, and they must all compute it identically.

In the WGSL shader:

let half_window = vec2<f32>(0.5) / zoom;
let bounded_center = clamp(uniforms.zoom_center, half_window, vec2<f32>(1.0) - half_window);

In the TypeScript preview engine:

function getFocusBoundsForScale(zoomScale: number) {
  const margin = 1 / (2 * zoomScale);
  return { min: margin, max: 1 - margin };
}

In the Rust export path, where per-frame center is derived from the crop top-left:

let cx = (pan_x / source_w as f64 + 0.5 / zoom) as f32;
let cy = (pan_y / source_h as f64 + 0.5 / zoom) as f32;

One invariant, enforced everywhere. When the preview shows a centered cursor, the export shows the same centered cursor, because the boundary math is identical. If you want a single takeaway from this post, it is this: when a value needs bounds, clamp the variable that causes the problem, not the one that displays the symptom.

The timeline model

A zoom effect is just a clip on a track. The data model is deliberately boring:

interface ZoomClip {
  start: number;          // seconds
  end: number;            // seconds
  scale: number;          // 1.0 to 20.0
  mode: ZoomMode;         // manual | follow | deadzone | auto
  centerX: number;        // 0..1 (manual mode)
  centerY: number;        // 0..1 (manual mode)
  transitionDuration: number;        // zoom-in seconds
  transitionOutDuration?: number;    // zoom-out seconds (can differ)
  transitionInCurve?: ZoomTransitionCurve;
  transitionOutCurve?: ZoomTransitionCurve;
  transitionPanCurve?: ZoomTransitionCurve;
  leadTime: number;       // how far ahead a follow clip looks
  deadZone: number;       // 0..1 dead-zone size
  springTension: number;  // 10..500, maps to spring constants
  cursorHidden?: boolean;
}

Every field is editable, every edit goes through an undo stack, and the whole project serializes to a .screenix file. Nothing here is GPU-specific. It is a description of intent.

The four modes cover how the center is decided during the clip:

Manual (static_region): fixed centerX/centerY. You parked the zoom on a UI element and it stays there.
Follow (follow_mouse): center tracks the recorded cursor, spring-smoothed.
Deadzone (deadzone): center only moves when the cursor leaves a dead-zone in the middle of the frame. This is the "camera barely moves until you push toward the edge" feel.
Auto (auto_zoom): clips generated from click events. One button.

That last one is worth a paragraph.

Auto zoom from clicks

Most demos are "click here, click there, explain." So Auto reads the click sidecar and generates clips automatically: pad 300ms before each click and 2500ms after, merge overlapping ranges, fill tiny gaps, drop anything shorter than a second, and keep at least 4 seconds between zooms so it does not feel hyperactive. The clip center lands on the centroid of the nearby clicks.

return ranges.map((range) => {
  const groups = groupClicksInRange(clicks, range.start, range.end, zoomLevel);
  const center = groups[0] ? clickGroupCenter(groups[0]) : { x: 0.5, y: 0.5 };
  return { start: range.start, end: range.end, scale: zoomLevel,
           mode: "deadzone", centerX: center.x, centerY: center.y, /* ... */ };
});

You get a sane first cut in one click, then you nudge. This is the mode that makes the product feel like it did the boring part for you.
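The range-building rules above can be sketched roughly as follows. This is one possible reading, not the Screenix implementation: the 0.5 s gap-fill threshold is a made-up value, and "keep at least 4 seconds between zooms" is interpreted here as dropping a range that starts too soon after the previous kept one.

```rust
const PAD_BEFORE_SEC: f64 = 0.3;
const PAD_AFTER_SEC: f64 = 2.5;
const GAP_FILL_SEC: f64 = 0.5; // assumption: what counts as a "tiny gap"
const MIN_CLIP_SEC: f64 = 1.0;
const MIN_SPACING_SEC: f64 = 4.0;

/// Turns click timestamps (seconds) into zoom ranges `(start, end)`.
pub fn auto_zoom_ranges(click_times: &[f64], duration: f64) -> Vec<(f64, f64)> {
    let mut times = click_times.to_vec();
    times.sort_by(|a, b| a.total_cmp(b));

    // Pad each click, then merge ranges that overlap or sit within a tiny gap.
    let mut merged: Vec<(f64, f64)> = Vec::new();
    for t in times {
        let start = (t - PAD_BEFORE_SEC).max(0.0);
        let end = (t + PAD_AFTER_SEC).min(duration);
        if let Some(last) = merged.last_mut() {
            if start <= last.1 + GAP_FILL_SEC {
                last.1 = last.1.max(end);
                continue;
            }
        }
        merged.push((start, end));
    }

    // Drop short ranges, then enforce spacing between the ranges that remain.
    let mut kept: Vec<(f64, f64)> = Vec::new();
    for (start, end) in merged {
        if end - start < MIN_CLIP_SEC {
            continue;
        }
        let far_enough = kept
            .last()
            .map_or(true, |prev| start - prev.1 >= MIN_SPACING_SEC);
        if far_enough {
            kept.push((start, end));
        }
    }
    kept
}
```

Clicks at 1 s and 2 s collapse into one range of roughly 0.7 s to 4.5 s, while a click at 6 s would be dropped under this reading because its range starts less than 4 seconds after the previous one ends.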

The live preview engine (TypeScript)

The preview has to feel instant. Scrub the timeline, and the frame should immediately show the right zoom. So there is a small engine that turns a videoTime plus the list of clips into a CameraState:

interface CameraState {
  scale: number;
  centerX: number;
  centerY: number;
  strength: number;   // 0..1, how "zoomed in" we are mid-transition
  isIdentity: boolean;
}

A clip is not a hard on/off. It has a strength envelope:

zoom-in: ramps 0 to 1 over transitionDuration at the start
hold: strength 1 across the clip
zoom-out: ramps 1 to 0 over transitionOutDuration after the clip ends

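function computeRegionStrength(clip, timeSec) {
  const zoomInDur = clip.transitionDuration;
  // Assumption: the optional zoom-out duration falls back to the zoom-in duration.
  const zoomOutDur = clip.transitionOutDuration ?? clip.transitionDuration;
  if (timeSec < clip.start || timeSec > clip.end + zoomOutDur) return 0;
  const tRel = timeSec - clip.start;
  if (tRel < zoomInDur) return applyCurve(clip.transitionInCurve, tRel / zoomInDur);
  if (timeSec <= clip.end) return 1;
  return 1 - applyCurve(clip.transitionOutCurve, (timeSec - clip.end) / zoomOutDur);
}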

The strength scales how far the zoom sits above 1x (strength 0 is no zoom, strength 1 is the clip's full scale), so the zoom eases in and out instead of snapping. And because zoom-in and zoom-out get separate durations and separate curves, you can have a soft 600ms ease in and a snappy 200ms ease out.
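As a worked example of the envelope, here is a Rust sketch with a linear curve standing in for `applyCurve`. `ClipTiming` is a trimmed-down, hypothetical mirror of `ZoomClip`, and the linear blend between 1x and the clip scale is an assumption about how strength is applied.

```rust
/// Clip timing in seconds plus the target scale (hypothetical, reduced ZoomClip).
pub struct ClipTiming {
    pub start: f64,
    pub end: f64,
    pub zoom_in: f64,
    pub zoom_out: f64,
    pub scale: f64,
}

/// Strength envelope: ramp up, hold at 1, ramp down after the clip ends.
pub fn region_strength(clip: &ClipTiming, t: f64) -> f64 {
    if t < clip.start || t > clip.end + clip.zoom_out {
        return 0.0;
    }
    let rel = t - clip.start;
    if rel < clip.zoom_in {
        return rel / clip.zoom_in; // linear curve for illustration
    }
    if t <= clip.end {
        return 1.0;
    }
    1.0 - (t - clip.end) / clip.zoom_out
}

/// Assumed blend: strength 0 gives 1x, strength 1 gives the clip's full scale.
pub fn effective_scale(clip: &ClipTiming, t: f64) -> f64 {
    1.0 + (clip.scale - 1.0) * region_strength(clip, t)
}
```

For a 2x clip from second 4 to second 9 with a 600 ms ease in and a 200 ms ease out, the scale is about 1.5x at 4.3 s, 2x at 6 s, about 1.5x again at 9.1 s, and back to 1x after 9.2 s.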

Curves, and stealing from the best

There are four curves: Linear, Smooth (ease-in-out cubic), Snap (ease-out cubic), and Soft. Soft is the one I care about. It is a deliberate copy of the feel of a certain macOS app, expressed as a CSS-style cubic bezier, cubic-bezier(0.16, 1, 0.3, 1):

export function easeOutScreenStudio(t: number): number {
  return cubicBezier(0.16, 1, 0.3, 1, t);
}

The catch is that a CSS bezier is a parametric curve: x and y are each a cubic function of a hidden parameter, not y as a function of x. The easing input is an x value, so you have to solve for the parameter that produces that x and then evaluate y there. So cubicBezier runs a few Newton-Raphson iterations to get close, then a bisection pass to nail it:

function cubicBezier(x1, y1, x2, y2, t) {
  const targetX = clamp01(t);
  let solvedT = targetX;
  for (let i = 0; i < 8; i++) {                 // Newton-Raphson
    const cx = sampleCubicBezier(x1, x2, solvedT) - targetX;
    const d  = sampleCubicBezierDerivative(x1, x2, solvedT);
    if (Math.abs(cx) < 1e-6 || Math.abs(d) < 1e-6) break;
    solvedT -= cx / d;
  }
  // ... a few bisection steps to polish ...
  return sampleCubicBezier(y1, y2, solvedT);
}

Eight Newton iterations plus a bisection safety net is overkill for a 1D easing curve, but it runs once per frame and it converges to within the solver's tolerance (1e-6 in the loop above). You do not guess the feel, you reproduce it.
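For comparison, here is a Rust sketch of the same solve using bisection only, with no Newton step. It relies on x being monotonic in the curve parameter, which holds when both x control points are in [0, 1], as CSS requires. The function names are illustrative.

```rust
/// One coordinate of a cubic Bezier with endpoints 0 and 1
/// and control values `p1`, `p2`, evaluated at parameter `s`.
fn bezier_coord(p1: f64, p2: f64, s: f64) -> f64 {
    let inv = 1.0 - s;
    3.0 * inv * inv * s * p1 + 3.0 * inv * s * s * p2 + s * s * s
}

/// CSS-style cubic-bezier easing: find the parameter `s` where x(s) == t,
/// then return y(s). Assumes 0 <= x1, x2 <= 1 so x(s) is monotonic.
pub fn cubic_bezier_ease(x1: f64, y1: f64, x2: f64, y2: f64, t: f64) -> f64 {
    let target = t.clamp(0.0, 1.0);
    let (mut lo, mut hi) = (0.0_f64, 1.0_f64);
    // 40 halvings shrink the bracket far below any visible error.
    for _ in 0..40 {
        let mid = 0.5 * (lo + hi);
        if bezier_coord(x1, x2, mid) < target {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    bezier_coord(y1, y2, 0.5 * (lo + hi))
}
```

Bisection is slower to converge than Newton-Raphson but cannot overshoot the bracket, which is why it works as the safety net. With the Soft control points (0.16, 1, 0.3, 1), the curve is already past 90% of its travel at the halfway point of the transition.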

The deadzone follow, and why springs matter

Follow and Deadzone modes chase the recorded cursor. The naive version pins the center on the cursor every frame. That looks terrible: the view lurches with every micro-jitter.

Deadzone fixes the lurch. The center only updates when the cursor crosses out of a central dead-zone. Between those moments, the camera holds still. When it does move, a spring carries it.
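One way to express that rule along a single axis is sketched below. This is an interpretation, not the Screenix code: it assumes `dead_zone` is a 0..1 fraction of the visible window, and that the camera moves just far enough to put the cursor back on the dead-zone edge.

```rust
/// New camera target along one axis, in normalized source coordinates.
/// Assumption: `dead_zone` is a 0..1 fraction of the visible window.
pub fn deadzone_target(center: f64, cursor: f64, zoom: f64, dead_zone: f64) -> f64 {
    let zoom = zoom.max(1.0);
    let half_window = 0.5 / zoom;
    let half_dead = dead_zone.clamp(0.0, 1.0) * half_window;
    let offset = cursor - center;

    // Inside the dead-zone the camera holds still. Outside it, the camera
    // moves only by the amount the cursor overshoots the dead-zone edge.
    let moved = if offset > half_dead {
        center + (offset - half_dead)
    } else if offset < -half_dead {
        center + (offset + half_dead)
    } else {
        center
    };

    // Keep the window inside the source, as in the bounded-center rule.
    moved.clamp(half_window, 1.0 - half_window)
}
```

The value returned here would be the spring's target, not the camera position itself.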

The spring is where most of the "feel" lives. Cursor samples from the OS arrive at irregular intervals, sometimes with big gaps. Linear interpolation between sparse samples looks robotic. A spring feels human. And critically, the spring has to be stable for any timestep, because scrubbing the timeline can hand the engine a huge dt.

The preview uses a semi-implicit Euler integrator with a fixed sub-step, so it stays stable even when the frame interval spikes:

function stepSpring(pos, vel, target, dtSec, config) {
  const SUB_STEP = 1 / 240;
  let remaining = dtSec;
  let p = pos; let v = vel;
  while (remaining > 0) {
    const h = Math.min(remaining, SUB_STEP);
    const force = -config.stiffness * (p - target) - config.damping * v;
    v += (force / config.mass) * h;
    p += v * h;
    remaining -= h;
  }
  return [p, v];
}

A 1/240s sub-step means even a hitching 20fps preview integrates the spring 12 times per rendered frame, so the motion is smooth regardless of display rate.
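A Rust port of the same integrator makes the timestep claim easy to check: one 50 ms step should land in the same place as three 1/60 s steps. The stiffness and damping values in the usage note are illustrative, since the mapping from `springTension` to spring constants is not shown in this post.

```rust
#[derive(Clone, Copy)]
pub struct SpringConfig {
    pub stiffness: f64,
    pub damping: f64,
    pub mass: f64,
}

/// Semi-implicit Euler with a fixed 1/240 s sub-step.
/// Returns the new `(position, velocity)`.
pub fn step_spring(pos: f64, vel: f64, target: f64, dt_sec: f64, cfg: SpringConfig) -> (f64, f64) {
    const SUB_STEP: f64 = 1.0 / 240.0;
    let mut remaining = dt_sec.max(0.0);
    let (mut p, mut v) = (pos, vel);
    while remaining > 0.0 {
        let h = remaining.min(SUB_STEP);
        let force = -cfg.stiffness * (p - target) - cfg.damping * v;
        v += (force / cfg.mass) * h; // velocity first...
        p += v * h;                  // ...then position with the updated velocity
        remaining -= h;
    }
    (p, v)
}
```

With stiffness 170, damping 26, and mass 1 (roughly critically damped), a single call with a 2-second `dt` settles on the target instead of blowing up, which is the case scrubbing produces.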

The bake path: GPU, with a soft landing

Previewing is cheap because the source video is already decoding on the page. Export is different: you are decoding every frame, transforming it, and re-encoding. That is where the GPU earns its keep.

The export pipeline is three stages:

Compute per-frame params. The same clip data the preview uses gets flattened into an array of GpuZoomParams, one per output frame.

   struct GpuZoomParams {
       zoom: f32,
       zoom_y: f32,        // independent Y for fill/reframe crops
       center_x: f32,
       center_y: f32,
   }

Decode on the CPU, zoom on the GPU, encode on the CPU. FFmpeg decodes the source to RGBA and pipes it in. The Rust side uploads each frame to a texture, runs the WGSL pass, reads it back, and writes it to the encoder's stdin.

   ffmpeg decode -> RGBA pipe -> upload_screen -> GPU zoom pass -> readback -> ffmpeg encode

Fall back to FFmpeg filters if the GPU is unusable. On a machine with a broken driver or no GPU, the same per-frame params drive an FFmpeg crop + scale chain instead. Slower, same framing.
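The first and last stages can be sketched together: flatten a camera function into one `GpuZoomParams` per frame, then turn a single frame's params into an FFmpeg `crop=w:h:x:y,scale=W:H` filter string. `flatten_params` and `crop_scale_filter` are hypothetical helpers. How the real fallback feeds per-frame values into FFmpeg is not covered in this post, so the filter string below only describes one frame.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct GpuZoomParams {
    pub zoom: f32,
    pub zoom_y: f32,
    pub center_x: f32,
    pub center_y: f32,
}

/// Samples a camera function once per output frame (frame i is at time i / fps).
pub fn flatten_params<F>(frame_count: usize, fps: f64, camera_at: F) -> Vec<GpuZoomParams>
where
    F: Fn(f64) -> GpuZoomParams,
{
    (0..frame_count).map(|i| camera_at(i as f64 / fps)).collect()
}

/// Builds an FFmpeg crop + scale filter string for one frame's params.
pub fn crop_scale_filter(p: GpuZoomParams, source_w: u32, source_h: u32) -> String {
    let (sw, sh) = (source_w as f64, source_h as f64);
    let w = sw / (p.zoom as f64).max(1.0);
    let h = sh / (p.zoom_y as f64).max(1.0);
    // Same bounded window as the shader: center minus half the window, clamped.
    let x = (p.center_x as f64 * sw - w / 2.0).clamp(0.0, sw - w);
    let y = (p.center_y as f64 * sh - h / 2.0).clamp(0.0, sh - h);
    format!("crop={:.0}:{:.0}:{:.0}:{:.0},scale={}:{}", w, h, x, y, source_w, source_h)
}
```

A centered 2x frame on a 1920x1080 source becomes `crop=960:540:480:270,scale=1920:1080`.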

That fallback is not a nice-to-have on Linux, it is survival. You cannot assume Vulkan, you cannot assume a discrete GPU, you cannot assume a GPU at all. So the adapter search cascades through every backend wgpu knows about before giving up:

async fn find_adapter() -> Result<wgpu::Adapter> {
    if let Some(a) = Self::try_backend(wgpu::Backends::all(), false).await { return Ok(a); }
    if let Some(a) = Self::try_backend(wgpu::Backends::VULKAN, false).await { return Ok(a); }
    if let Some(a) = Self::try_backend(wgpu::Backends::GL, false).await { return Ok(a); }
    if let Some(a) = Self::try_backend(wgpu::Backends::all(), true).await { return Ok(a); } // software
    Err(anyhow!("No wgpu adapter found"))
}

And because a broken driver can make even new() panic, GPU init is wrapped in catch_unwind. If it fails for any reason, the recorder drops to the FFmpeg path. The user never sees a crash, the export just finishes.
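The panic guard can be sketched with the standard library alone. `ExportBackend` and `choose_backend` are hypothetical names, and the closure stands in for whatever GPU initialization the controller performs. Note that `catch_unwind` only catches unwinding panics: it does not help if the binary is built with `panic = "abort"` or if the driver crashes the process outright.

```rust
use std::panic;

#[derive(Debug, PartialEq)]
pub enum ExportBackend {
    Gpu,
    FfmpegFilters,
}

/// Runs `init_gpu` and falls back to the FFmpeg path on an error or a panic.
pub fn choose_backend<F>(init_gpu: F) -> ExportBackend
where
    F: FnOnce() -> Result<(), String> + panic::UnwindSafe,
{
    match panic::catch_unwind(init_gpu) {
        Ok(Ok(())) => ExportBackend::Gpu,
        // Either a clean error or a caught panic: use the fallback.
        Ok(Err(_)) | Err(_) => ExportBackend::FfmpegFilters,
    }
}
```

Calling `choose_backend(|| panic!("driver bug"))` returns `ExportBackend::FfmpegFilters` rather than taking the process down.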

The shader, piece by piece

The bake shader is small and worth reading because the choices are load-bearing.

No vertex buffer. The vertex shader generates a full-screen quad from @builtin(vertex_index). Six hardcoded corners, clip space [-1, 1], nothing uploaded per frame.

@vertex
fn vs_main(@builtin(vertex_index) vertex_index: u32) -> @builtin(position) vec4<f32> {
    var pos = array<vec2<f32>, 6>(
        vec2<f32>(-1.0, -1.0), vec2<f32>(1.0, -1.0), vec2<f32>(-1.0, 1.0),
        vec2<f32>(-1.0,  1.0), vec2<f32>(1.0, -1.0), vec2<f32>(1.0,  1.0)
    );
    return vec4<f32>(pos[vertex_index], 0.0, 1.0);
}

Bilinear filtering does the work. The sampler is the quiet hero:

let sampler = device.create_sampler(&wgpu::SamplerDescriptor {
    address_mode_u: wgpu::AddressMode::ClampToEdge,
    address_mode_v: wgpu::AddressMode::ClampToEdge,
    mag_filter: wgpu::FilterMode::Linear,
    min_filter: wgpu::FilterMode::Linear,
    ..Default::default()
});

Linear filtering for both magnification and minification is what makes the zoom look smooth instead of blocky. The texture unit interpolates between source texels, so a 4x zoom into a small crop still looks clean.

BGR swap for free. Some capture backends deliver frames in BGRx. On the CPU, swapping channels for a 1080p frame is a per-pixel loop. In the shader it is one branch, run in parallel across millions of pixels:

let color = textureSample(screen_texture, screen_sampler, final_uv);
if (uniforms.swap_bgr > 0.5) {
    return vec4<f32>(color.b, color.g, color.r, color.a);
}
return color;

A runtime uniform flag decides whether to swap, so X11 (RGBx) and Wayland (BGRx) share one shader.

Independent X and Y zoom. The uniform carries both zoom_level and zoom_level_y. For cursor zoom they are equal, so the zoom is square. But turning a 16:9 recording into a 9:16 social cut needs a window whose aspect differs from the source, so the two axes zoom by different amounts. One shader, two jobs.
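As a worked example of the reframe case, the per-axis zoom for the largest centered window of a target aspect can be computed like this. `reframe_zoom` is a hypothetical helper, and "largest window that fits" is an assumption about how the fill crop is chosen.

```rust
/// Per-axis zoom factors `(zoom_x, zoom_y)` for the largest window with aspect
/// `dst_aspect` (width / height) that fits inside a source with aspect `src_aspect`.
pub fn reframe_zoom(src_aspect: f32, dst_aspect: f32) -> (f32, f32) {
    if dst_aspect < src_aspect {
        // Target is narrower: keep the full height, crop the width.
        (src_aspect / dst_aspect, 1.0)
    } else {
        // Target is wider or equal: keep the full width, crop the height.
        (1.0, dst_aspect / src_aspect)
    }
}
```

Going from 16:9 to 9:16 keeps the full height and shows 81/256 of the width, so the X zoom is about 3.16 while the Y zoom stays at 1.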

The readback plumbing nobody writes about

The shader is the fun 5%. The other 95% is getting bytes onto the GPU and back, on whatever Linux machine the user has.

Triple buffering. The controller keeps three source textures, three readback buffers, and three bind groups, selected by a rotating index. While the GPU renders frame N into slot 0, the CPU can map and read the slot that holds frame N minus 2 (slot 1, if the index counts upward). The pipeline does not stall waiting for a buffer.
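The slot bookkeeping is small enough to sketch. This assumes one slot per frame and an index that counts upward, which is not confirmed by the post; the names are illustrative.

```rust
const SLOT_COUNT: usize = 3;

/// Slot used by a given frame when the index rotates upward.
pub fn slot_for_frame(frame: usize) -> usize {
    frame % SLOT_COUNT
}

/// The frame whose readback buffer can be mapped while `frame` is rendering,
/// or `None` while the pipeline is still filling up.
pub fn readable_frame(frame: usize) -> Option<usize> {
    frame.checked_sub(SLOT_COUNT - 1)
}
```

While frame 6 renders into slot 0, frame 4 is the one being read back, from slot 1, and frame 5 sits in slot 2 between the two.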

The 256-byte alignment trap. copy_texture_to_buffer requires bytes_per_row to be aligned to 256 bytes. A 1920-wide RGBA row is 7680 bytes, which happens to be aligned. A 1366-wide row is 5464 bytes, which is not. So the padded row size is computed explicitly:

let bytes_per_row = ((width * 4 + 255) / 256) * 256;

When you map that buffer back, there is padding at the end of every row. You have to un-pad it row by row into the output buffer, or the encoder gets a diagonally sheared frame:

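// Unpadded RGBA row length; the mapped buffer uses the padded bytes_per_row stride.
let row_bytes = self.width as usize * 4;
for row in 0..self.height as usize {
    let src_start = row * bytes_per_row as usize;
    let dst_start = row * row_bytes;
    out_rgba[dst_start..dst_start + row_bytes]
        .copy_from_slice(&mapped[src_start..src_start + row_bytes]);
}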

Before that copy can run, device.poll(WaitForSubmissionIndex(...)) has to block until the GPU is actually done and the mapping has completed. Once the rows are copied out, you unmap. Forget the poll and you get last frame's pixels. Forget the unpad and the export looks like glitch art. Forget the unmap and the buffer cannot be reused for the next frame's copy.
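The alignment and un-padding steps can be exercised without a GPU. In this sketch, a plain byte slice stands in for the mapped buffer. The constant mirrors the value of `wgpu::COPY_BYTES_PER_ROW_ALIGNMENT`, and the function names are illustrative.

```rust
/// Same value as wgpu::COPY_BYTES_PER_ROW_ALIGNMENT.
const ROW_ALIGNMENT: u32 = 256;

/// Padded row size in bytes for an RGBA8 frame of the given width.
pub fn padded_bytes_per_row(width: u32) -> u32 {
    let unpadded = width * 4;
    (unpadded + ROW_ALIGNMENT - 1) / ROW_ALIGNMENT * ROW_ALIGNMENT
}

/// Copies `height` rows out of a padded buffer into a tightly packed one.
/// Expects `mapped` to hold at least `height` full padded rows.
pub fn unpad_rows(mapped: &[u8], width: u32, height: u32) -> Vec<u8> {
    if width == 0 {
        return Vec::new();
    }
    let row_bytes = width as usize * 4;
    let stride = padded_bytes_per_row(width) as usize;
    let mut out = Vec::with_capacity(row_bytes * height as usize);
    for row in mapped.chunks(stride).take(height as usize) {
        // Keep the pixel bytes, skip the padding at the end of the row.
        out.extend_from_slice(&row[..row_bytes]);
    }
    out
}
```

A 1366-wide frame pads from 5464 to 5632 bytes per row, so each row carries 168 bytes the encoder must never see.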

Keeping the three renderers in parity

The hardest bug class in this project is "preview looks right, export is off by a few pixels." When you have one math idea expressed in TypeScript, WGSL, and Rust, they drift.

The defenses I landed on:

One invariant, copied verbatim. The bounded-center margin 0.5 / zoom appears in all three. When I changed it once, I grepped for "0.5 /" and "1 / (2 *" everywhere.
Derive the export from the same clip data. The export does not re-decide where to zoom. It receives the same ZoomClips and cursor sidecar the preview uses, and deterministically replays them. There is no second source of truth.
Float arrays over integer rects. When deriving the GPU center from a crop window, the export prefers continuous pan_x/pan_y float arrays over integer-rounded frame rects, because rounding the rect each frame introduces sub-pixel jitter that the preview never had.
Honest fallback. The FFmpeg fallback path uses the same per-frame params, so a GPU-less machine gets the same framing, just rendered slower.
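The float-versus-integer point can be quantified with a short sketch. It reuses the center formula from the export path. `max_rounding_error` is a hypothetical helper that measures how far the center drifts when each pan value is rounded to a whole pixel first.

```rust
/// Normalized center for a crop whose left edge is at `pan_px`,
/// using the same formula as the export path.
pub fn center_from_pan(pan_px: f64, source_px: u32, zoom: f64) -> f32 {
    (pan_px / source_px as f64 + 0.5 / zoom) as f32
}

/// Largest center error introduced by rounding each pan value to a whole pixel.
pub fn max_rounding_error(pans: &[f64], source_px: u32, zoom: f64) -> f32 {
    pans.iter()
        .map(|&p| {
            let exact = center_from_pan(p, source_px, zoom);
            let rounded = center_from_pan(p.round(), source_px, zoom);
            (exact - rounded).abs()
        })
        .fold(0.0_f32, f32::max)
}
```

For a slow pan of 0.3 px per frame on a 1920-wide source, the rounded path is off by up to 0.4 px on some frames and exact on others. That uneven error is the jitter, and it is bounded by half a source pixel.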

What I learned

A few things I wish I had internalized on day one:

Bake, do not burn. Recording zoom live feels clever until a user needs to fix one transition. Non-destructive timeline effects cost more to build and pay off forever. Match the modern workflow, even if the live version is less code.
Reach for the right primitive. "Zoom" felt like a scaling problem. It was a sampling problem. Reframing it as a crop-window sample made the expensive part free, in both a canvas drawImage and a shader.
Clamp the cause, not the symptom. Clamping the UV melted the edge of the frame. Clamping the center removed the whole class of artifacts. When a clamp produces artifacts, you are usually clamping the wrong variable.
One invariant, three languages. When the same rule lives in TS, WGSL, and Rust, copy it verbatim and grep for it. Abstraction across a language boundary costs more than the duplication.
Plan for the GPU to be absent. Every GPU path has an FFmpeg fallback. The export works on a headless VM with a software adapter. That is not a feature, it is a survival strategy on Linux.
The boring plumbing is the project. The shader is 99 lines. The controller hosting it is 800. The parity discipline around it is the real work. Budget accordingly.

The code

This is a real product, and the zoom engine is the soul of it. If you want a polished screen recorder for Linux without leaving your operating system, give Screenix a try (or read the Screen Studio alternative breakdown if you are coming from macOS).

If you want to build your own GPU pipeline in Rust, wgpu is hard to beat for headless, cross-backend work. Start with the bounded-center trick, a full-screen quad generated from vertex_index, and a deterministic per-frame param array. The rest is readback.

If you found this useful, I write about Rust, graphics, and Linux desktop internals as I ship this stuff. Questions and roasts welcome in the comments.

DEV Community