I built a caption editor that runs 100% in the browser - Whisper on WebGPU, MP4 export with WebCodecs, no server

Zden — Wed, 24 Jun 2026 08:01:57 +0000

Every "add captions to your short" tool works the same way: you upload your clip to their servers, they transcribe and render it in the cloud, and they meter your exports. That means an upload wait, a queue, file-size caps, a per-export bill, and your footage sitting on someone else's disk.

I wanted to know if you could do the whole thing in the browser instead. Turns out you can, and the result (CapStudio) has a strange property for a video tool: it costs me almost nothing to run, because there is no render farm and no transcription API. The only server is auth, billing, and syncing a tiny project file. That is the entire reason one person can run it.

Here is how the pieces fit together.

Transcription: Whisper on WebGPU, in a tab
Transcription runs locally with @huggingface/transformers (transformers.js v4), which can execute Whisper on WebGPU. The clip's audio is decoded to a mono 16 kHz Float32 buffer with decodeAudioData and an OfflineAudioContext, then fed to the pipeline.
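
A minimal sketch of that decode step, not CapStudio's actual code. The helper name decodeToMono16k and the AudioDeps shape are assumptions made for illustration; the Web Audio objects are passed in so the logic can be exercised outside a browser. Rendering through a one-channel OfflineAudioContext created at 16000 Hz is what performs both the downmix and the resample.

```typescript
// Hypothetical sketch: decode a clip's audio to mono 16 kHz Float32 samples.
// AudioDeps and the *Like interfaces are illustrative slices of Web Audio.
const TARGET_RATE = 16000;

interface AudioBufferLike {
  duration: number; // seconds
  getChannelData(channel: number): Float32Array;
}
interface BufferSourceLike {
  buffer: AudioBufferLike | null;
  connect(destination: unknown): unknown;
  start(): void;
}
interface OfflineContextLike {
  destination: unknown;
  createBufferSource(): BufferSourceLike;
  startRendering(): Promise<AudioBufferLike>;
}
interface AudioDeps {
  decode(bytes: ArrayBuffer): Promise<AudioBufferLike>;
  createOffline(channels: number, frames: number, sampleRate: number): OfflineContextLike;
}

async function decodeToMono16k(bytes: ArrayBuffer, deps: AudioDeps): Promise<Float32Array> {
  const decoded = await deps.decode(bytes);
  // Size the output so the whole clip fits at the target rate.
  const frames = Math.max(1, Math.ceil(decoded.duration * TARGET_RATE));
  const offline = deps.createOffline(1, frames, TARGET_RATE);
  const source = offline.createBufferSource();
  source.buffer = decoded;
  source.connect(offline.destination); // one-channel destination, so input is mixed down
  source.start();
  const rendered = await offline.startRendering();
  return rendered.getChannelData(0);
}
```

In a browser, decode would wrap AudioContext.decodeAudioData and createOffline would call new OfflineAudioContext(channels, frames, sampleRate).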

Two things bit me here:

You need word-level timestamps, which not every model can emit. Asking for return_timestamps: "word" throws on the default Whisper export ("Model outputs must contain cross attentions"). The fix is to use a _timestamped model export, which carries the cross-attention outputs. Rule of thumb: for word-timed captions, the model id must end in _timestamped.

navigator.gpu existing does not mean WebGPU works. On plenty of machines (hardware acceleration off, blocklisted GPU, RDP/VM) navigator.gpu is present but requestAdapter() returns null. transformers.js only checks for the object, tries WebGPU, fails, and then poisons the WASM fallback so even device: "wasm" dies. The fix is to actually call requestAdapter() yourself first and only choose WebGPU if you get a truthy adapter, otherwise go straight to a clean WASM-only path. I also added a stall watchdog: if WebGPU downloads the model but makes no progress for 45s, reject and fall back.
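
Both fixes can be sketched roughly as follows. This is an illustration under assumptions, not the code CapStudio ships: every name here (assertWordTimedModel, pickDevice, withStallWatchdog, GpuLike) is hypothetical, and the GPU object is passed in rather than read from navigator so the logic can be tested without a browser.

```typescript
// Illustrative sketch of both fixes. All names here are hypothetical.
type Device = "webgpu" | "wasm";
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}

// Rule of thumb from the text: word timestamps need a `_timestamped` export.
function assertWordTimedModel(modelId: string): void {
  if (!modelId.endsWith("_timestamped")) {
    throw new Error(`"${modelId}" cannot emit word timestamps; use a _timestamped export`);
  }
}

// Probe for a real adapter instead of trusting that the gpu object exists.
async function pickDevice(gpu: GpuLike | undefined): Promise<Device> {
  if (!gpu) return "wasm";
  try {
    return (await gpu.requestAdapter()) ? "webgpu" : "wasm";
  } catch {
    return "wasm"; // a throwing probe is treated like a missing adapter
  }
}

// Reject if `run` reports no progress for `stallMs`. `run` receives a `tick`
// callback and is expected to call it on every progress event.
function withStallWatchdog<T>(
  run: (tick: () => void) => Promise<T>,
  stallMs: number,
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    let settled = false;
    let timer: ReturnType<typeof setTimeout> | undefined;
    const finish = (settle: () => void) => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      settle();
    };
    const tick = () => {
      if (settled) return;
      clearTimeout(timer);
      timer = setTimeout(
        () => finish(() => reject(new Error(`no progress for ${stallMs} ms`))),
        stallMs,
      );
    };
    tick(); // arm the timer before any work starts
    run(tick).then(
      (value) => finish(() => resolve(value)),
      (error) => finish(() => reject(error)),
    );
  });
}
```

In the app, pickDevice would receive navigator.gpu and the WebGPU model load would be wrapped in withStallWatchdog with a 45000 ms limit, retrying on the WASM path if it rejects. The watchdog only rejects the promise; it does not cancel the underlying work.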

Rendering: one draw path for preview and export
Each caption style (karaoke highlight, word pop, clean lower-third, and so on) is a pure function: layout(StyleContext) -> CaptionLayout. A single painter turns that layout into canvas draw calls, and a single drawCaptionFrame is the only entry point used by BOTH the live preview (a <canvas> over a <video>) and the export. That is what makes "what you see is what you export" literally true. I proved it with a pixel-diff harness that draws the same frame to a DOM canvas and an OffscreenCanvas: 0 mismatching channels.
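
The comparison at the heart of such a harness can be very small. The function below is a hypothetical sketch, not the harness described above: it assumes both frames have already been read back as RGBA byte arrays (for example the data field of two getImageData() results) and counts the channel values that differ.

```typescript
// Hypothetical sketch of a pixel-diff check over two RGBA buffers, such as the
// `data` arrays of two getImageData() results for the same frame.
function countMismatchingChannels(
  a: ArrayLike<number>,
  b: ArrayLike<number>,
  tolerance = 0,
): number {
  if (a.length !== b.length) {
    throw new Error(`buffer sizes differ: ${a.length} vs ${b.length}`);
  }
  let mismatches = 0;
  for (let i = 0; i < a.length; i++) {
    // Each index is one channel (R, G, B or A) of one pixel.
    if (Math.abs(a[i] - b[i]) > tolerance) mismatches++;
  }
  return mismatches;
}
```

With the default tolerance of 0, a result of 0 means the two surfaces produced byte-identical frames.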

Adding a new style is one new module plus one registry line, with zero engine changes.
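
To make the shape concrete, here is a toy version of one style module, the registry, and the shared entry point. Only the names layout, StyleContext, CaptionLayout and drawCaptionFrame come from the text; the fields on those types, the TextSurface interface, the STYLES map and the karaoke layout itself are guesses made for illustration.

```typescript
// Illustrative only: the field names on StyleContext and CaptionLayout are
// guesses, not CapStudio's real types.
interface Word { text: string; start: number; end: number } // times in seconds
interface StyleContext { words: Word[]; timeSec: number; width: number; height: number }
interface TextRun { text: string; x: number; y: number; color: string }
interface CaptionLayout { runs: TextRun[] }
type StyleLayout = (ctx: StyleContext) => CaptionLayout;

// The subset of a 2D canvas context the painter needs; DOM and offscreen
// 2D contexts both provide these two members.
interface TextSurface {
  fillStyle: string | object;
  fillText(text: string, x: number, y: number): void;
}

// One style module: a pure function from context to layout.
const karaoke: StyleLayout = ({ words, timeSec, width, height }) => ({
  runs: words.map((w, i) => ({
    text: w.text,
    x: ((i + 0.5) * width) / words.length, // spread words evenly across the frame
    y: height * 0.85,
    color: timeSec >= w.start && timeSec < w.end ? "#ffd400" : "#ffffff",
  })),
});

// The registry: adding a style means adding one entry here.
const STYLES: Record<string, StyleLayout> = { karaoke };

// The single painter: layout in, canvas calls out.
function paintLayout(surface: TextSurface, layout: CaptionLayout): void {
  for (const run of layout.runs) {
    surface.fillStyle = run.color;
    surface.fillText(run.text, run.x, run.y);
  }
}

// The single entry point shared by preview and export.
function drawCaptionFrame(surface: TextSurface, styleId: string, ctx: StyleContext): void {
  const layoutFn = STYLES[styleId];
  if (!layoutFn) throw new Error(`unknown caption style: ${styleId}`);
  paintLayout(surface, layoutFn(ctx));
}
```

Because the layout is a pure function of its context and the painter only touches the surface it is handed, the preview and the export cannot drift apart unless the surfaces themselves behave differently.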

Export: WebCodecs frames + an audio remux
Export draws every frame with the same drawCaptionFrame, encodes it with a VideoEncoder (WebCodecs), and muxes it into MP4 with mp4-muxer, copying the original audio track through.

Gotchas:

B-frames make the first chunk's DTS non-zero, which the muxer rejects. Set firstTimestampBehavior: "offset".
No backpressure = a long clip kills the tab. Feeding every sample to the decoder/encoder with no throttle floods the queues and flush() stalls around the 50% mark on a ~70s clip. The fix is to pause the loop while decodeQueueSize/encodeQueueSize > 16 and yield so the codec callbacks can drain. Short clips never showed this, which is why it shipped latent.
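
The backpressure fix can be sketched like this. It is a simplified illustration, not the shipped export loop: EncoderLike, waitForDrain and encodeAllFrames are hypothetical names, and the interface only mirrors the three VideoEncoder members the loop touches (encodeQueueSize, encode, flush).

```typescript
// Hypothetical sketch of the throttled export loop. EncoderLike is the slice of
// WebCodecs' VideoEncoder that the loop touches.
interface EncoderLike<F> {
  readonly encodeQueueSize: number;
  encode(frame: F): void;
  flush(): Promise<void>;
}

const MAX_QUEUE = 16; // threshold quoted in the text

// Yield to the event loop until the codec queue is back at or under the limit,
// so the encoder's output callbacks get a chance to run.
async function waitForDrain(queueSize: () => number, limit = MAX_QUEUE): Promise<void> {
  while (queueSize() > limit) {
    await new Promise<void>((resolve) => setTimeout(resolve, 0));
  }
}

async function encodeAllFrames<F>(
  frameCount: number,
  renderFrame: (index: number) => F,
  encoder: EncoderLike<F>,
  limit = MAX_QUEUE,
): Promise<void> {
  for (let i = 0; i < frameCount; i++) {
    // Pause here instead of flooding the queue.
    await waitForDrain(() => encoder.encodeQueueSize, limit);
    encoder.encode(renderFrame(i));
  }
  await encoder.flush();
}
```

The same wait applies on the decode side with decodeQueueSize. With real WebCodecs objects, each VideoFrame should also be closed once it has been handed to encode().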

Persistence: local-first, video stays on your machine
Projects autosave to OPFS (the Origin Private File System). A file opened with createWritable() only commits atomically on close(), so I write the video bytes first and the manifest last: a manifest on disk then always points at a fully written video. Signed-in Pro users also get cloud sync, but only the project JSON syncs (transcript, style, config), never the video bytes. The video never leaves the device, by design.
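
A rough sketch of that write order, under assumptions: the file names video.bin and project.json, the helper names, and the *Like interfaces are all made up for illustration. The interfaces mirror the calls an OPFS directory handle exposes (getFileHandle, createWritable, write, close, abort).

```typescript
// Hypothetical sketch of the save order. File names and helper names are made up.
interface WritableLike {
  write(data: Uint8Array | string): Promise<void>;
  close(): Promise<void>;
  abort(): Promise<void>;
}
interface FileHandleLike {
  createWritable(): Promise<WritableLike>;
}
interface DirectoryLike {
  getFileHandle(name: string, options?: { create?: boolean }): Promise<FileHandleLike>;
}

// Write one file in full. The new contents only replace the file on close().
async function writeWhole(
  dir: DirectoryLike,
  name: string,
  data: Uint8Array | string,
): Promise<void> {
  const handle = await dir.getFileHandle(name, { create: true });
  const writable = await handle.createWritable();
  try {
    await writable.write(data);
    await writable.close();
  } catch (error) {
    await writable.abort().catch(() => undefined); // discard the partial write
    throw error;
  }
}

async function saveProject(
  dir: DirectoryLike,
  video: Uint8Array,
  manifest: object,
): Promise<void> {
  await writeWhole(dir, "video.bin", video); // large write first
  await writeWhole(dir, "project.json", JSON.stringify(manifest)); // commit marker last
}
```

In the browser, dir would come from navigator.storage.getDirectory(). If the video write fails, saveProject throws before the manifest is touched, so the previous manifest (if any) stays in place.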

Why bother
The architecture collapses operating cost to near zero, which is the whole point: no ASR bill, no GPU render farm, no per-minute pricing pressure. It also means your footage is private and there are no export limits, because the limits would only ever be someone else's server cost. The wedge I am chasing on top of that is strong Czech and Slavic support, where the English-first incumbents are weak.

Limitations, honestly: it needs Chrome or Edge. Without a working WebGPU adapter, transcription falls back to WASM, which is correct but slow, and the first run downloads the Whisper model. It is in beta.

If you want to try it, it is free (with a watermark) at https://capstudio.xyz - no signup to start, nothing uploaded.

DEV Community: Zden
