<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Zden</title>
    <description>The latest articles on DEV Community by Zden (@capstudioq).</description>
    <link>https://dev.to/capstudioq</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4000058%2Fcd991bfe-0c87-4953-991f-073ff3d46fb3.png</url>
      <title>DEV Community: Zden</title>
      <link>https://dev.to/capstudioq</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/capstudioq"/>
    <language>en</language>
    <item>
      <title>I built a caption editor that runs 100% in the browser - Whisper on WebGPU, MP4 export with WebCodecs, no server</title>
      <dc:creator>Zden</dc:creator>
      <pubDate>Wed, 24 Jun 2026 08:01:57 +0000</pubDate>
      <link>https://dev.to/capstudioq/i-built-a-caption-editor-that-runs-100-in-the-browser-whisper-on-webgpu-mp4-export-with-1e4k</link>
      <guid>https://dev.to/capstudioq/i-built-a-caption-editor-that-runs-100-in-the-browser-whisper-on-webgpu-mp4-export-with-1e4k</guid>
      <description>&lt;p&gt;Every "add captions to your short" tool works the same way: you upload your clip to their servers, they transcribe and render it in the cloud, and they meter your exports. That means an upload wait, a queue, file-size caps, a per-export bill, and your footage sitting on someone else's disk.&lt;/p&gt;

&lt;p&gt;I wanted to know if you could do the whole thing in the browser instead. Turns out you can, and the result (CapStudio) has a strange property for a video tool: it costs me almost nothing to run, because there is no render farm and no transcription API. The only server is auth, billing, and syncing a tiny project file. That is the entire reason one person can run it.&lt;/p&gt;

&lt;p&gt;Here is how the pieces fit together.&lt;/p&gt;

&lt;p&gt;Transcription: Whisper on WebGPU, in a tab&lt;br&gt;
Transcription runs locally with @huggingface/transformers (transformers.js v4), which can execute Whisper on WebGPU. The clip's audio is decoded to a mono 16 kHz Float32 buffer with decodeAudioData and an OfflineAudioContext, then fed to the pipeline.&lt;/p&gt;

&lt;p&gt;Two things bit me here:&lt;/p&gt;

&lt;p&gt;You need word-level timestamps, which not every model can emit. Asking for return_timestamps: "word" throws on the default Whisper export ("Model outputs must contain cross attentions"). The fix is to use a _timestamped model export, which carries the cross-attention outputs. Rule of thumb: for word-timed captions, the model id must end in _timestamped.&lt;/p&gt;
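
&lt;p&gt;As a rough illustration of that rule of thumb, here is a minimal sketch of a guard that picks the timestamp mode from the model id. The helper names supportsWordTimestamps and asrOptions are hypothetical, and the model id in the note below is only an example; the option name return_timestamps is the one quoted above.&lt;/p&gt;

```typescript
// Rule of thumb from the text: only "_timestamped" exports carry the
// cross-attention outputs that word-level timestamps need.
function supportsWordTimestamps(modelId: string) {
  return modelId.endsWith("_timestamped");
}

// Hypothetical helper: build the per-call options, degrading to
// segment-level timestamps instead of letting the pipeline throw.
function asrOptions(modelId: string) {
  return supportsWordTimestamps(modelId)
    ? { return_timestamps: "word" as const }
    : { return_timestamps: true as const };
}
```

&lt;p&gt;With an id such as "onnx-community/whisper-base_timestamped" this asks for word timestamps; with a plain export it settles for segment-level ones, which is not enough for karaoke-style captions but does not throw.&lt;/p&gt;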

&lt;p&gt;navigator.gpu existing does not mean WebGPU works. On plenty of machines (hardware acceleration off, blocklisted GPU, RDP/VM) navigator.gpu is present but requestAdapter() returns null. transformers.js only checks for the object, tries WebGPU, fails, and then poisons the WASM fallback so even device: "wasm" dies. The fix is to actually call requestAdapter() yourself first and only choose WebGPU if you get a truthy adapter, otherwise go straight to a clean WASM-only path. I also added a stall watchdog: if WebGPU downloads the model but makes no progress for 45s, reject and fall back.&lt;/p&gt;
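
&lt;p&gt;The probe and the watchdog described above could look roughly like this. It is a sketch under assumptions, not the code CapStudio ships: pickDevice, stallWatchdog and the GpuLike type are hypothetical names, and the GPU object is passed in as a parameter so the logic does not depend on WebGPU typings.&lt;/p&gt;

```typescript
type Device = "webgpu" | "wasm";

// Structural stand-in for navigator.gpu so the sketch needs no WebGPU typings.
// In a browser, requestAdapter() returns a promise of an adapter or null.
interface GpuLike {
  requestAdapter: () => unknown;
}

// Probe for a real adapter instead of trusting that navigator.gpu exists.
async function pickDevice(gpu: GpuLike | undefined) {
  if (!gpu) return "wasm" as Device;
  try {
    const adapter = await gpu.requestAdapter();
    return (adapter ? "webgpu" : "wasm") as Device;
  } catch {
    return "wasm" as Device; // a throwing probe counts as "no WebGPU"
  }
}

// Stall watchdog: onStall fires if progress() is not called for stallMs.
function stallWatchdog(stallMs: number, onStall: () => void) {
  let timer = setTimeout(onStall, stallMs);
  return {
    progress() {
      clearTimeout(timer);
      timer = setTimeout(onStall, stallMs);
    },
    stop() {
      clearTimeout(timer);
    },
  };
}
```

&lt;p&gt;In the browser you would call pickDevice(navigator.gpu) before touching transformers.js at all, then create a watchdog with stallMs set to 45000, call progress() from whatever progress signal the model load exposes, and have onStall reject and restart on the WASM-only path.&lt;/p&gt;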

&lt;p&gt;Rendering: one draw path for preview and export&lt;br&gt;
Each caption style (karaoke highlight, word pop, clean lower-third, and so on) is a pure function: layout(StyleContext) -&amp;gt; CaptionLayout. A single painter turns that layout into canvas draw calls, and a single drawCaptionFrame is the only entry point used by BOTH the live preview (a &amp;lt;canvas&amp;gt; over a &amp;lt;video&amp;gt;) and the export. That is what makes "what you see is what you export" literally true. I proved it with a pixel-diff harness that draws the same frame to a DOM canvas and an OffscreenCanvas: 0 mismatching channels.&lt;/p&gt;
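
&lt;p&gt;To make the shape concrete, here is a stripped-down sketch of one style plus the shared entry point. Only the names StyleContext, CaptionLayout and drawCaptionFrame come from the description above; every field, the Painter2D subset, the injected measure function and the colors are assumptions made for illustration.&lt;/p&gt;

```typescript
interface Word { text: string; start: number; end: number }

// Everything a style may read. `measure` is injected so layout stays pure.
interface StyleContext {
  words: Word[];
  timeSec: number;
  width: number;
  height: number;
  measure: (text: string) => number;
}

interface Glyph { text: string; x: number; y: number; color: string }
interface CaptionLayout { font: string; glyphs: Glyph[] }
type CaptionStyle = (c: StyleContext) => CaptionLayout;

// Subset of the 2D context API shared by DOM and offscreen canvases.
interface Painter2D {
  font: string;
  fillStyle: unknown;
  fillText: (text: string, x: number, y: number) => void;
}

// Karaoke highlight: the word being spoken gets the accent color.
const karaoke: CaptionStyle = (c) => {
  const line = c.words.map((w) => w.text).join(" ");
  let x = (c.width - c.measure(line)) / 2; // center the line horizontally
  const glyphs = c.words.map((w) => {
    const active = c.timeSec >= w.start ? w.end > c.timeSec : false;
    const glyph = { text: w.text, x, y: c.height - 60, color: active ? "#ffd400" : "#ffffff" };
    x += c.measure(w.text + " ");
    return glyph;
  });
  return { font: "bold 48px sans-serif", glyphs };
};

// Registry: a new style is one module plus one line here.
const styles = { karaoke };

// Single entry point for both the preview and the export.
function drawCaptionFrame(ctx: Painter2D, style: CaptionStyle, c: StyleContext) {
  const layout = style(c);
  ctx.font = layout.font;
  for (const g of layout.glyphs) {
    ctx.fillStyle = g.color;
    ctx.fillText(g.text, g.x, g.y);
  }
}
```

&lt;p&gt;Because the style only returns data and the painter only consumes it, the same call sequence reaches whichever context is passed in. In a real build, measure would typically wrap the context's measureText after the font is set.&lt;/p&gt;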

&lt;p&gt;Adding a new style is one new module plus one registry line, with zero engine changes.&lt;/p&gt;

&lt;p&gt;Export: WebCodecs frames + an audio remux&lt;br&gt;
Export draws every frame with the same drawCaptionFrame, encodes it with a VideoEncoder (WebCodecs), and muxes it into MP4 with mp4-muxer, copying the original audio track through.&lt;/p&gt;

&lt;p&gt;Gotchas:&lt;/p&gt;

&lt;p&gt;B-frames make the first chunk's DTS non-zero, which the muxer rejects. Set firstTimestampBehavior: "offset".&lt;br&gt;
No backpressure = a long clip kills the tab. Feeding every sample to the decoder/encoder with no throttle floods the queues and flush() stalls around the 50% mark on a ~70s clip. The fix is to pause the loop while decodeQueueSize/encodeQueueSize &amp;gt; 16 and yield so the codec callbacks can drain. Short clips never showed this, which is why it shipped latent.&lt;/p&gt;

&lt;p&gt;Persistence: local-first, video stays on your machine&lt;br&gt;
Projects autosave to OPFS (createWritable() commits atomically on close(), so write video bytes first, then the manifest). Signed-in Pro users also get cloud sync, but only the project JSON syncs (transcript, style, config), never the video bytes. The video never leaves the device, by design.&lt;/p&gt;
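
&lt;p&gt;Going back to the backpressure gotcha from the export section: the throttle itself is small enough to sketch without any WebCodecs objects. The names waitForRoom and pump are hypothetical, and the queue size is read through a callback so the same loop could watch either decodeQueueSize or encodeQueueSize. The limit of 16 is the one quoted above.&lt;/p&gt;

```typescript
// Hypothetical adapter: in the browser this would wrap
// decoder.decodeQueueSize or encoder.encodeQueueSize.
type QueueSize = () => number;

const MAX_QUEUED = 16; // threshold quoted in the text

// Wait until the codec queue has room, yielding to the event loop
// each turn so the codec's output callbacks get a chance to run.
async function waitForRoom(queueSize: QueueSize, limit = MAX_QUEUED) {
  while (queueSize() > limit) {
    await new Promise((resolve) => setTimeout(resolve, 0));
  }
}

// Feed `count` items one at a time, pausing whenever the queue is over the limit.
async function pump(count: number, submit: (index: number) => void, queueSize: QueueSize) {
  for (let i = 0; count > i; i += 1) {
    await waitForRoom(queueSize);
    submit(i);
  }
}
```

&lt;p&gt;In an export loop, submit would hand frame i to the encoder and queueSize would return encoder.encodeQueueSize. The yield uses a timer rather than a resolved promise on purpose: a microtask alone would not let the codec callbacks run.&lt;/p&gt;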

&lt;p&gt;Why bother&lt;br&gt;
The architecture collapses operating cost to near zero, which is the whole point: no ASR bill, no GPU render farm, no per-minute pricing pressure. It also means your footage is private and there are no export limits, because the limits would only ever be someone else's server cost. The wedge I am chasing on top of that is strong Czech and Slavic support, where the English-first incumbents are weak.&lt;/p&gt;

&lt;p&gt;Limitations, honestly: it needs Chrome or Edge. Without a working WebGPU adapter, transcription falls back to WASM, which is correct but slow. The first run also downloads the Whisper model. It is in beta.&lt;/p&gt;

&lt;p&gt;If you want to try it, it is free (with a watermark) at &lt;a href="https://capstudio.xyz" rel="noopener noreferrer"&gt;https://capstudio.xyz&lt;/a&gt; - no signup to start, nothing uploaded.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>python</category>
    </item>
  </channel>
</rss>
