DEV Community

Uchit Chakma

Building an AI Auto-Caption Tool for Videos with React + WebAssembly

Caption generation is one of the most in-demand features in video editing today. Social media creators need captions for Instagram Reels, TikTok, YouTube Shorts, and more. Traditional server-based solutions require expensive GPU infrastructure for transcription and rendering. But what if you could do everything in the browser?

In this article, I will walk through the architecture of building a client-side auto-caption tool using React, WebAssembly, and modern browser APIs. The approach we used for Captionator keeps everything on the user's machine — no uploads, no server costs, and complete privacy.

Why Client-Side Processing?

Most video caption tools follow a server-based architecture. You upload a video, it gets processed on the backend, and you download the result. This works, but it has drawbacks:

  • Upload times for large video files
  • Server costs for GPU-backed transcription
  • Privacy concerns when handling sensitive footage
  • File size limitations

Modern browsers have evolved dramatically. With WebAssembly (Wasm), we can run computationally intensive tasks directly in the browser at near-native speed. Combined with the Web Audio API for audio extraction and Canvas API for video rendering, a fully client-side caption tool is not just possible — it is practical.

The Core Architecture

Our stack for Captionator looks like this:

Frontend: React with TypeScript
Audio Processing: Web Audio API + WebAssembly-ported Whisper model
Video Rendering: HTML5 Canvas + WebCodecs API
Styling: CSS-in-JS with custom animation engine

The audio transcription step is the hardest part. Speech-to-text models are typically large and require GPU acceleration. However, with quantised Whisper models exported to ONNX and executed by ONNX Runtime Web's WebAssembly backend, we can run inference directly in the browser.

Step 1: Audio Extraction

Before we can transcribe, we need raw audio from the video file. The Web Audio API makes this straightforward:

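// videoFile is a File or Blob, e.g. from an <input type="file"> element
const audioContext = new AudioContext();
// Read the file locally; fetch() expects a URL, not a File object
const arrayBuffer = await videoFile.arrayBuffer();
// Decodes the audio track, provided the browser supports the container and codec
const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);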

This gives us an AudioBuffer holding the decoded PCM samples, which we can then prepare and feed into our transcription model.
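Whisper models take 16 kHz mono audio, while decoded browser audio is usually 44.1 or 48 kHz and often stereo. The following is a minimal sketch of that preparation step, not necessarily how Captionator does it. The helper names (`downmixToMono`, `resampleLinear`, `toWhisperInput`) are hypothetical, and the linear interpolation used here skips the low-pass filtering a production resampler would apply.

```typescript
// Sample rate that Whisper models expect for their mono input.
const WHISPER_SAMPLE_RATE = 16_000;

/** Average all channels into a single mono channel. */
function downmixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) return new Float32Array(0);
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) {
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}

/** Resample by linear interpolation (hypothetical helper, no anti-alias filter). */
function resampleLinear(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  if (fromRate === toRate) return input.slice();
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const output = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    output[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return output;
}

/** Downmix and resample decoded channel data to 16 kHz mono. */
function toWhisperInput(channels: Float32Array[], sampleRate: number): Float32Array {
  return resampleLinear(downmixToMono(channels), sampleRate, WHISPER_SAMPLE_RATE);
}
```

In the browser, the channel arrays could be collected by calling `audioBuffer.getChannelData(c)` for each channel index, with `audioBuffer.sampleRate` as the source rate. An `OfflineAudioContext` created at 16 kHz is another way to get the resampling done natively.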

Step 2: Running Transcription in the Browser

For the transcription engine, we use a quantised version of OpenAI's Whisper model exported to ONNX. ONNX Runtime Web handles the inference through its WebAssembly backend:

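import { InferenceSession, Tensor } from 'onnxruntime-web';

// Load the model once and reuse the session for every run
const session = await InferenceSession.create(whisperModelPath);
// The feed key ('input' here) must match the input name of the exported model
const feeds = { input: new Tensor('float32', audioData, dimensions) };
const results = await session.run(feeds);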

The key optimisation here is model quantisation. A full Whisper model is around 1.5GB. By quantising to 8-bit integers and trimming unnecessary layers, we get it down to about 300MB while maintaining over 95% accuracy.
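To see roughly where the size reduction comes from, here is a back-of-the-envelope calculation. It assumes the 1.5GB figure refers to 32-bit float weights, which the article does not state, and `estimateWeightBytes` is a hypothetical helper that ignores file-format overhead.

```typescript
/** Approximate weight storage in bytes for a parameter count and bit width. */
function estimateWeightBytes(parameterCount: number, bitsPerWeight: number): number {
  return (parameterCount * bitsPerWeight) / 8;
}

// Assumption: the 1.5 GB model stores each weight as a 32-bit float.
const fp32Bytes = 1.5e9;
const parameterCount = fp32Bytes / 4; // 4 bytes per float32 weight

// The same weights stored as 8-bit integers.
const int8Bytes = estimateWeightBytes(parameterCount, 8);

// Size reduction from quantisation alone.
const compressionRatio = fp32Bytes / int8Bytes;
```

Under that assumption, 8-bit quantisation alone gives about 375MB, a 4x reduction. The remaining gap down to about 300MB would then come from the layer trimming mentioned above.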

Step 3: Word-Level Timing

Raw transcription gives you text, but no per-word timing. For animated captions, we need word-level timestamps. This requires a forced alignment pass after the initial transcription.

The Wav2Vec2 alignment model, also compiled to WebAssembly, handles this. It maps each word to its exact position in the audio timeline. This is what enables those smooth, perfectly synced caption animations.
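Once alignment has produced per-word timestamps, they need to be grouped into short on-screen captions and looked up by playback time. The sketch below assumes the aligner's output can be mapped to a `{ text, start, end }` shape in seconds. The type names, the helper names, and the default limits (4 words, 0.6 s pause) are illustrative assumptions, not Captionator's actual values.

```typescript
interface WordTiming {
  text: string;
  start: number; // seconds
  end: number;   // seconds
}

interface CaptionCue {
  start: number;
  end: number;
  words: WordTiming[];
}

/** Start a new cue after maxWords words or after a pause longer than maxGap seconds. */
function groupWordsIntoCues(words: WordTiming[], maxWords = 4, maxGap = 0.6): CaptionCue[] {
  const cues: CaptionCue[] = [];
  let current: WordTiming[] = [];
  const flush = () => {
    if (current.length === 0) return;
    cues.push({ start: current[0].start, end: current[current.length - 1].end, words: current });
    current = [];
  };
  for (const word of words) {
    const previous = current[current.length - 1];
    const gapTooLong = previous !== undefined && word.start - previous.end > maxGap;
    if (current.length >= maxWords || gapTooLong) flush();
    current.push(word);
  }
  flush();
  return cues;
}

/** Binary search for the cue covering `time`; cues must be sorted and non-overlapping. */
function findActiveCue(cues: CaptionCue[], time: number): CaptionCue | undefined {
  let low = 0;
  let high = cues.length - 1;
  while (low <= high) {
    const mid = (low + high) >> 1;
    const cue = cues[mid];
    if (time < cue.start) high = mid - 1;
    else if (time >= cue.end) low = mid + 1;
    else return cue;
  }
  return undefined;
}
```

Keeping the cues sorted means the per-frame lookup stays cheap even for long videos, which matters when it runs once for every decoded frame.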

Step 4: Rendering Captions onto Video

This is where the magic happens. Using Canvas and WebCodecs, we decode the original video frame by frame and overlay captions:

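// ctx is the 2D context of the output canvas; captions holds the timed words
const decoder = new VideoDecoder({
  output: frame => {
    // frame.timestamp is in microseconds
    const currentTime = frame.timestamp / 1_000_000;
    ctx.drawImage(frame, 0, 0);
    renderCaptions(ctx, currentTime, captions);
    frame.close(); // release the frame's memory promptly
  },
  // The error callback is required by the VideoDecoder constructor
  error: e => console.error('Video decode failed:', e)
});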

The caption renderer supports customisable styles — font, colour, animation, position. Each caption word gets its own animation properties: scale, opacity, and position offsets that create that viral "MrBeast-style" effect.
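As a rough illustration of per-word animation properties, the sketch below computes scale, opacity, and a vertical offset from the current playback time. The `animateWord` function, the 0.15 s pop duration, and the starting values are assumptions for the example. The easing is the commonly used "ease out back" curve, which overshoots slightly before settling.

```typescript
interface WordTiming {
  text: string;
  start: number; // seconds
  end: number;   // seconds
}

interface WordAnimation {
  scale: number;
  opacity: number;
  offsetY: number; // pixels, positive values push the word down
}

function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

/** "Ease out back": goes from 0 to 1 and overshoots 1 along the way. */
function easeOutBack(t: number): number {
  const c1 = 1.70158;
  const c3 = c1 + 1;
  return 1 + c3 * Math.pow(t - 1, 3) + c1 * Math.pow(t - 1, 2);
}

/** Animation state for one word at a given playback time (hypothetical values). */
function animateWord(word: WordTiming, time: number, popDuration = 0.15): WordAnimation {
  // Words that have not been spoken yet stay hidden.
  if (time < word.start) return { scale: 0.8, opacity: 0, offsetY: 10 };
  const progress = clamp01((time - word.start) / popDuration);
  const eased = easeOutBack(progress);
  return {
    scale: 0.8 + 0.2 * eased,  // pops from 80% up to 100%
    opacity: progress,          // fades in linearly
    offsetY: 10 * (1 - eased),  // slides up into place
  };
}
```

A renderer could apply these values on the canvas with `ctx.globalAlpha`, `ctx.translate`, and `ctx.scale` before calling `ctx.fillText` for each word.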

Step 5: Export Options

Users need different output formats. We support:

  1. MP4 with baked-in captions: using Canvas captureStream + MediaRecorder
  2. SRT file export: for professional editing workflows
  3. WebVTT export: for web video players

The SRT export is particularly useful for YouTubers who want to upload subtitle files separately.
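The two text formats are close cousins: SRT numbers each cue and uses a comma before the milliseconds, while WebVTT starts with a `WEBVTT` header and uses a dot. A minimal sketch of both exporters follows, assuming cues are already available as `{ start, end, text }` in seconds. The `SubtitleCue` type and function names are made up for this example.

```typescript
interface SubtitleCue {
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

/** Format seconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (WebVTT). */
function formatTimestamp(seconds: number, separator: "," | "."): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSeconds = Math.floor(totalMs / 1000);
  const s = totalSeconds % 60;
  const m = Math.floor(totalSeconds / 60) % 60;
  const h = Math.floor(totalSeconds / 3600);
  const pad = (n: number, width: number) => String(n).padStart(width, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}${separator}${pad(ms, 3)}`;
}

/** SRT: numbered cues separated by blank lines. */
function toSrt(cues: SubtitleCue[]): string {
  return cues
    .map((cue, i) =>
      `${i + 1}\n${formatTimestamp(cue.start, ",")} --> ${formatTimestamp(cue.end, ",")}\n${cue.text}\n`)
    .join("\n");
}

/** WebVTT: a WEBVTT header, then cues separated by blank lines. */
function toWebVtt(cues: SubtitleCue[]): string {
  const body = cues
    .map(cue =>
      `${formatTimestamp(cue.start, ".")} --> ${formatTimestamp(cue.end, ".")}\n${cue.text}\n`)
    .join("\n");
  return `WEBVTT\n\n${body}`;
}
```

The resulting string can be wrapped in a `Blob` and offered as a download through `URL.createObjectURL`, so the export also stays entirely on the client.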

Performance Considerations

Running Whisper in the browser is not trivial. Here are the key optimisations we made:

  • SIMD acceleration: WebAssembly SIMD instructions speed up matrix operations by 3-4x
  • Multi-threading: SharedArrayBuffer + Web Workers distribute the workload across CPU cores
  • Progressive processing: Transcription starts as soon as enough audio is buffered — no need to wait for the full extract
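Progressive processing can be pictured as cutting the audio into windows and handing each one to a worker as soon as it is available. Whisper's encoder operates on 30-second windows, so that is the default below. This is a simplified sketch with hypothetical names; it cuts at fixed positions, whereas a real pipeline would likely overlap windows or cut at silences to avoid splitting words.

```typescript
interface AudioChunk {
  startSeconds: number;  // offset of this chunk in the full recording
  samples: Float32Array; // view into the original buffer, not a copy
}

/** Split mono samples into fixed-length windows; the last one may be shorter. */
function chunkAudio(samples: Float32Array, sampleRate = 16_000, windowSeconds = 30): AudioChunk[] {
  const windowSize = sampleRate * windowSeconds;
  const chunks: AudioChunk[] = [];
  for (let offset = 0; offset < samples.length; offset += windowSize) {
    chunks.push({
      startSeconds: offset / sampleRate,
      // subarray clamps the end index, so the final chunk is simply shorter
      samples: samples.subarray(offset, offset + windowSize),
    });
  }
  return chunks;
}
```

The `startSeconds` offset is what lets timestamps from each chunk be shifted back onto the full video timeline. For the multi-threading point, note that `SharedArrayBuffer` is only available on cross-origin isolated pages, which means serving the app with the appropriate COOP and COEP headers.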

On a modern M-series Mac or a high-end Windows laptop, transcription completes in about 30% of the video duration. A 2-minute video transcribes in roughly 40 seconds.

What About Mobile?

Mobile browsers have less memory and slower CPUs. For mobile support, we fall back to a smaller distilled Whisper model (the "tiny" variant) that runs in under 100MB of RAM. Accuracy drops slightly but remains above 90%.
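One way to implement such a fallback is to pick the model from a few device hints before downloading anything. The sketch below is an assumption about how that choice could look: the `chooseModel` function, the variant names, and the 4GB and 4-core thresholds are all hypothetical.

```typescript
type ModelVariant = "tiny" | "quantised-full";

interface DeviceHints {
  isMobile: boolean;
  deviceMemoryGB?: number; // may be missing: not every browser reports it
  cores?: number;
}

/** Prefer the small model on mobile or when the device looks constrained. */
function chooseModel(hints: DeviceHints): ModelVariant {
  const lowMemory = hints.deviceMemoryGB !== undefined && hints.deviceMemoryGB < 4;
  const fewCores = hints.cores !== undefined && hints.cores < 4;
  return hints.isMobile || lowMemory || fewCores ? "tiny" : "quantised-full";
}
```

In the browser, the hints could come from `navigator.hardwareConcurrency` and, where available, `navigator.deviceMemory`, which is only exposed by some browsers. Missing values are treated as "unknown" rather than "low" so that desktop browsers without the memory hint still get the larger model.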

The Business Case

Building a client-side tool eliminates server costs entirely. For a startup or indie developer, this is massive. No GPU servers to rent, no bandwidth costs for video uploads, no storage for processed files.

This is exactly why we built Captionator the way we did. It processes everything locally, stays free to use, and handles videos of any length since there are no upload limits.

Future Improvements

We are exploring several enhancements:

  • GPU acceleration via WebGPU — for even faster transcription
  • Language detection — auto-detect and transcribe 50+ languages
  • Emoji insertion — automatically add emojis based on speech sentiment
  • Batch processing — caption multiple videos in sequence

Conclusion

Client-side video processing is not just a gimmick. It is a legitimate architecture choice that saves money, protects user privacy, and delivers a better experience. With modern WebAssembly, Canvas, and WebCodecs APIs, the browser has become a powerful media processing platform.

If you are building a video tool, consider this approach. Your users will thank you for the speed and privacy. And your AWS bill will thank you for the zero server costs.

Check out Captionator to see the architecture in action. The entire tool is open for testing: no signup needed, just choose a video and go.
