<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Fawwaaz Sheik</title>
    <description>The latest articles on DEV Community by Fawwaaz Sheik (@fzsheik).</description>
    <link>https://dev.to/fzsheik</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3920379%2F2defd7f8-10e8-49bd-97af-20ac0cbeb110.png</url>
      <title>DEV Community: Fawwaaz Sheik</title>
      <link>https://dev.to/fzsheik</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fzsheik"/>
    <language>en</language>
    <item>
      <title>How I built a private audio transcription tool in browser using Transformers.js</title>
      <dc:creator>Fawwaaz Sheik</dc:creator>
      <pubDate>Fri, 08 May 2026 15:59:34 +0000</pubDate>
      <link>https://dev.to/fzsheik/how-i-built-a-private-audio-transcription-tool-in-browser-using-transformersjs-1fl0</link>
      <guid>https://dev.to/fzsheik/how-i-built-a-private-audio-transcription-tool-in-browser-using-transformersjs-1fl0</guid>
      <description>&lt;p&gt;So my dad needed to transcribe an interview. Simple enough right? Except he refused to upload his voice to any cloud service which honestly makes total sense. I went looking for local options and everything required installing Python, managing dependencies, running terminal commands. An hour of setup minimum. Not happening.&lt;/p&gt;

&lt;p&gt;So instead of doing the setup, I just built it myself. Took about 5 hours. Here's how the whole thing works under the hood.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The core architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The fundamental insight is that you can run Whisper, the same model powering many cloud transcription services, directly in the browser using WebAssembly. No server needed. Transformers.js by Hugging Face handles all the heavy lifting: model downloading, caching, ONNX inference, and audio chunking.&lt;/p&gt;

&lt;p&gt;Here's the high level flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Freazy7zye80s6nqxicff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Freazy7zye80s6nqxicff.png" alt=" " width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;
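
&lt;p&gt;To make this concrete, here's a minimal sketch of the happy path, assuming the &lt;code&gt;@huggingface/transformers&lt;/code&gt; package and the &lt;code&gt;Xenova/whisper-tiny.en&lt;/code&gt; checkpoint (the &lt;code&gt;audio&lt;/code&gt; variable comes from the decoding step covered below); the real app wraps this in more plumbing:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import { pipeline } from '@huggingface/transformers';

// First run downloads the model and caches it in the browser;
// later runs load straight from cache.
const transcriber = await pipeline(
  'automatic-speech-recognition',
  'Xenova/whisper-tiny.en'
);

// `audio` is a 16kHz mono Float32Array (see the decoding section below).
const output = await transcriber(audio, {
  chunk_length_s: 30,  // split long audio into 30-second windows
  stride_length_s: 5,  // overlap windows so words aren't cut in half
  return_timestamps: true,
});

console.log(output.text);
&lt;/code&gt;&lt;/pre&gt;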

&lt;p&gt;&lt;strong&gt;Why a Web Worker is non-negotiable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the most important architectural decision. Whisper is computationally heavy. If you run it on the main thread, your entire UI freezes: no progress bar updates, no animations, nothing. The tab locks up until transcription finishes.&lt;/p&gt;

&lt;p&gt;A Web Worker runs on a completely separate thread. Your React UI stays alive and responsive while Whisper churns through the audio in the background. They communicate via &lt;code&gt;postMessage&lt;/code&gt;: the worker fires an update after each 30-second chunk and React renders the partial text as it arrives.&lt;/p&gt;

&lt;p&gt;The two-thread model looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxkomnezw9uvg9f1m3itj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxkomnezw9uvg9f1m3itj.png" alt=" " width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;
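
&lt;p&gt;A stripped-down sketch of the worker side, assuming a module worker and hypothetical message names (&lt;code&gt;partial&lt;/code&gt;, &lt;code&gt;done&lt;/code&gt;); the real app's protocol may differ:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// worker.js: runs on its own thread, so inference never blocks the UI
import { pipeline } from '@huggingface/transformers';

const SAMPLE_RATE = 16000;
const CHUNK_SECONDS = 30;

let transcriberPromise = null;

self.onmessage = async (event) => {
  const { audio } = event.data; // 16kHz mono Float32Array from the main thread

  // Load the pipeline once and reuse it across messages.
  transcriberPromise ??= pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en');
  const transcriber = await transcriberPromise;

  // Naive chunking: slice the samples into 30-second windows and
  // post each window's text to the UI as soon as it's ready.
  const chunkSize = SAMPLE_RATE * CHUNK_SECONDS;
  for (let offset = 0; offset &amp;lt; audio.length; offset += chunkSize) {
    const { text } = await transcriber(audio.subarray(offset, offset + chunkSize));
    self.postMessage({ type: 'partial', text });
  }
  self.postMessage({ type: 'done' });
};
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Slicing by hand like this can cut a word in half at a chunk boundary; the library's built-in &lt;code&gt;chunk_length_s&lt;/code&gt;/&lt;code&gt;stride_length_s&lt;/code&gt; options (shown earlier) handle that with overlapping windows.&lt;/p&gt;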

&lt;p&gt;&lt;strong&gt;The gotcha nobody tells you about: audio decoding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Whisper doesn't understand MP3 or WAV files directly. It needs raw audio samples, specifically 16,000 samples per second (16kHz mono) as a &lt;code&gt;Float32Array&lt;/code&gt;. So before you even touch Transformers.js you have to decode the file using the browser's &lt;code&gt;AudioContext&lt;/code&gt; API and resample it.&lt;/p&gt;

&lt;p&gt;This is the step that trips everyone up. You can't just hand the worker the raw file; Whisper needs decoded audio. Decode the file on the main thread using &lt;code&gt;AudioContext&lt;/code&gt;, resample to 16kHz, then pass the &lt;code&gt;Float32Array&lt;/code&gt; to the worker. Miss this step and Whisper just produces garbage.&lt;/p&gt;
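
&lt;p&gt;Here's roughly what that decode step looks like, as a sketch against the standard Web Audio API (the function name is mine):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Main thread: decode and resample before handing audio to the worker.
async function decodeTo16kHzMono(file) {
  const arrayBuffer = await file.arrayBuffer();

  // Asking AudioContext for 16000 Hz makes the browser resample for us.
  const audioContext = new AudioContext({ sampleRate: 16000 });
  const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);

  // Downmix stereo to mono by averaging the channels.
  if (audioBuffer.numberOfChannels === 2) {
    const left = audioBuffer.getChannelData(0);
    const right = audioBuffer.getChannelData(1);
    const mono = new Float32Array(left.length);
    for (let i = 0; i &amp;lt; left.length; i++) {
      mono[i] = (left[i] + right[i]) / 2;
    }
    return mono;
  }
  return audioBuffer.getChannelData(0); // already mono
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Passing the result with &lt;code&gt;worker.postMessage({ audio }, [audio.buffer])&lt;/code&gt; transfers the buffer instead of copying it, which matters for long recordings.&lt;/p&gt;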

&lt;p&gt;&lt;strong&gt;WebGPU: promising but not ready to hardcode&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I spent a while trying to get WebGPU acceleration working. The results were all over the place: Chrome gave me a speedup, Firefox hung for 200 seconds, and my own Zen Browser (Firefox-based) silently fell back to WASM.&lt;/p&gt;

&lt;p&gt;My conclusion: let Transformers.js auto-detect. Don't hardcode &lt;code&gt;device: 'webgpu'&lt;/code&gt;. On Apple Silicon Macs the unified memory architecture means CPU and GPU share the same pool, so the automatic mode actually balances work across both and ends up faster than forcing either one explicitly. Hardcoding loses you that optimization.&lt;/p&gt;
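
&lt;p&gt;In code, letting it auto-detect just means leaving the option out; a minimal sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Let the library pick the backend based on what the browser supports.
const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en');

// Tempting, but brittle across browsers:
// const transcriber = await pipeline('automatic-speech-recognition',
//   'Xenova/whisper-tiny.en', { device: 'webgpu' });
&lt;/code&gt;&lt;/pre&gt;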

&lt;p&gt;&lt;strong&gt;Real world speed numbers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On an M3 Mac via WASM with the Xenova models, for a 24-second test clip:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Tiny model: ~2700ms, roughly 7 seconds per minute of audio&lt;/li&gt;
  &lt;li&gt;Base model: ~8000ms, roughly 20 seconds per minute of audio&lt;/li&gt;
  &lt;li&gt;Small model: ~35000ms, roughly 90 seconds per minute, slower than realtime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tiny is the right default. Most people transcribing interviews or lectures don't need the accuracy difference between tiny and base. Let them upgrade if they want.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The model download UX problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The tiny model is a ~40MB download on first load. On slow connections that can take a minute, and without clear feedback users assume it's broken and close the tab. I added a progress bar with the exact MB count, plus fun facts that rotate every few seconds during the download. The fun facts are dumb, but they work: people stay.&lt;/p&gt;

&lt;p&gt;The key message to show: "This only downloads once; future visits are instant." That reframes a one-time annoyance as a feature.&lt;/p&gt;
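
&lt;p&gt;The MB counter comes from the library's &lt;code&gt;progress_callback&lt;/code&gt; hook, which fires while files download. A sketch, with a hypothetical message shape:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// In the worker: surface download progress so the first load
// doesn't look like a hang.
const transcriber = await pipeline(
  'automatic-speech-recognition',
  'Xenova/whisper-tiny.en',
  {
    progress_callback: (p) => {
      if (p.status === 'progress') {
        const loadedMb = (p.loaded / 1024 / 1024).toFixed(1);
        const totalMb = (p.total / 1024 / 1024).toFixed(1);
        self.postMessage({ type: 'download', label: `${loadedMb} / ${totalMb} MB` });
      }
    },
  }
);
&lt;/code&gt;&lt;/pre&gt;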

&lt;p&gt;&lt;strong&gt;What I'd do differently&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Skip building the UI before proving the worker works. I spent time on design before the core was solid. The right order is: get the worker transcribing a test file in the console, build the hook that bridges React and the worker, then build UI around that working foundation.&lt;/p&gt;
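
&lt;p&gt;For reference, that bridging hook can be as small as this; a hypothetical &lt;code&gt;useTranscriber&lt;/code&gt; matching the worker sketch above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import { useEffect, useRef, useState } from 'react';

export function useTranscriber() {
  const workerRef = useRef(null);
  const [partials, setPartials] = useState([]);
  const [status, setStatus] = useState('idle');

  useEffect(() => {
    const worker = new Worker(new URL('./worker.js', import.meta.url), { type: 'module' });
    worker.onmessage = ({ data }) => {
      if (data.type === 'partial') setPartials((prev) => [...prev, data.text]);
      if (data.type === 'done') setStatus('done');
    };
    workerRef.current = worker;
    return () => worker.terminate(); // clean up on unmount
  }, []);

  const transcribe = (audio) => {
    setPartials([]);
    setStatus('working');
    // Transfer the samples to the worker without copying them.
    workerRef.current.postMessage({ audio }, [audio.buffer]);
  };

  return { transcribe, partials, status };
}
&lt;/code&gt;&lt;/pre&gt;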

&lt;p&gt;Also test on Windows Chrome early. My dev setup is Mac with Zen Browser and I hit WebGPU issues I wouldn't have caught without testing cross-platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's next&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Speaker diarization is the most requested feature by far. Every competitor's comment thread is full of people wishing for "Speaker A / Speaker B" labels. It's not possible cleanly in the browser yet, since pyannote (the standard diarization library) hasn't been ported to Transformers.js. That port is coming, but there may be an alternative stack we can apply right now.&lt;/p&gt;

&lt;p&gt;Language selection is a quick win: Whisper supports 99 languages natively; it just needs a UI selector exposed.&lt;/p&gt;
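
&lt;p&gt;With a multilingual checkpoint (e.g. &lt;code&gt;Xenova/whisper-tiny&lt;/code&gt;, without the &lt;code&gt;.en&lt;/code&gt; suffix), it's a per-call option in Transformers.js; the value would come from that selector:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;const output = await transcriber(audio, {
  language: 'french',  // the user's pick from the UI selector
  task: 'transcribe',  // or 'translate' to get English output
});
&lt;/code&gt;&lt;/pre&gt;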

&lt;p&gt;AI post-processing is the monetization path: sending the transcript text (not the audio) to an LLM for cleanup, filler-word removal, and summaries. The privacy story stays intact since it's just text.&lt;/p&gt;

&lt;p&gt;You can try it at &lt;a href="https://usewhispy.com" rel="noopener noreferrer"&gt;usewhispy.com&lt;/a&gt;. Free, no account, works offline after first load.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>audio</category>
      <category>webgpu</category>
      <category>privacy</category>
    </item>
  </channel>
</rss>
