DEV Community

zephyr zheng

Originally published at telegra.ph

The Unit Economics of Speech-to-Text Just Collapsed

The unit economics of speech-to-text just collapsed. Cloud ASR pricing is a leftover from when inference required someone else's GPU. It no longer does.

Run the numbers on current public rate cards. OpenAI's Whisper endpoint still bills $0.006 per minute ($0.36/hr) on standard usage (OpenAI docs). Deepgram's pricing page lists Nova-3 at $0.0077/min monolingual and $0.0092/min multilingual on Pay-As-You-Go, dropping to $0.0065 and $0.0078 on their Growth tier. Those numbers aren't high on an absolute basis. They're high relative to the marginal cost of running the same model locally, which rounded down to zero sometime in late 2024.
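To make "run the numbers" concrete, here is the arithmetic behind those rate cards. The per-minute rates come from the paragraph above; the 5,000-hour annual volume is an illustrative assumption, not a figure from any vendor.

```python
# Annual spend at published per-minute ASR rates, for an illustrative
# workload of 5,000 hours of audio per year (assumed volume).
RATES_PER_MIN = {
    "OpenAI Whisper API": 0.006,
    "Deepgram Nova-3 (mono, PAYG)": 0.0077,
    "Deepgram Nova-3 (multi, PAYG)": 0.0092,
}

HOURS_PER_YEAR = 5_000
MINUTES_PER_YEAR = HOURS_PER_YEAR * 60  # 300,000 billable minutes

for name, rate in RATES_PER_MIN.items():
    per_hour = rate * 60
    annual = rate * MINUTES_PER_YEAR
    print(f"{name}: ${per_hour:.3f}/hr -> ${annual:,.0f}/yr")
```

At the OpenAI rate that workload is $1,800 a year, every year — against a local marginal cost that is effectively the electricity bill.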

What Actually Shipped

Look at what arrived between mid-2023 and mid-2025. Gandhi et al.'s Distil-Whisper (2023) distilled large-v2 into a 756M-param student that runs 6× faster with a 1% WER gap on out-of-distribution audio, using large-scale pseudo-labelling. Georgi Gerganov's whisper.cpp made CPU-only and mobile inference a default rather than a party trick; a base.en checkpoint transcribes real-time on an M1 without touching a GPU. Max Bain's WhisperX added forced-alignment and diarization on top, so word-level timestamps and speaker labels stopped being a premium-tier differentiator.

Then WebGPU landed in stable Chromium, and the browser became a viable inference target. The last six-minute YouTube pull I ran finished in 43 seconds on a 2021 MacBook with the tab open — no upload, no key, no minute meter ticking. I built this browser-native transcriber partly to see where the ceiling actually is. It's higher than I expected.
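For scale, the real-time factor implied by that anecdote works out like this (both numbers are from the paragraph above; nothing here is model-specific):

```python
# Real-time factor (RTF) implied by the anecdote: six minutes of
# audio transcribed in 43 seconds of wall-clock time, in-browser.
audio_seconds = 6 * 60   # 360 s of input audio
wall_seconds = 43        # measured wall-clock time

rtf = audio_seconds / wall_seconds  # >1 means faster than real time
print(f"~{rtf:.1f}x faster than real time")
```

Roughly 8x real time on a 2021 laptop, in a browser tab, is already past the point where batch transcription latency is dominated by anything other than the audio download itself.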

Benchmark-wise, the gap has also closed. The Hugging Face Open ASR Leaderboard shows open-weight checkpoints clustering with proprietary endpoints on LibriSpeech, TED-LIUM, and multilingual FLEURS splits, with the top open entries beating some closed APIs on real-world noisy audio. Mistral's Voxtral technical report (July 2025) argues that speech-LLMs trained on the same web-scale regime as the original Whisper paper now match or surpass it while also handling instruction-following. None of this requires a vendor.

Why the Rate Cards Haven't Moved

Compute, bandwidth, R&D amortization, SLA overhead — all of that still costs real money. But the marginal minute of audio no longer does once the model sits on a device the user already owns. This is the same economic shape as cloud-hosted IDEs when local VS Code plus containers caught up: the thing being sold is still real work, but the marginal-minute framing stops mapping to reality. It's also what happened to server-side OCR once Tesseract.js and the Shape Detection API made in-page text extraction a browser primitive.

Charging $0.006/min for a model anyone can run on their laptop is a durable business only as long as the buyer doesn't know, or the integration cost exceeds the savings. For dev teams moving more than a few thousand hours a year through an ASR pipeline, the integration cost is now an afternoon — pick a quantized checkpoint, wire in WhisperX for diarization, ship. Simon Willison's Whisper notes catalogue three years of people discovering exactly this, usually with mild surprise.
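One way to frame "the integration cost exceeds the savings" is as a break-even volume. A quick sketch — the one-off integration cost figure (roughly an engineer-afternoon plus testing) is my assumption, not the author's:

```python
# Break-even audio volume: hours at which cumulative API spend at a
# per-minute rate exceeds a one-off cost of standing up a local
# pipeline. The $500 integration figure is an illustrative assumption.
API_RATE_PER_MIN = 0.006    # OpenAI Whisper endpoint rate
INTEGRATION_COST = 500.0    # one-off cost to go local (assumed)

rate_per_hour = API_RATE_PER_MIN * 60
break_even_hours = INTEGRATION_COST / rate_per_hour
print(f"Break-even at ~{break_even_hours:,.0f} hours of audio")
```

Under those assumptions the local pipeline pays for itself before 1,400 hours of audio — well under the "few thousand hours a year" threshold in the paragraph above, and the break-even only drops as volume grows.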

The closed vendors aren't wrong to still charge; SLAs, scale, and support are real products. But a companion free-tools page exists because the natural baseline for basic transcription is now the browser, and the rate card should reflect that.
