What It Takes to Build Real-Time AI Audio Transcription (Lessons from Studying Vomo)

#ai #machinelearning #whisper #webdev

I spent a weekend recently trying to build a small transcription feature for an internal tool, assuming it would be a wrapper around a Whisper API call. Two days later I had a new respect for products that do this well.

This post is a breakdown of what actually sits inside a production transcription pipeline, using VOMO AI as a reference point for what "done properly" looks like from the outside. I don't work there and have no inside knowledge of their stack. This is a study of the problem space, with their product as the benchmark I kept failing to match.

The naive version works until it doesn't

Here's the weekend-project architecture:

audio file → chunk into 30s segments → ASR API → concatenate text

This works on a clean podcast recording. It falls apart on real audio for reasons that are obvious in hindsight:

Chunk boundaries split words. Cut at an arbitrary 30-second mark and you'll bisect a word, producing garbage on both sides of the seam. You need voice activity detection (VAD) to find silence and cut there instead.
No punctuation, no casing. Raw output from many ASR models is an unpunctuated lowercase stream. Restoring sentence boundaries is its own model pass.
Who said what? A meeting transcript without speaker labels is close to useless. Speaker diarization, clustering voice embeddings to assign segments to speakers, is a separate and genuinely hard problem. It degrades badly when people talk over each other.
Hallucination on silence. ASR models trained on captioned data will confidently emit text like "thanks for watching" during long silent stretches. You have to filter these.
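To make the first point concrete, here is a minimal sketch of silence-aware chunking. It is not a real VAD: production systems typically use a trained voice activity model, and this swaps in plain frame energy as a stand-in. The function names, the 20 ms frame size, and the 5-second search window are all assumptions chosen for illustration. The idea is to aim for a cut every `target_s` seconds, then move each cut to the quietest frame nearby.

```python
import math
from typing import List, Sequence


def frame_rms(samples: Sequence[float], frame_len: int) -> List[float]:
    """RMS energy of consecutive, non-overlapping frames."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies


def choose_cut_points(
    samples: Sequence[float],
    sample_rate: int,
    target_s: float = 30.0,   # desired chunk length (assumption)
    search_s: float = 5.0,    # how far a cut may move to find quiet (assumption)
    frame_s: float = 0.02,    # 20 ms analysis frames (assumption)
) -> List[int]:
    """Return sample indices to cut at, each snapped to the quietest nearby frame."""
    if search_s >= target_s:
        raise ValueError("search_s must be smaller than target_s")
    frame_len = max(1, round(sample_rate * frame_s))
    target_frames = round(target_s / frame_s)
    search_frames = round(search_s / frame_s)
    energies = frame_rms(samples, frame_len)

    cuts = []
    pos = target_frames  # first nominal cut position, in frames
    while pos < len(energies):
        lo = max(0, pos - search_frames)
        hi = min(len(energies), pos + search_frames + 1)
        # Energy is a crude proxy for "nobody is speaking here".
        quietest = min(range(lo, hi), key=lambda i: energies[i])
        cuts.append(quietest * frame_len)
        pos = quietest + target_frames  # next chunk is measured from this cut
    return cuts
```

Energy alone will happily cut inside a quiet consonant or miss a pause buried under background noise, which is why a trained VAD is the usual choice. The snapping logic stays the same either way.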

Each of these has known solutions in isolation, even if some (diarization especially) are far from easy. Stitching them into a pipeline that returns a clean result in minutes, for a 3-hour file, in 50+ languages, is the actual product.
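As one example of how small each individual fix can be, here is a rough sketch of a filter for the silence-hallucination problem. Everything in it is an assumption made for illustration: the `Segment` shape, the two energy thresholds, and the phrase list are hypothetical, and real systems may rely on model confidence scores instead of raw energy. It drops text emitted over near-silent audio, and stock caption phrases emitted over merely quiet audio.

```python
from dataclasses import dataclass
from typing import List

# Illustrative list only; a real one would be built from observed failures.
GHOST_PHRASES = {"thanks for watching", "thank you for watching"}


@dataclass
class Segment:
    text: str   # ASR output for this segment
    rms: float  # average energy of the segment's audio, 0.0 to 1.0


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def drop_silence_hallucinations(
    segments: List[Segment],
    silence_rms: float = 0.005,  # below this, treat the audio as silent (assumption)
    quiet_rms: float = 0.02,     # below this, stock phrases are suspect (assumption)
) -> List[Segment]:
    kept = []
    for seg in segments:
        if seg.rms < silence_rms:
            continue  # no real signal, so any text is suspect
        if seg.rms < quiet_rms and normalize(seg.text) in GHOST_PHRASES:
            continue  # quiet audio plus a stock caption phrase
        kept.append(seg)
    return kept
```

Note that a speaker who really does say "thanks for watching" at normal volume is kept, since the phrase check only applies to quiet segments.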

The pipeline that production tools run

From observing output quality across several commercial tools, the modern stack looks roughly like this:

ingest → resample/normalize → VAD segmentation
      → ASR (per-segment, batched, GPU)
      → punctuation & casing restoration
      → speaker diarization (parallel path)
      → alignment (merge words + speakers + timestamps)
      → post-processing (LLM: summary, chapters, action items)

A few notes on the interesting parts:

Diarization runs in parallel with ASR, not after it. It only needs the audio, not the words. The merge step afterward aligns word timestamps with speaker segments, and small timestamp errors at that step produce mistakes like a sentence's last word being attributed to the next speaker.
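The merge step can be sketched in a few lines. This is a simplified version that assigns each word to the speaker turn it overlaps most in time. The `Word` and `SpeakerTurn` types are hypothetical, and I'm assuming both the ASR and the diarizer report times in seconds on the same clock.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float


@dataclass
class SpeakerTurn:
    speaker: str
    start: float  # seconds
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(
    words: List[Word], turns: List[SpeakerTurn]
) -> List[Tuple[str, Optional[str]]]:
    """Label each word with the speaker whose turn overlaps it the most."""
    labeled = []
    for w in words:
        best_speaker = None
        best_overlap = 0.0
        for t in turns:
            o = overlap(w.start, w.end, t.start, t.end)
            if o > best_overlap:
                best_overlap, best_speaker = o, t.speaker
        labeled.append((w.text, best_speaker))  # None if no turn overlaps
    return labeled


turns = [SpeakerTurn("A", 0.0, 2.0), SpeakerTurn("B", 2.0, 4.0)]
words = [
    Word("sounds", 1.2, 1.6),
    Word("good", 1.9, 2.3),    # straddles the A/B boundary
    Word("thanks", 2.5, 2.9),
]
print(assign_speakers(words, turns))
```

The word "good" is the failure case: if speaker A actually said it but its timestamp drifted a few hundred milliseconds late, most of it now overlaps B's turn and it lands on the wrong speaker.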

Long files are a throughput problem. A 3-hour meeting is ~10,800 seconds of audio. Sequential processing at even 10x real-time speed means 18 minutes of waiting. Production systems fan segments out across GPU workers and merge results, which is why Vomo can return a multi-hour file in minutes while my sequential script took most of an hour.
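A back-of-the-envelope sketch of that arithmetic, plus the fan-out pattern. `transcribe_segment` is a hypothetical stand-in for a real ASR call, and the worker math assumes ideal scaling with no merge overhead and evenly sized segments, which real systems won't achieve exactly.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def wall_clock_minutes(audio_seconds: float, realtime_factor: float, workers: int = 1) -> float:
    """Idealized processing time: assumes perfect scaling across workers."""
    return audio_seconds / realtime_factor / workers / 60


def transcribe_segment(segment: Tuple[int, bytes]) -> str:
    """Hypothetical stand-in for a per-segment ASR request."""
    index, _audio = segment
    return f"[text for segment {index}]"


def transcribe_parallel(segments: List[Tuple[int, bytes]], workers: int = 8) -> List[str]:
    """Fan segments out to a worker pool; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of the inputs, so the
        # transcript can be concatenated without re-sorting.
        return list(pool.map(transcribe_segment, segments))


print(wall_clock_minutes(10_800, 10))             # sequential, 3-hour file
print(wall_clock_minutes(10_800, 10, workers=8))  # same file across 8 workers
```

Threads fit here because I'm assuming each call waits on a remote GPU service. If inference ran in-process, you'd want separate processes or a proper job queue instead.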

The LLM layer is where products differentiate now. Base transcription accuracy has converged; every credible tool claims accuracy in the 90% range on clean audio (Vomo advertises 95%+). The gap has moved to what you do with the transcript. Vomo's approach is worth studying: it classifies the content type (meeting vs. interview vs. lecture), applies a matching template, and produces timestamped chapters, a summary, and action items in one pass. There's also a Q&A mode where you ask questions against the transcript and get answers grounded in the actual text, which is effectively retrieval over a single document.

The parts nobody blogs about

The unglamorous engineering that separates a demo from a product:

Format handling. Users upload MP3, WAV, M4A, FLAC, AAC, OGG, and then video too (MP4, MKV, MOV, AVI), expecting the audio track extracted silently. That's an ffmpeg layer with a long tail of weird codecs.
Language detection. Supporting 50+ languages means detecting the language before choosing decode parameters, and handling code-switching mid-recording gracefully.
Accents and noise. My test recordings with background noise produced wildly variable results in my naive pipeline. Commercial models are fine-tuned on augmented noisy data; this is a data moat more than an architecture trick.
Encryption and retention. Meeting audio is some of the most sensitive data a company produces. At minimum you need encryption in transit and at rest, user-controlled deletion, and a GDPR story. Vomo checks these boxes; any pipeline you build for real users has to as well.
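For the format-handling item above, the core of the ffmpeg layer is a single normalization command. Here is a minimal sketch that builds and runs it. The 16 kHz mono 16-bit WAV target is an assumption (it's a common ASR input format, but your model may want something else), and the function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def build_extract_command(src: Path, dst: Path, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that extracts and normalizes the audio track."""
    return [
        "ffmpeg",
        "-nostdin",               # never wait for interactive input
        "-y",                     # overwrite the output file if it exists
        "-i", str(src),           # input: any audio or video container
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample
        "-c:a", "pcm_s16le",      # 16-bit PCM
        str(dst),
    ]


def extract_audio(src: Path, dst: Path) -> None:
    """Run ffmpeg; raises CalledProcessError if the file can't be decoded."""
    subprocess.run(build_extract_command(src, dst), check=True, capture_output=True)
```

Passing the arguments as a list rather than a shell string matters here, since uploaded filenames are user input. The long tail of weird codecs shows up as `CalledProcessError`, and what you do with those failures is where the real work is.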

Should you build or buy?

If transcription is your core product: build, obviously. If it's a feature (you want searchable meeting notes inside your app), the economics are brutal. GPU inference, diarization tuning, multilingual eval sets, and the LLM post-processing layer all cost real time to get right.

For personal or team use, the buy math is even simpler. Vomo's Pro tier is $1.92/week for unlimited transcription minutes; the free tier gives you 30 minutes a week to evaluate output quality on your own audio, which is the only benchmark that matters. My weekend project is now a folder of abandoned Python scripts, and honestly, that's the correct outcome.

If you've built diarization or streaming ASR in production, I'd genuinely like to hear what broke first. The comments are open.

Top comments (1)

elboKazQC • Jul 7

Great study. The "thanks for watching" hallucination on silence bit me too. What broke first for me was different though, since I build dictation (push-to-talk, short bursts) rather than long-file transcription: bounding the audio with a keypress makes most of the VAD and silence-hallucination class just disappear. No dead air fed to the model, no ghost captions. Code-switching is where I still bleed though, Quebec French with English dev terms in one sentence, and most models pick one language per segment instead of per word. Did punctuation restoration hold up across your 50+ languages, or degrade outside English?