You shipped the feature last Tuesday. Upload audio, hit transcribe, display text. By Friday your users were complaining about garbled timestamps, missing speaker labels, and a bill that made your CFO flinch. Raw API output isn't enough for production. You need a pipeline.
Most speech-to-text tutorials stop at curl. They don't cover audio preprocessing, model selection, or how to clean up the mess that comes back when three people talk over each other in a Zoom recording. This guide walks through what actually works.
How the sausage gets made
AI transcription isn't a black box you lob files into. It's a chain of decisions. Audio gets normalized, chunked, fed to an acoustic model, and reconstructed into text. Then a language model guesses punctuation and paragraphs. If you want speaker labels, a separate diarization model runs alongside it.
The pipeline looks like this:
1. Audio input and format normalization
2. Chunking and resampling
3. Model inference (ASR)
4. Post-processing (punctuation, formatting)
5. Speaker diarization (optional but painful)
6. Export and storage
Skip step 1 or 2 and you'll pay for step 3 twice.
Step 1: Fix your audio before it hits the API
Developers love to send whatever comes out of the browser's <input type="file"> straight to the cloud. Don't. APIs have preferences, and your users upload garbage.
Standardize on these specs:
- Format: Mono WAV or FLAC. Stereo confuses some models.
- Sample rate: 16 kHz or 24 kHz. Resample if you have to.
- Bit depth: 16-bit PCM. No 32-bit float surprises.
- Preprocessing: Normalize loudness to -16 LUFS. Strip silence longer than 2 seconds if your ASR bills by duration.
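Before converting anything, it can help to know whether an upload already meets the format, sample rate, and bit depth items above. Here is a minimal sketch using only Python's standard wave module. The function name wav_problems and the constants are illustrative, not part of any provider SDK.

```python
import wave
from pathlib import Path

# Target specs taken from the list above.
TARGET_RATES = {16000, 24000}
TARGET_CHANNELS = 1
TARGET_SAMPLE_WIDTH = 2  # bytes per sample, i.e. 16-bit PCM


def wav_problems(path: Path) -> list:
    """Return the reasons a WAV file misses the target specs (empty if none)."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        if wav.getnchannels() != TARGET_CHANNELS:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        if wav.getframerate() not in TARGET_RATES:
            problems.append(f"{wav.getframerate()} Hz sample rate")
        if wav.getsampwidth() != TARGET_SAMPLE_WIDTH:
            problems.append(f"{8 * wav.getsampwidth()}-bit samples, expected 16-bit")
    return problems
```

The wave module only reads PCM WAV files. Anything else (m4a, mp3, float WAV) raises wave.Error, which is a reasonable signal to run the conversion regardless.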
Use ffmpeg. It's ugly but it's everywhere.
ffmpeg -i input.m4a -ar 16000 -ac 1 -sample_fmt s16 output.wav
That one line fixes half the accuracy issues you'll see in production. It converts variable-bitrate user uploads into something the model actually expects.
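The one-liner covers format, sample rate, and bit depth. The loudness and silence items from the spec list need ffmpeg audio filters on top. A sketch of how that command could be assembled in Python follows. The loudnorm and silenceremove filters are real ffmpeg filters, but the function name build_ffmpeg_args and the -50 dB silence threshold are assumptions you should tune against your own recordings.

```python
from pathlib import Path


def build_ffmpeg_args(src: Path, dst: Path, strip_silence: bool = True) -> list:
    """Build an ffmpeg command: mono, 16 kHz, 16-bit PCM, loudness-normalized."""
    # loudnorm targets an integrated loudness of -16 LUFS.
    filters = ["loudnorm=I=-16"]
    if strip_silence:
        # Drop every silent stretch longer than 2 s. The -50 dB threshold is
        # an assumption, not a recommendation from any provider.
        filters.append(
            "silenceremove=stop_periods=-1:stop_duration=2:stop_threshold=-50dB"
        )
    return [
        "ffmpeg", "-y", "-i", str(src),
        "-af", ",".join(filters),
        "-ar", "16000", "-ac", "1", "-sample_fmt", "s16",
        str(dst),
    ]
```

Pass the result to subprocess.run(args, check=True). One caveat: stripping silence shifts every timestamp after a cut relative to the original recording, so skip it (strip_silence=False) if users click timestamps to seek in the untouched file.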
Step 2: Pick the right engine for the job
Not all transcription APIs are the same. They optimize for different things. Here's the honest breakdown:
OpenAI Whisper (API or self-hosted)
- Great accuracy across languages
- Cheap on API, cheaper if you run base or small locally on a CPU
- No native speaker diarization
- Slower than cloud providers on long files
Google Cloud Speech-to-Text
- Excellent real-time streaming via gRPC
- Good speaker diarization (up to 8 speakers in some configs)
- Pricier, especially with premium models like latest_long
- Needs audio in specific encodings (LINEAR16, MULAW, etc.)
AWS Transcribe
- Solid medical and call analytics variants
- Speaker partitioning works but lags behind Google
- Turnaround is batch-oriented; real-time exists but feels bolted on
Deepgram Nova
- Fast. Like, actually fast.
- Good at messy audio (background noise, accents)
- Speaker diarization is decent but may cost extra depending on tier
For most apps, Whisper hits the sweet spot of cost and accuracy. If you need live captions during a WebRTC call, Google Cloud's streaming API is hard to beat.
Step 3: Code a resilient batch pipeline
Here's a minimal Python worker that handles the full flow. It preprocesses with ffmpeg, sends to an API, and structures the output. Swap in your provider of choice.
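import os
import subprocess
import requests
from pathlib import Path

# Read the key from the environment instead of hard-coding it.
# The variable name OPENAI_API_KEY is an assumption; use whatever your deploy sets.
API_KEY = os.environ["OPENAI_API_KEY"]

def normalize_audio(input_path: Path, output_path: Path) -> None:
    # Convert any upload to 16 kHz mono 16-bit PCM WAV
    subprocess.run([
        "ffmpeg", "-y", "-i", str(input_path),
        "-ar", "16000", "-ac", "1", "-sample_fmt", "s16",
        str(output_path)
    ], check=True)

def transcribe_file(audio_path: Path) -> dict:
    with open(audio_path, "rb") as f:
        # Example using Whisper API; swap for Deepgram, AWS, etc.
        response = requests.post(
            "https://api.openai.com/v1/audio/transcriptions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={
                "model": "whisper-1",
                "response_format": "verbose_json",
                "timestamp_granularities[]": "word"
            },
            timeout=600,  # never wait forever on a stalled upload
        )
    response.raise_for_status()
    return response.json()

def process_upload(raw_path: Path) -> dict:
    # Distinct output name, so a .wav upload is never overwritten in place
    clean_path = raw_path.with_name(raw_path.stem + ".normalized.wav")
    normalize_audio(raw_path, clean_path)
    result = transcribe_file(clean_path)

    # Post-process: rebuild transcript with word-level timestamps
    words = result.get("words", [])
    segments = []
    current_segment = None
    for word in words:
        if current_segment is None:
            # First word of a new sentence sets the segment start
            current_segment = {"start": word["start"], "text": ""}
        current_segment["text"] += word["word"].strip() + " "
        current_segment["end"] = word["end"]
        # New sentence heuristic; only works if word tokens carry punctuation
        if word["word"].strip().endswith((".", "!", "?")):
            segments.append(current_segment)
            current_segment = None
    # Keep trailing words that never reached sentence-ending punctuation
    if current_segment is not None:
        segments.append(current_segment)
    for segment in segments:
        segment["text"] = segment["text"].strip()

    return {
        "duration": result.get("duration"),
        "segments": segments,
        "raw": result.get("text")
    }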
Here's what actually matters in that snippet. We ask for verbose_json and word timestamps. That granularity lets you rebuild sentences cleanly instead of accepting a wall of text. We also normalize audio before upload. Don't let users foot the bill for weird codecs.
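The step title promises resilience, and the piece that usually delivers it is a retry around the network call. Below is a small sketch of exponential backoff with jitter. The helper name with_retries and its defaults (4 attempts, 1 second base delay, 25% jitter) are illustrative choices, not values recommended by any provider.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying with exponential backoff and jitter on failure."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # 1x, 2x, 4x ... the base delay, plus up to 25% random jitter
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.25))
    raise ValueError("attempts must be at least 1")
```

You might wrap the upload as with_retries(lambda: transcribe_file(clean_path), retry_on=(requests.ConnectionError, requests.Timeout)). Narrowing retry_on matters: a 400 for a malformed file will fail the same way every time, so retrying it only burns time.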
Step 4: Add speaker labels without losing your mind
Speaker diarization is still the hardest part of transcription. Most APIs that offer it charge more, and the accuracy drops when speakers interrupt each other.
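If your provider supports it, enable it at the API level. If not, you'll need a separate model like pyannote.audio. When each speaker is recorded on their own channel, channel-based splitting such as AWS Transcribe's channel identification sidesteps diarization entirely.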
A cheap heuristic that works for interviews: force single-channel audio and ask the API to partition speakers. If that fails, fall back to a secondary diarization pass on the normalized file.
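# Dual-pass sketch. run_pyannote is a placeholder for your own wrapper; it is
# assumed to return a list of (start, end, speaker) tuples.
transcript = transcribe_file(clean_path)
turns = run_pyannote(clean_path)

def speaker_at(t):
    # Return the speaker whose turn contains time t, or None during gaps
    for start, end, speaker in turns:
        if start <= t < end:
            return speaker
    return None

# Merge word timestamps with speaker turns
for word in transcript.get("words", []):
    word["speaker"] = speaker_at(word["start"])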
It's not perfect. You'll still need manual review for content that ships to customers. But it gets you 90% of the way there.
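Looking up the speaker at a word's start time is the simplest merge, but it mislabels words that straddle an interruption. A slightly sturdier sketch picks the turn that overlaps each word the most, then collapses consecutive words into speaker segments. It assumes the diarization output has been flattened to (start, end, speaker) tuples and that words are dicts with word, start, and end keys. Both function names are hypothetical.

```python
def assign_speakers(words: list, turns: list) -> list:
    """Label each word with the speaker whose turn overlaps it the most."""
    for word in words:
        best_speaker, best_overlap = None, 0.0
        for start, end, speaker in turns:
            overlap = min(word["end"], end) - max(word["start"], start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        word["speaker"] = best_speaker  # None if no turn overlaps the word
    return words


def group_by_speaker(words: list) -> list:
    """Collapse consecutive words from one speaker into a single segment."""
    segments = []
    for word in words:
        last = segments[-1] if segments else None
        if last is not None and last["speaker"] == word.get("speaker"):
            # Same speaker keeps talking: extend the open segment
            last["end"] = word["end"]
            last["text"] += " " + word["word"].strip()
        else:
            segments.append({
                "speaker": word.get("speaker"),
                "start": word["start"],
                "end": word["end"],
                "text": word["word"].strip(),
            })
    return segments
```

The nested loop is quadratic in the worst case. For hour-long recordings you would likely sort the turns and walk both lists in step, but for interviews this is usually fast enough.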
Step 5: Format for humans, not machines
Nobody wants a raw JSON dump. Your end users want paragraphs, timestamps they can click, and speaker names.
Structure your final output like this:
{
  "segments": [
    {
      "speaker": "A",
      "start": 0.0,
      "end": 4.5,
      "text": "The API returns words, but humans read sentences."
    }
  ]
}
Export to SRT if you're building subtitles:
1
00:00:00,000 --> 00:00:04,500
The API returns words, but humans read sentences.
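Turning the segment structure into SRT is mostly timestamp formatting. Here is a minimal sketch, assuming each segment is a dict with start, end, and text keys as in the JSON above. The function names and the label_speakers flag are illustrative.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp form SRT uses."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list, label_speakers: bool = False) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        if label_speakers and seg.get("speaker"):
            text = f"{seg['speaker']}: {text}"
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{text}\n"
        )
    return "\n".join(cues)
```

Working in whole milliseconds before splitting into hours, minutes, and seconds avoids float rounding producing a cue like 00:00:59,1000.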
And always store the raw API response. When a user reports an error, you'll want to replay it without burning more credits.
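One way to make that replay cheap is to key the stored response on a hash of the audio bytes, so the same file never hits the API twice. The sketch below writes to local disk. The function names and the one-JSON-file-per-hash layout are assumptions, and in production you would probably point this at object storage instead.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def audio_fingerprint(audio_path: Path) -> str:
    """SHA-256 of the audio bytes, used as a stable storage key."""
    digest = hashlib.sha256()
    with open(audio_path, "rb") as f:
        # Read in 1 MiB blocks so large recordings never load fully into memory
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def save_raw_response(audio_path: Path, response: dict, store_dir: Path) -> Path:
    """Write the untouched API response next to its audio fingerprint."""
    store_dir.mkdir(parents=True, exist_ok=True)
    out_path = store_dir / f"{audio_fingerprint(audio_path)}.json"
    out_path.write_text(json.dumps(response, ensure_ascii=False), encoding="utf-8")
    return out_path


def load_raw_response(audio_path: Path, store_dir: Path) -> Optional[dict]:
    """Return the stored response for this audio, or None if never transcribed."""
    path = store_dir / f"{audio_fingerprint(audio_path)}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

Check load_raw_response before calling the API, and rerun your post-processing against the stored JSON whenever the segment logic changes.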
The tradeoff framework you actually need
You'll face three knobs in production. Here's how to turn them.
Speed vs. accuracy
Fast modes use smaller models. Use them for search indexing and internal notes. Use best-quality models for customer-facing captions and compliance logs.
Cost vs. precision
Batch processing is cheaper per minute than real-time. If you don't need live captions, don't pay for streaming. Reserve premium engines (Google's latest_long, Nova-2) for your highest-value content.
Speaker labels vs. complexity
Don't enable diarization unless someone reads the labels. If it's just a giant blob of text for full-text search, skip it. You'll save money and processing time.
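The three knobs reduce to three yes/no questions about each job, which makes them easy to encode once instead of re-deciding per feature. A toy sketch follows. The names TranscriptionPlan and plan_job and the tier labels are made up for illustration, and mapping them to a specific provider and model is left to you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscriptionPlan:
    mode: str          # "streaming" or "batch"
    model_tier: str    # "fast" or "best"
    diarization: bool  # whether to pay for speaker labels


def plan_job(needs_live_captions: bool, customer_facing: bool,
             labels_are_read: bool) -> TranscriptionPlan:
    """Map the three tradeoffs above onto one job configuration."""
    return TranscriptionPlan(
        # Cost vs. precision: only pay for streaming when captions are live
        mode="streaming" if needs_live_captions else "batch",
        # Speed vs. accuracy: best-quality models for customer-facing output
        model_tier="best" if customer_facing else "fast",
        # Speaker labels vs. complexity: skip diarization nobody reads
        diarization=labels_are_read,
    )
```

For example, a search-indexing job would be plan_job(False, False, False): batch mode, fast model, no diarization.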
Common gotchas
- Timestamps drift on long files over 30 minutes. Chunk at 10-minute boundaries if your API allows it.
- Code switching (mixing languages in one file) breaks most monolingual models. Split by language if possible.
- Profanity filters in enterprise APIs will asterisk out words in medical or legal transcripts. Disable them if your provider lets you.
- WebRTC audio is often sampled at 48 kHz stereo. Downsample before sending.
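Chunking at 10-minute boundaries has a second half that is easy to forget: each chunk's timestamps restart at zero, so they must be shifted by the chunk's offset when you stitch results together. A sketch of both halves, with hypothetical function names and word dicts shaped like the ones used earlier:

```python
def plan_chunks(duration: float, chunk_seconds: float = 600.0) -> list:
    """Split [0, duration) into consecutive (start, end) windows in seconds."""
    chunks = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_seconds, duration)
        chunks.append((start, end))
        start = end
    return chunks


def merge_chunk_words(chunk_results: list) -> list:
    """Shift each chunk's word timestamps by its offset and concatenate.

    chunk_results is a list of (offset_seconds, words) pairs in file order.
    """
    merged = []
    for offset, words in chunk_results:
        for word in words:
            merged.append({
                **word,
                "start": word["start"] + offset,
                "end": word["end"] + offset,
            })
    return merged
```

As a size check, 16 kHz mono 16-bit WAV runs at 32,000 bytes per second, so a 10-minute chunk is about 19.2 MB. That fits under the 25 MB upload cap OpenAI's audio endpoint had at the time of writing, but verify the limit for your provider. Hard cuts can also split a word in half, so cutting at a nearby silence tends to work better if you can detect one.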
Wrapping up
Building with AI transcription isn't hard. Building it so it doesn't break in production is. Preprocess your audio, pick an engine that matches your latency budget, and post-process the output into something readable. Treat the API like a component, not a magic wand.
Your users will thank you. Your wallet will too.