Toshius Klay

Posted on Jul 3

Build a Reliable AI Transcription Pipeline: A Developer’s Field Guide

#speechtotext #ai #api #audioprocessing

You shipped the feature last Tuesday. Upload audio, hit transcribe, display text. By Friday your users were complaining about garbled timestamps, missing speaker labels, and a bill that made your CFO flinch. Raw API output isn't enough for production. You need a pipeline.

Most speech-to-text tutorials stop at curl. They don't cover audio preprocessing, model selection, or how to clean up the mess that comes back when three people talk over each other in a Zoom recording. This guide walks through what actually works.

How the sausage gets made

AI transcription isn't a black box you lob files into. It's a chain of decisions. Audio gets normalized, chunked, fed to an acoustic model, and reconstructed into text. Then a language model guesses punctuation and paragraphs. If you want speaker labels, a separate diarization model runs alongside it.

The pipeline looks like this:

1. Audio input and format normalization
2. Chunking and resampling
3. Model inference (ASR)
4. Post-processing (punctuation, formatting)
5. Speaker diarization (optional but painful)
6. Export and storage

Skip step 1 or 2 and you'll pay for step 3 twice.

Step 1: Fix your audio before it hits the API

Developers love to send whatever comes out of the browser's <input type="file"> straight to the cloud. Don't. APIs have preferences, and your users upload garbage.

Standardize on these specs:

Format: Mono WAV or FLAC. Stereo confuses some models.
Sample rate: 16 kHz or 24 kHz. Resample if you have to.
Bit depth: 16-bit PCM. No 32-bit float surprises.
Preprocessing: Normalize loudness to -16 LUFS. Strip silence longer than 2 seconds if your ASR bills by duration.

Use ffmpeg. It's ugly but it's everywhere.

ffmpeg -i input.m4a -ar 16000 -ac 1 -sample_fmt s16 output.wav

That one line fixes half the accuracy issues you'll see in production. It converts variable-bitrate user uploads into something the model actually expects.
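The one-liner covers sample rate, channels, and bit depth, but not the loudness or silence items from the spec list. Here is a minimal sketch of a command builder that adds them, assuming ffmpeg's `loudnorm` and `silenceremove` filters. The function name `build_ffmpeg_cmd`, the `strip_silence` flag, and the -50 dB threshold are illustrative choices, not values from any provider's documentation.

```python
from pathlib import Path


def build_ffmpeg_cmd(src: Path, dst: Path, strip_silence: bool = False) -> list[str]:
    """Build an ffmpeg command: 16 kHz, mono, 16-bit PCM, loudness near -16 LUFS."""
    # loudnorm is ffmpeg's EBU R128 loudness filter; I=-16 sets the integrated target.
    filters = ["loudnorm=I=-16:TP=-1.5:LRA=11"]
    if strip_silence:
        # Assumed settings: treat audio below -50 dB lasting 2 s or more as silence.
        # Trimming runs before loudness normalization.
        filters.insert(
            0, "silenceremove=stop_periods=-1:stop_duration=2:stop_threshold=-50dB"
        )
    return [
        "ffmpeg", "-y", "-i", str(src),
        "-af", ",".join(filters),
        "-ar", "16000", "-ac", "1", "-sample_fmt", "s16",
        str(dst),
    ]
```

Pass the returned list to `subprocess.run(cmd, check=True)`. Keep in mind that stripping silence shortens the file, so timestamps from the API no longer line up with the original recording. Leave `strip_silence` off if users need to click through to the source audio.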

Step 2: Pick the right engine for the job

Not all transcription APIs are the same. They optimize for different things. Here's the honest breakdown:

OpenAI Whisper (API or self-hosted)

Great accuracy across languages
Cheap via the API, cheaper if you run base or small locally on a CPU
No native speaker diarization
Slower than cloud providers on long files

Google Cloud Speech-to-Text

Excellent real-time streaming via gRPC
Good speaker diarization (up to 8 speakers in some configs)
Pricier, especially with premium models like latest_long
Expects audio in specific encodings (LINEAR16, MULAW, etc.)

AWS Transcribe

Solid medical and call analytics variants
Speaker partitioning works but lags behind Google
Turnaround is batch-oriented; real-time exists but feels bolted on

Deepgram Nova

Fast. Like, actually fast.
Good at messy audio (background noise, accents)
Speaker diarization is decent but costs extra

For most apps, Whisper hits the sweet spot of cost and accuracy. If you need live captions during a WebRTC call, Google Cloud's streaming API is hard to beat.

Step 3: Code a resilient batch pipeline

Here's a minimal Python worker that handles the full flow. It preprocesses with ffmpeg, sends to an API, and structures the output. Swap in your provider of choice.

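import os
import subprocess
import json
import requests
from pathlib import Path

# Assumes the key is exported as OPENAI_API_KEY; never hard-code it
API_KEY = os.environ["OPENAI_API_KEY"]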

def normalize_audio(input_path: Path, output_path: Path) -> None:
    subprocess.run([
        "ffmpeg", "-y", "-i", str(input_path),
        "-ar", "16000", "-ac", "1", "-sample_fmt", "s16",
        str(output_path)
    ], check=True)

def transcribe_file(audio_path: Path) -> dict:
    with open(audio_path, "rb") as f:
        # Example using Whisper API; swap for Deepgram, AWS, etc.
        response = requests.post(
            "https://api.openai.com/v1/audio/transcriptions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={
                "model": "whisper-1",
                "response_format": "verbose_json",
                "timestamp_granularities[]": "word"
            },
            # Fail instead of hanging forever on a stalled upload
            timeout=300
        )
        response.raise_for_status()
        return response.json()

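def process_upload(raw_path: Path) -> dict:
    # Write to a distinct name so a .wav upload is never overwritten in place
    clean_path = raw_path.with_name(raw_path.stem + ".normalized.wav")
    normalize_audio(raw_path, clean_path)

    result = transcribe_file(clean_path)

    # Post-process: rebuild sentences from word-level timestamps
    words = result.get("words", [])
    segments = []
    current = None

    for word in words:
        if current is None:
            # First word of a new sentence sets the segment start
            current = {"start": word["start"], "text": ""}
        current["text"] += word["word"] + " "
        current["end"] = word["end"]
        # New sentence heuristic; only fires if word entries keep punctuation
        if word["word"].endswith((".", "!", "?")):
            current["text"] = current["text"].strip()
            segments.append(current)
            current = None

    # Flush trailing words that never reached terminal punctuation
    if current is not None:
        current["text"] = current["text"].strip()
        segments.append(current)

    return {
        "duration": result.get("duration"),
        "segments": segments,
        "raw": result.get("text")
    }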

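Here's what actually matters in that snippet. We ask for verbose_json and word timestamps. That granularity lets you rebuild sentences cleanly instead of accepting a wall of text. One caveat: if your provider returns word entries with punctuation stripped, the sentence heuristic never fires, so inspect a real response before relying on it. We also normalize audio before upload, so you never pay to transcribe a codec the model handles badly.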
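The worker above makes a single attempt per file, so a dropped connection loses the whole job. One common way to harden it is retrying with exponential backoff. This is a generic sketch, not part of any provider's SDK: `with_retries`, its parameters, and the 10% jitter are all assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    retry_on: tuple[type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait 1x, 2x, 4x ... the base delay, plus up to 10% random jitter
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("attempts must be at least 1")
```

With the worker above, a call might look like `with_retries(lambda: transcribe_file(clean_path), retry_on=(requests.ConnectionError, requests.Timeout))`. Narrowing `retry_on` matters: retrying a 400 caused by a bad file just spends time and credits on a request that will fail again.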

Step 4: Add speaker labels without losing your mind

Speaker diarization is still the hardest part of transcription. Most APIs that offer it charge more, and the accuracy drops when speakers interrupt each other.

If your provider supports it, enable it at the API level. If not, you'll need a separate model like pyannote.audio. When each speaker is recorded on their own channel, AWS Transcribe's channel identification is another route.

A cheap heuristic that works for interviews: force single-channel audio and ask the API to partition speakers. If that fails, fall back to a secondary diarization pass on the normalized file.

# Pseudo-code for dual-pass pipeline
transcript = transcribe_file(clean_path)
diarization = run_pyannote(clean_path)

# Merge word timestamps with speaker segments
for word in transcript["words"]:
    speaker = diarization.find_speaker_at(word["start"])
    word["speaker"] = speaker

It's not perfect. You'll still need manual review for content that ships to customers. But it gets you 90% of the way there.
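The merge step in the pseudo-code can be made concrete without tying it to a particular diarization library. The sketch below assumes you have already converted the diarization output into plain `(start, end, speaker)` tuples; `find_speaker_at` mirrors the pseudo-code's name, while `label_words` and the nearest-turn fallback are my own additions.

```python
from typing import Optional

# A speaker turn is (start_seconds, end_seconds, speaker_label).
Turn = tuple[float, float, str]


def find_speaker_at(turns: list[Turn], t: float) -> Optional[str]:
    """Return the speaker whose turn contains t, else the nearest turn's speaker."""
    if not turns:
        return None
    for start, end, speaker in turns:
        if start <= t < end:
            return speaker  # with overlapping turns, the first match wins
    # t falls in a gap between turns: pick the turn with the closest boundary
    nearest = min(turns, key=lambda turn: min(abs(t - turn[0]), abs(t - turn[1])))
    return nearest[2]


def label_words(words: list[dict], turns: list[Turn]) -> list[dict]:
    """Return copies of the words with a "speaker" key, matched on word midpoint."""
    labeled = []
    for word in words:
        midpoint = (word["start"] + word["end"]) / 2
        labeled.append({**word, "speaker": find_speaker_at(turns, midpoint)})
    return labeled
```

Matching on the midpoint rather than the start time is a small design choice: a word that begins a few milliseconds before a speaker change is less likely to be pinned on the previous speaker. The linear scan is fine for a single meeting; for hours of audio you would want a sorted lookup.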

Step 5: Format for humans, not machines

Nobody wants a raw JSON dump. Your end users want paragraphs, timestamps they can click, and speaker names.

Structure your final output like this:

{
  "segments": [
    {
      "speaker": "A",
      "start": 0.0,
      "end": 4.5,
      "text": "The API returns words, but humans read sentences."
    }
  ]
}
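Getting from speaker-labeled words to that shape is mostly a grouping problem. The sketch below starts a new segment whenever the speaker changes or the pause between words exceeds a threshold. The function name `build_segments` and the 1-second `max_gap` default are assumptions to tune against your own recordings.

```python
def build_segments(words: list[dict], max_gap: float = 1.0) -> list[dict]:
    """Group speaker-labeled words; split on speaker change or a long pause."""
    segments: list[dict] = []
    for word in words:
        last = segments[-1] if segments else None
        same_speaker = last is not None and last["speaker"] == word.get("speaker")
        if same_speaker and word["start"] - last["end"] <= max_gap:
            # Same speaker, short pause: extend the current segment
            last["text"] += " " + word["word"].strip()
            last["end"] = word["end"]
        else:
            segments.append({
                "speaker": word.get("speaker"),
                "start": word["start"],
                "end": word["end"],
                "text": word["word"].strip(),
            })
    return segments
```

Because it splits on pauses rather than punctuation, this also works when the provider returns bare words with no sentence marks.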

Export to SRT if you're building subtitles:

1
00:00:00,000 --> 00:00:04,500
The API returns words, but humans read sentences.
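Producing that cue from the segment structure takes only a timestamp formatter and a loop. A minimal sketch follows, assuming segments carry `start`, `end`, and `text` keys as in the JSON above. Prefixing the speaker label to the cue text is one possible convention, not something the SRT format requires.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[dict]) -> str:
    """Render segments as SRT cues, numbered from 1 and separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        if seg.get("speaker"):
            # Assumed convention: show the speaker label before the line
            text = f'{seg["speaker"]}: {text}'
        timing = f'{srt_timestamp(seg["start"])} --> {srt_timestamp(seg["end"])}'
        cues.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(cues)
```

Working in integer milliseconds avoids the float rounding surprises you get when formatting seconds and fractions separately.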

And always store the raw API response. When a user reports an error, you'll want to replay it without burning more credits.
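One simple way to do that is to key the stored response on a hash of the audio bytes, so the same upload never hits the API twice. This sketch writes JSON files to a local directory. The names `transcribe_cached` and `cache_dir` are hypothetical, and in production you would likely swap the directory for object storage or a database.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def file_sha256(path: Path) -> str:
    """Hash the audio bytes so identical uploads map to the same cache entry."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB blocks to keep memory flat on long recordings
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def transcribe_cached(
    audio_path: Path, cache_dir: Path, transcribe: Callable[[Path], dict]
) -> dict:
    """Return the stored raw response if present; otherwise call once and store it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{file_sha256(audio_path)}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    raw = transcribe(audio_path)
    cache_file.write_text(json.dumps(raw, ensure_ascii=False), encoding="utf-8")
    return raw
```

Pass the `transcribe_file` function from Step 3 as the `transcribe` argument. Since post-processing runs on the stored response, you can change your segmentation logic and re-render old transcripts for free.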

The tradeoff framework you actually need

You'll face three knobs in production. Here's how to turn them.

Speed vs. accuracy
Fast modes use smaller models. Use them for search indexing and internal notes. Use best-quality models for customer-facing captions and compliance logs.

Cost vs. precision
Batch processing is cheaper per minute than real-time. If you don't need live captions, don't pay for streaming. Reserve premium engines (Google's latest_long, Nova-2) for your highest-value content.

Speaker labels vs. complexity
Don't enable diarization unless someone reads the labels. If it's just a giant blob of text for full-text search, skip it. You'll save money and processing time.

Common gotchas

Timestamps can drift on files longer than 30 minutes. Chunk at 10-minute boundaries if your API allows it.
Code switching (mixing languages in one file) breaks most monolingual models. Split by language if possible.
Profanity filters in enterprise APIs will asterisk out words in medical or legal transcripts. Disable them if your provider lets you.
WebRTC audio is often sampled at 48 kHz stereo. Downsample before sending.
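Chunking fixes drift but creates a new chore: each chunk's timestamps restart at zero. A rough sketch of both halves follows, assuming ffmpeg's `segment` muxer for the split and fixed-length chunks for the offset math. `split_cmd` and `stitch_words` are illustrative names. For scale, a 10-minute chunk at 16 kHz mono 16-bit is 600 s × 32,000 bytes/s, or about 19.2 MB.

```python
from pathlib import Path

CHUNK_SECONDS = 600  # 10-minute chunks, as suggested in the gotchas list


def split_cmd(src: Path, out_pattern: str, chunk_seconds: int = CHUNK_SECONDS) -> list[str]:
    """Build an ffmpeg command that cuts src into fixed-length chunks."""
    return [
        "ffmpeg", "-y", "-i", str(src),
        "-f", "segment", "-segment_time", str(chunk_seconds),
        "-c", "copy", out_pattern,
    ]


def stitch_words(
    chunk_words: list[list[dict]], chunk_seconds: int = CHUNK_SECONDS
) -> list[dict]:
    """Shift each chunk's word timestamps by that chunk's offset in the full file."""
    stitched = []
    for index, words in enumerate(chunk_words):
        # Assumes every chunk before this one is exactly chunk_seconds long
        offset = index * chunk_seconds
        for word in words:
            stitched.append(
                {**word, "start": word["start"] + offset, "end": word["end"] + offset}
            )
    return stitched
```

A pattern such as `chunk_%03d.wav` yields numbered files in order. Fixed boundaries can cut a word in half, so for anything customer-facing it is worth measuring real chunk durations and splitting at a silence near the boundary instead.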

Wrapping up

Building with AI transcription isn't hard. Building it so it doesn't break in production is. Preprocess your audio, pick an engine that matches your latency budget, and post-process the output into something readable. Treat the API like a component, not a magic wand.

Your users will thank you. Your wallet will too.

Top comments (1)

elboKazQC • Jul 7

Good field guide, the ffmpeg normalize line alone saves people days. One push-back on the code-switching gotcha: "split by language if possible" assumes the switch happens in blocks. In my case, dictation in Quebec French with English dev terms, it happens mid-sentence, word by word ("git commit", "useState", "le backend" inside one French clause), so there is no clean seam to split on. What moved the needle was model size: faster-whisper small was the floor, tiny and base collapse English tokens into phonetic French. Have you seen any monolingual model handle true intra-sentence switching, or is that still a wall?