DEV Community: Toshius Klay

A Developer Workflow for Turning Podcast Episodes into Markdown Blog Posts

Toshius Klay — Tue, 14 Jul 2026 17:35:03 +0000

Transcribing my first podcast episode by hand took four hours. My fingers hurt. The draft was full of "[awkward laugh]" and sentences that went nowhere. I almost quit before I built a pipeline that does the hard work for me.

Now I go from WAV file to published Markdown without typing every "um" and "you know." Here's the exact workflow I use.

Step 1: Extract Clean Mono Audio

Transcription models are picky. Stereo files confuse them. Low sample rates make them hallucinate words. I learned this after Whisper turned a quiet guest channel into nonsense punctuation.

Before you send anything to the transcriber, strip silence and convert to 16 kHz mono. If you recorded a video, pull the audio track first.

ffmpeg -i raw_recording.mp3 -ar 16000 -ac 1 -af "silenceremove=start_periods=1:start_duration=0.5:start_threshold=-50dB" episode_clean.wav

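That command forces 16 kHz, collapses to one channel, and trims the silence at the start of the file. To also cut mid-episode dead air longer than half a second, add stop_periods=-1:stop_duration=0.5:stop_threshold=-50dB to the filter. That variant saves you from reading "[long pause]" every thirty seconds, but the transcript timestamps will no longer line up with the original recording. I wish I'd known this sooner.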

Step 2: Transcribe with Whisper and Export JSON

I run Whisper locally because I batch episodes and I don't trust cloud services with unreleased content. JSON output is the only format I use. Plain text dumps lose timestamps, and timestamps are your lifeline when you need to verify a technical term at 00:47:12.

whisper episode_clean.wav \
  --model medium \
  --language English \
  --output_format json \
  --output_dir ./transcripts/

No GPU? The tiny model still works fine for clean speech. You'll get segments, start and end times, and per-segment log-probabilities (Whisper's closest thing to a confidence score) in a file named episode_clean.json.
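A minimal sketch of how the per-segment fields in that JSON can point you at passages worth re-listening to. It assumes the avg_logprob and no_speech_prob keys that Whisper writes for each segment. The function name is hypothetical, and the cutoffs mirror Whisper's default decoding thresholds, so treat them as starting points rather than tuned values.

```python
def flag_uncertain(segments, logprob_floor=-1.0, no_speech_ceiling=0.6):
    """Return segments worth re-checking, based on Whisper's per-segment stats."""
    flagged = []
    for seg in segments:
        low_confidence = seg.get("avg_logprob", 0.0) < logprob_floor
        maybe_silence = seg.get("no_speech_prob", 0.0) > no_speech_ceiling
        if low_confidence or maybe_silence:
            flagged.append(seg)
    return flagged


# Hand-made sample data in the shape of Whisper's "segments" list
sample = [
    {"start": 0.0, "text": " Welcome back.", "avg_logprob": -0.21, "no_speech_prob": 0.01},
    {"start": 4.2, "text": " the N plus one quarry", "avg_logprob": -1.35, "no_speech_prob": 0.02},
    {"start": 9.8, "text": " ...", "avg_logprob": -0.40, "no_speech_prob": 0.83},
]

for seg in flag_uncertain(sample):
    print(f"{seg['start']:.1f}s {seg['text'].strip()}")
```

In a real run you would pass data["segments"] from the JSON file instead of the sample list.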

Step 3: Build a Searchable Draft Script

Raw JSON is unreadable. I wrote a five-minute script to flatten it into a draft I can actually scan.

import json

with open("transcripts/episode_clean.json") as f:
    data = json.load(f)

with open("draft.md", "w") as out:
    for seg in data["segments"]:
        # Format as HH:MM:SS so long episodes stay readable
        total = int(seg["start"])
        hours, minutes, seconds = total // 3600, (total % 3600) // 60, total % 60
        out.write(f"[{hours:02d}:{minutes:02d}:{seconds:02d}] {seg['text'].strip()}\n")

Run it once and you get timestamps next to every line. Suddenly you can spot the exact moment your co-host said the thing about database locks.

[00:03:42] The real issue wasn't the database. It was the N+1 query we introduced in the last deploy.
[00:04:15] We didn't catch it in staging because the dataset there is tiny...

Now you can scan, delete fluff, and annotate without touching the original audio.

Step 4: Outline Before You Write

A blog post is not a transcript. I used to paste chunks of dialogue and wonder why readers bounced. Turns out people skim. They want a problem, a fix, and proof. Before I write a single paragraph, I tag the draft with structural markers.

## Outline

- Hook: The production outage on Tuesday
- Context: Why standard ORM patterns failed us
- Technical breakdown: Query analysis and indexing
- The fix: Cursor pagination and batching
- Takeaways: Three rules for ORM performance

Once the skeleton is solid, the rewrite is fast. You're connecting dots instead of inventing structure at midnight.

Step 5: Rewrite Speech into Scannable Prose

Spoken language is loose. You'll find filler words, fragments, and hedges like "kind of" and "sort of." Your job is to tighten them without making the speaker sound like a press release.

Before:

So, yeah, basically what happened was we, uh, we pushed this change and then, like, the latency just totally exploded and we were like, okay, this is bad.

After:

We pushed the change. Latency exploded immediately. We knew we had a problem.

I used to over-edit and kill the voice. Don't do that. Keep the honesty. Drop the noise. Break long thoughts into short paragraphs. Use bold for the terms a reader is scanning for. If two speakers debated a point, turn it into a comparison list. It reads faster than dialogue formatting.

Step 6: Add Dev.to-Ready Frontmatter and Metadata

Before I paste into the CMS, I lock down metadata. Good frontmatter makes the post discoverable and keeps my formatting consistent across platforms.

---
title: "How an N+1 Query Took Down Our API"
published: false
tags: database, performance, backend, podcast
canonical_url: https://yourpodcast.com/episodes/42
---

Use your primary keyword in the first 100 words and in at least one H2. Keep the slug readable. If you reference a repo or tool, drop an inline link. Internal links to related episodes keep people around.

Dev.to renders your title as the H1, so start the body with H2. Nest H3 under it. Don't skip levels. Search engines notice, and screen readers break.
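The heading rules are easy to check mechanically before you paste anything. This is a rough sketch, not a Markdown parser: check_headings is a hypothetical helper that only understands ATX headings (lines starting with #) and skips anything inside a code fence.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+\S")
FENCE = "`" * 3  # code fence marker, built this way to keep the example copy-pasteable


def check_headings(markdown_body):
    """Return a list of problems: H1s in the body and skipped heading levels."""
    problems = []
    previous = 1  # the platform renders the title as the H1
    in_fence = False
    for number, line in enumerate(markdown_body.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # ignore '#' comments inside code fences
            continue
        match = HEADING.match(line)
        if in_fence or not match:
            continue
        level = len(match.group(1))
        if level == 1:
            problems.append(f"line {number}: H1 in body")
        elif level > previous + 1:
            problems.append(f"line {number}: jumps from H{previous} to H{level}")
        previous = level
    return problems


demo = "## The outage\n\n#### Query plan\n\n## The fix\n"
for problem in check_headings(demo):
    print(problem)
```

Run it on the post body (everything after the frontmatter) and fix whatever it reports before previewing.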

One trick I stole from smarter writers: embed a short audio player or a timestamped link back to the original episode. Some people want the full tone. Let them jump to the minute they need.

Step 7: Final Polish and Publish

I store drafts in Git, so I run everything through a formatter before publishing. Insane line lengths make diffs unreadable.

npx prettier --prose-wrap always --write post.md

Preview on Dev.to before you hit publish. Check that code blocks scroll and headings nest. Audio content is rich. Written content is searchable. Combine the two, and one recording keeps working while you sleep.

Transcribe Audio to Text Like a Developer: From File to Final Text

Toshius Klay — Tue, 14 Jul 2026 17:24:49 +0000

You've got a two-hour architecture review sitting in your Downloads folder. Last month, mine was a 94-minute mess where three engineers talked over each other to debug a stuck Kafka consumer. I needed that in text. Not because I enjoy typing, but because text is searchable, diff-able, and way easier to share than a raw MP3. Manual transcription burns hours you don't have. Automated tools get you 90% there, but the last 10% is what separates a usable transcript from word salad.

For developers, text wins over audio every time. You can't grep a WAV file. You can't copy a function name from a podcast without listening to fifteen minutes of chatter. Once audio is text, you can index it, run diffs, feed it to an LLM, or generate subtitles. A single recording turns into searchable meeting notes, blog drafts, or captioned tutorials. It also makes content accessible to screen readers and non-native speakers.

Before you upload anything, pick your approach. There are three realistic options.

Automated APIs or local models. Tools like OpenAI Whisper running locally, or cloud speech APIs. Fast, cheap, and fine for clear audio with minimal crosstalk.
Manual transcription. You listen, you type. Still the best choice for courtroom-level accuracy or audio recorded in a crowded hallway.
Hybrid. Automated first, then a quick editorial pass. This is the default for technical content and the workflow I'll break down below.

If your recording is a screen share, a meeting, or a solo voice memo, go automated. If two people shouted over each other in a cafe, manual might be faster than fixing chaos.

Step-by-step workflow

1. Prepare the audio

Garbage in, garbage out. Models struggle with low bitrate, stereo separation, and background hum. Normalize your file before you process it.

Most tools accept MP3, WAV, M4A, MP4, and OGG. For speech recognition, mono WAV at 16 kHz is the safest bet. Use ffmpeg to clean things up:

# Convert to 16kHz mono WAV
ffmpeg -i input.mp4 -ar 16000 -ac 1 -c:a pcm_s16le cleaned.wav

# Strip leading silence to save processing time
ffmpeg -i cleaned.wav -af "silenceremove=start_periods=1:start_duration=0.5:start_threshold=-50dB" final.wav

Check your levels. If the waveform looks like a flat line or a brick wall, fix it before you waste a transcription run. I learned this the hard way after Whisper produced five minutes of hallucinated text from a track that was only audible in the left stereo channel. Mono fixes that.
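If you would rather check levels in a script than squint at a waveform, something like this works on the 16-bit WAV that ffmpeg just produced. It is a sketch using only the standard library. The helper name is made up, it reads the whole file into memory, and the rule of thumb after it is my assumption, not a standard.

```python
import array
import math
import sys
import wave


def peak_and_rms_dbfs(path):
    """Return (peak, rms) in dBFS for a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM")
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if not samples:
        raise ValueError("file contains no audio frames")

    def to_db(value):
        return 20 * math.log10(value) if value > 0 else float("-inf")

    full_scale = 32768.0
    peak = max(abs(s) for s in samples) / full_scale
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / full_scale
    return to_db(peak), to_db(rms)


# Usage (path is an example):
# peak, rms = peak_and_rms_dbfs("final.wav")
# print(f"peak {peak:.1f} dBFS, rms {rms:.1f} dBFS")
```

As a rough guide, a peak sitting at 0 dBFS suggests clipping (the brick wall) and an RMS far below about -40 dBFS suggests a nearly silent track (the flat line).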

2. Generate the raw transcript

You have two practical paths. Run Whisper locally if you care about privacy and cost. Hit an API if you want zero setup.

Local Python with Whisper:

import whisper

model = whisper.load_model("base")  # or "small", "medium", "large"
result = model.transcribe("final.wav")
print(result["text"])

Cloud API route:

from openai import OpenAI

client = OpenAI()
# verbose_json includes segments with timestamps; plain "json" returns text only
with open("final.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json"
    )
print(transcript.text)

With the base model, a thirty-minute file usually finishes in under a minute locally on a modern laptop with a decent GPU. CPU-only runs and larger models take longer. Cloud is roughly real-time or faster.

I used to run base for everything. Then I watched it turn "Kubernetes deployment" into "coorneddies deployment" during a mumbled standup. For anything with proper nouns, CLI commands, or made-up product names, medium or large is worth the extra seconds.

3. Capture structure

Don't just dump the text string. The JSON output contains segments with start and end times. You will need those for subtitles or for referencing specific moments.

segments = result["segments"]  # local Whisper; API clients may return objects, so use seg.start instead of seg["start"]

for seg in segments:
    start = seg["start"]
    end = seg["end"]
    text = seg["text"].strip()
    print(f"[{start:.2f}s -> {end:.2f}s] {text}")

Store this as an intermediate JSON file. It is much easier to format later than re-running inference.

4. Clean with code

Raw transcripts are messy. Fillers, repeated words, and homophone errors creep in. Here is an actual snippet from that Kafka call:

[14:23] okay so um if you look at the um the logs it looks like the uh consumer group is like stuck and you know the lag is just um growing

My first thought was a regex to nuke every "um", "uh", "like", and "you know". I tried it. Then I ran it on a different transcript about Postgres and watched "you know, use a LIKE clause" turn into "use a clause". Not great.

Now I keep cleanup conservative. I only strip obvious stumbles, and I leave sentence words alone. Here is what actually runs:

import re

def clean_chunk(text):
    # Only remove isolated filler sounds, not words that appear in normal sentences
    text = re.sub(r'\s*\b(uh|um)\b[,.;:]?\s*', ' ', text, flags=re.IGNORECASE)
    # Collapse multiple spaces
    text = re.sub(r'\s+', ' ', text)
    # Fix spaces before punctuation
    text = re.sub(r'\s+([.,;!?])', r'\1', text)
    return text.strip()

# Apply to each segment
for seg in segments:
    seg["text"] = clean_chunk(seg["text"])

You will still need to eyeball technical terms. Whisper loves to hallucinate "main" as "mane" or turn "kubectl" into "cube cuddle" if the speaker trails off. A quick grep for your known terms catches the worst offenders.
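A plain grep only finds terms that were transcribed correctly. To surface the near-misses, a fuzzy comparison against your glossary helps. This is a minimal sketch using difflib from the standard library. The function name and the 0.8 cutoff are my assumptions, and it only catches single-word misspellings. A phonetic rewrite like "cube cuddle" still needs an explicit replacement table.

```python
import difflib
import re


def find_near_misses(text, known_terms, cutoff=0.8):
    """Return (found_word, suggested_term) pairs for words that nearly match a known term."""
    lookup = {term.lower(): term for term in known_terms}
    hits = []
    for word in sorted(set(re.findall(r"[A-Za-z][\w.-]*", text))):
        lowered = word.lower().rstrip(".-")
        if lowered in lookup:
            continue  # exact match, nothing to fix
        close = difflib.get_close_matches(lowered, list(lookup), n=1, cutoff=cutoff)
        if close:
            hits.append((word, lookup[close[0]]))
    return hits


transcript = "Run kubctl get pods, then check Reddis lag."
for found, suggestion in find_near_misses(transcript, ["kubectl", "Kafka", "Redis"]):
    print(f"{found!r} looks like {suggestion!r}")
```

Lower the cutoff if it misses too much, raise it if ordinary words start matching your glossary.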

5. Add timestamps and speaker labels

If you are building subtitles, export SRT or VTT. If you are building notes, Markdown with timestamps works better.

Markdown formatter:

def to_markdown(segments, title="Meeting Notes"):
    lines = [f"# {title}\n"]
    for seg in segments:
        t = seg["start"]
        # int() truncates, so 59.7 seconds prints as 00:59 instead of rounding up to 00:60
        lines.append(f"**[{int(t // 60):02d}:{int(t % 60):02d}]** {seg['text']}")
    return "\n\n".join(lines)

Speaker labels are harder. Whisper itself does not do speaker diarization. For two-person interviews, you can often infer the speaker from context. For group meetings, run a dedicated diarization tool like pyannote.audio, then merge the outputs.

I tried pyannote on a five-person standup once. It worked, but the dependency chain was heavier than the transcript itself. For 1:1s, I skip it. For all-hands, I bite the bullet.
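The merge step is simpler than it sounds once both outputs are reduced to time ranges. Here is a sketch that labels each transcript segment with the speaker whose turn overlaps it most. It assumes the diarization result has already been flattened to (start, end, speaker) tuples. That shape and the function names are my choice for illustration, not something any particular tool guarantees.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time ranges."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose turn overlaps it most.

    segments: dicts with "start", "end", "text" (Whisper's shape)
    turns: (start, end, speaker) tuples from any diarization tool
    """
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for start, end, speaker in turns:
            shared = overlap(seg["start"], seg["end"], start, end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append({**seg, "speaker": best_speaker})
    return labeled


demo_segments = [
    {"start": 0.0, "end": 4.0, "text": "The consumer group is stuck."},
    {"start": 4.0, "end": 9.0, "text": "Did anyone check the lag?"},
]
demo_turns = [(0.0, 4.5, "SPEAKER_00"), (4.5, 10.0, "SPEAKER_01")]

for seg in assign_speakers(demo_segments, demo_turns):
    print(f"{seg['speaker']}: {seg['text']}")
```

Segments that overlap no turn at all come back as UNKNOWN, which is a useful marker for the manual pass.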

6. Export and reuse

Match the format to the destination.

Markdown for blog posts or documentation.
SRT or VTT for video subtitles.
Plain text for LLM context windows.

# Quick plain-text dump
with open("output.txt", "w") as f:
    f.write("\n".join([s["text"] for s in segments]))
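For the subtitle route, SRT is simple enough to write by hand: a counter, a time range, the text, and a blank line between cues. A minimal sketch, assuming segments with start, end, and text keys as in the earlier steps. The function names are hypothetical.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before the milliseconds)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Build the contents of an .srt file from transcript segments."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)


demo = [
    {"start": 0.0, "end": 2.5, "text": " Okay, look at the logs."},
    {"start": 2.5, "end": 6.0, "text": " The consumer group is stuck."},
]
print(to_srt(demo))
```

Write the returned string to a .srt file with UTF-8 encoding and most players will pick it up.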

One recording should have multiple lives. Turn a tech talk transcript into a docs page, a Twitter thread, and a set of searchable meeting notes without re-listening to the whole thing.

Accuracy and gotchas

Even large models hallucinate on jargon, acronyms, and numbers. Expect to spend five to ten minutes editing per thirty minutes of clean audio. Budget more if speakers overlap or if the content is heavy on CLI commands and made-up product names.

If your API supports it, pass a prompt with domain context. For Whisper, the initial_prompt parameter lets you seed terms like "Next.js, Redis, and kubectl" so the model knows the vocabulary ahead of time.

During that Kafka review, seeding "consumer lag, partition, and rebalance" stopped Whisper from inventing "consumer leg" and "reborn". Small prompt, big difference.
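In code, the seeding looks roughly like this with the local openai-whisper package. The helper names and the "Glossary:" phrasing are my own. Whisper only uses a limited amount of prompt text, so keep the list short. The hosted API takes a similar hint through its prompt parameter.

```python
def build_initial_prompt(terms):
    """Join glossary terms into one short string for Whisper's initial_prompt."""
    # dict.fromkeys drops duplicates while keeping the original order
    unique = list(dict.fromkeys(t.strip() for t in terms if t.strip()))
    return "Glossary: " + ", ".join(unique) + "."


def transcribe_with_glossary(path, terms, model_name="medium"):
    """Transcribe a file with domain terms seeded through initial_prompt."""
    import whisper  # imported here so the prompt helper works without the package

    model = whisper.load_model(model_name)
    return model.transcribe(path, initial_prompt=build_initial_prompt(terms))


print(build_initial_prompt(["consumer lag", "partition", "rebalance"]))
# Usage (file name is an example):
# result = transcribe_with_glossary("final.wav", ["consumer lag", "partition", "rebalance"])
```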

Final checklist

Before you publish or archive:

Correct all function names, CLI flags, and API endpoints
Verify that timestamps land near the actual speaker changes
Confirm speaker labels if you added them manually
Pick the right export format for your pipeline

Transcription is not magic. It is just another data transformation. Prep your audio, run inference, clean the output, and ship it. Your future self will thank you when you can grep last month's architecture decision in under a second.

How to Transcribe Audio to Text: A Practical Workflow for Developers

Toshius Klay — Tue, 14 Jul 2026 17:24:24 +0000

You've got a 45-minute meeting recording sitting on your desktop. Maybe it's a podcast raw file, or an interview you did for a project. You need it in text, now. Searchable. Timestamped. Clean enough to paste into docs or shove into an LLM context window.

Typing it out by hand sucks. It takes about four times the audio length, and your fingers will hate you. I spent a rainy Tuesday afternoon doing exactly that once. Never again.

Transcription is the bridge between spoken content and text you can actually use. Once audio is structured text, you can grep it, diff it, subtitle it, or feed it to automation. This is the workflow I use now: prep the file, run speech-to-text, clean the output, and export to formats developers actually need.

Why audio to text still matters

Video gets all the hype, but audio is where the actual work lives. Standups, conference talks, voice memos, screencasts. Text makes all of that scannable. It improves accessibility. It creates training data. It lets you pull an exact quote from a user interview without scrubbing through a waveform like you're tuning an old radio.

For developers, transcripts turn unstructured rambling into structured assets. Meeting minutes become markdown files. Tutorial audio becomes documentation you can version control. You can index transcripts in Elasticsearch, embed them for RAG pipelines, or diff versions to see how your docs evolved. I keep a folder of interview transcripts that I treat as a searchable database. Way faster than relistening.

Choose your approach

There are two broad paths: automated and manual. Most real workflows use both, because neither is perfect alone.

Automated transcription means models like OpenAI Whisper, cloud APIs, or local engines. It's fast, and for clear audio it's surprisingly good. The second someone crosstalks, mumbles, or has a heavy accent, accuracy tanks. I've seen Whisper hallucinate entire sentences because a truck drove by.

Manual transcription is just you, a text editor, and a media player. It's tedious. I only do this when accuracy is non-negotiable, like legal stuff or medical content where a wrong word matters.

The hybrid method is the sweet spot. Run automation first, then edit. You get 80 to 90 percent of the way there in minutes. Then you spend human time on polish instead of grunt work. That's where I live now.

Step-by-step: from raw audio to clean text

1. Prepare your audio file

Garbage in, garbage out. Models perform best on clean, single-channel audio at 16 kHz. If your source is a noisy Zoom call or a multi-speaker session, do a quick cleanup pass.

Most tools accept MP3, WAV, M4A, MP4, AAC, FLAC, OGG, WEBM. If you're scripting a pipeline, standardize on WAV or FLAC for lossless input. ffmpeg handles the conversion:

# Extract audio from video, downmix to mono, resample to 16kHz
ffmpeg -i recording.mp4 -ar 16000 -ac 1 -c:a pcm_s16le cleaned.wav

# Strip leading silence to save processing time
ffmpeg -i cleaned.wav -af "silenceremove=start_periods=1:start_duration=0.5:start_threshold=-50dB" trimmed.wav

Check your levels. If one speaker whispers and another shouts, normalize it. Consistent volume drops the word-error rate more than you'd think. I learned this the hard way with a podcast where the host was ten decibels quieter than the guest. The transcript was a mess until I fixed the gain.

2. Transcribe with code

For developers, the most transparent option is running Whisper locally or hitting an API. You control the model, the parameters, and the output format. No black boxes.

Here's a minimal Python snippet using the openai-whisper package:

import json

import whisper

model = whisper.load_model("base")
result = model.transcribe("trimmed.wav", language="en")

# Save raw text
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(result["text"])

# Save segments with timestamps as JSON
with open("output.json", "w", encoding="utf-8") as f:
    json.dump(result["segments"], f, indent=2)

Need subtitles? Generate an SRT directly:

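from whisper.utils import get_writer

# The writer is callable and names the output after the audio file,
# so this writes ./trimmed.srt (signature as in recent openai-whisper
# releases; older versions may differ)
writer = get_writer("srt", ".")
writer(result, "trimmed.wav")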

Cloud services like AssemblyAI or Deepgram offer speaker diarization out of the box. That's handy when you need to label who said what in a meeting. Just watch the API costs. They add up fast if you're processing hours of audio, and nothing hurts like an unexpected bill because you forgot to delete a test file.

3. Edit the raw transcript

Automated output is never final. You'll find missing punctuation, homophone errors, and completely invented words. My favorite was when a speaker said "Kubernetes" and Whisper wrote "commuter news."

The fastest cleanup method is reading while listening at 1.5x speed. Fix spelling, break up wall-of-text paragraphs, and correct technical terms. Your ears catch what your eyes miss.

If you're building a pipeline, add a programmatic first pass. A simple regex strips filler words like "um" and "uh" if you want a clean read:

import re

text = result["text"]
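# Remove "um"/"uh" anywhere, and "like,"/"you know," only when a comma follows,
# so real uses such as "a LIKE clause" survive
cleaned = re.sub(r'\b(?:um|uh)\b,?|\b(?:like|you know),', '', text, flags=re.IGNORECASE)
cleaned = re.sub(r'\s+', ' ', cleaned).strip()  # collapse extra spaces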

Decide on verbatim versus clean. Verbatim keeps false starts and repeats. Clean is easier to read. Match the style to the destination. Documentation wants clean. User research quotes might need verbatim so you don't misrepresent someone's intent.

4. Add structure

Raw text is brutal to scan. Add speaker labels and timestamps.

If Whisper gave you segments, map them:

for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"].strip()
    print(f"[{start:04.1f}s - {end:04.1f}s] {text}")

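For multi-speaker recordings, use diarization. Libraries like pyannote.audio run alongside Whisper to label speakers, and you then merge the two outputs by timestamp:

# Pseudo-code: pyannote_pipeline is assumed to be a pretrained pyannote.audio pipeline loaded earlier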
diarization = pyannote_pipeline("trimmed.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"Speaker {speaker}: {turn.start:.1f}s to {turn.end:.1f}s")

Not every project needs this. But when you're turning a standup recording into structured minutes, speaker labels are essential. I skipped this once and spent twenty minutes trying to figure out which engineer suggested the database migration.

5. Export for your stack

Transcripts are useless if they're trapped in a weird format. Export to whatever your workflow eats.

Markdown for docs, blogs, and Git repos.
JSON for downstream parsing, LLM context windows, or search indexing.
SRT/VTT for video subtitles and web players.

Here's a quick script to convert Whisper segments to markdown with timestamps:

# End each entry with a blank line so Markdown renders it as its own paragraph
lines = ["# Meeting Transcript\n\n"]
for seg in result["segments"]:
    ts = f"{int(seg['start'] // 60)}:{int(seg['start'] % 60):02d}"
    lines.append(f"**{ts}** {seg['text'].strip()}\n\n")

with open("transcript.md", "w") as f:
    f.writelines(lines)

Store the raw JSON alongside the cleaned text. Future you will want those timestamps when someone asks, "What did they say at minute twelve?"
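Answering the "minute twelve" question from the stored JSON takes a few lines. A sketch, assuming the file holds a list of segments with start and end in seconds, as written by the earlier json.dump call. The function names are hypothetical.

```python
import json


def load_segments(path):
    """Load the list of segments saved alongside the cleaned transcript."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def segments_between(segments, start_minute, end_minute):
    """Return segments that overlap the window [start_minute, end_minute), in minutes."""
    window_start = start_minute * 60
    window_end = end_minute * 60
    return [
        seg for seg in segments
        if seg["start"] < window_end and seg["end"] > window_start
    ]


# Usage (file name is an example):
# for seg in segments_between(load_segments("output.json"), 12, 13):
#     print(f"[{seg['start']:.0f}s] {seg['text'].strip()}")
```

Using overlap instead of a strict start-time check keeps a sentence that begins at 11:58 and runs into minute twelve.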

Accuracy tips that actually matter

Model choice matters, but audio hygiene matters more.

Use a decent microphone. A $20 lavalier beats a laptop mic every time. I use one for all my calls now, and the transcription accuracy jumped noticeably.
Reduce echo. Record in smaller rooms or use software noise suppression before transcription. Bare walls are the enemy.
Avoid overlapping speech. Diarization breaks when people talk over each other, and manual cleanup becomes a nightmare.
Spell out acronyms. Models often hallucinate letter sequences. I keep a post-processing dictionary for recurring product names and jargon. It saves me from fixing "AWS" becoming "Ay-Double-U-Ess" or some other nonsense.
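A post-processing dictionary like the one mentioned in the last tip can be as small as this. It is a sketch: the mapping is illustrative, the function name is made up, and matching is case-insensitive on whole words only so it does not rewrite the inside of longer words.

```python
import re


def apply_glossary(text, corrections):
    """Replace known mis-transcriptions with the correct term."""
    # Longest keys first so a longer phrase wins over a shorter one it contains
    for wrong in sorted(corrections, key=len, reverse=True):
        pattern = re.compile(rf"(?<!\w){re.escape(wrong)}(?!\w)", re.IGNORECASE)
        # A function replacement avoids backslash surprises in the replacement text
        text = pattern.sub(lambda _match, right=corrections[wrong]: right, text)
    return text


corrections = {
    "Ay-Double-U-Ess": "AWS",
    "commuter news": "Kubernetes",
}
print(apply_glossary("We host on ay-double-u-ess and run commuter news.", corrections))
```

Keep the dictionary in version control next to the transcripts so every fix you make by hand only has to be made once.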

Putting it into practice

A solid transcription workflow is just a pipeline. Audio enters, text leaves. Prep, inference, cleanup, export. Script what you can. Automate the boring parts. Keep a human review step for anything public-facing.

Start with one file. Convert it, run Whisper, clean the output, and export to markdown. Time yourself. You'll spot your bottleneck fast, whether it's audio quality, editing, or formatting. Fix that first. The rest is just repetition.

How to Turn Interview Audio into Analysis-Ready Transcripts

Toshius Klay — Fri, 03 Jul 2026 14:10:02 +0000

Last year I transcribed forty hours of developer interviews by hand because I didn't trust the AI tools. My wrists hurt. I missed a deadline. I still botched a quote. One participant said they hated Docker. I typed "loved Docker." That single error skewed my feature priority matrix for a week.

Now I use a workflow that is boring, repeatable, and won't let garbage audio wreck your dataset. It goes like this.

1. Record audio that won't wreck your accuracy

I learned this in a glass-walled conference room with a laptop mic. The echo was so bad that "Git" became "get" for two straight hours. I had to guess context on twelve different lines. Never again.

Use a directional mic or a decent USB interface. Laptop mics grab keyboard clatter and fan hum. Record in a small, carpeted room if you can. Hard surfaces bounce sound and confuse speech engines.

For remote sessions, make participants wear headphones. It stops their speakers from bleeding into your recording. Ask people not to talk over each other. If you need speaker labels later, have them introduce themselves at the top.

If you are pulling audio from a Zoom recording, extract and convert it first. ffmpeg handles this in one pass:

ffmpeg -i interview.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 interview.wav

Mono WAV at 16kHz. Speech engines prefer it. Mono removes stereo separation weirdness, and 16kHz covers the vocal range without bloating your file.

2. Generate the draft automatically

I upload the file to whatever engine I'm using that month. Lately it is Whisper.cpp running locally because I got paranoid about participant data hitting cloud APIs. Last year I burned through Otter credits. The service matters less than the settings.

I pick verbatim mode when I am looking for hesitation or power dynamics. It keeps the ums, false starts, and pauses. I pick clean mode for thematic analysis or when I am handing quotes to a PM who just wants the point, not the verbal stumbles.

If the tool offers auto-detect for language, verify it. I once ran a mixed English-German session and the engine tagged the whole file as Dutch. The gibberish propagated for pages before I noticed.

Do not treat the raw file as final. Automated transcripts hit maybe 85-95% accuracy in ideal conditions. Accents, jargon, and crosstalk drop that number fast.
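If you want a number for your own recordings instead of a vendor's estimate, hand-correct a short excerpt and compute the word error rate against the automated draft. Reading accuracy as one minus WER, the 85-95% range corresponds to roughly one wrong word in every seven to twenty. This is a plain edit-distance sketch. It lowercases but does not strip punctuation, which is a simplification.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Classic edit-distance table, computed one row at a time
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


corrected = "I hated Docker from day one"
automated = "I loved Docker from day one"
print(f"WER: {word_error_rate(corrected, automated):.1%}")
```

One wrong word out of six looks harmless as a percentage, which is exactly why the review step matters.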

3. Review like you are being audited

This is the step I used to skip. It cost me.

Open the draft next to the audio. Play it at 1.0x or 1.25x. Fix these exact things:

Misheard domain terms. "React" becomes "reactant." "Kubernetes" turns into phonetic mush. These look tiny but destroy coding accuracy.
Speaker labels. Auto-tools merge speakers during crosstalk. Label each turn yourself.
Crosstalk and drops. When two people talk at once, the transcript may mash both voices into nonsense. Mark gaps with [inaudible] so you do not code silence as agreement.
Punctuation for meaning. A missing comma can flip enthusiasm into sarcasm.
Nonverbal cues only if they matter. I mark laughter or long pauses with a standard notation. I skip them if I am just hunting for feature requests.

Here is the template I paste into my editor:

[00:03:15] Interviewer: Walk me through how you deploy to production.
[00:03:18] Participant: Usually we just run the script, wait for the build, and then... actually, sometimes we check the logs first.
[00:03:24] [pause 3s]
[00:03:27] Participant: If it's a Friday, we don't deploy at all.
[00:03:30] [laughter]

HH:MM:SS timestamps and bracketed tags. I keep lines under 100 characters so they import cleanly into qualitative tools.
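A format like that is easy to lint before import. The sketch below checks three things: each line matches the template, timestamps never go backwards, and lines stay under the length limit. The regular expression is my reading of the template above, and the function name is hypothetical.

```python
import re

LINE = re.compile(
    r"^\[(\d{2}):([0-5]\d):([0-5]\d)\] "
    r"(?:(?P<speaker>[^:\[\]]+): (?P<text>.+)|\[(?P<tag>[^\[\]]+)\])$"
)


def validate_transcript(text, max_length=100):
    """Return (line_number, problem) pairs for lines that break the template."""
    problems = []
    last_seconds = -1
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        match = LINE.match(line)
        if not match:
            problems.append((number, "does not match template"))
            continue
        if len(line) >= max_length:
            problems.append((number, "line too long"))
        hours, minutes, seconds = (int(g) for g in match.group(1, 2, 3))
        total = hours * 3600 + minutes * 60 + seconds
        if total < last_seconds:
            problems.append((number, "timestamp goes backwards"))
        last_seconds = total
    return problems


draft = "[00:03:30] [laughter]\n[00:03:10] Participant: Wait.\n00:04:00 Participant: no brackets"
for line_number, problem in validate_transcript(draft):
    print(f"line {line_number}: {problem}")
```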

4. Format for your analysis stack

I've imported transcripts into NVivo, Dovetail, Atlas.ti, and plain Git repos. Consistency matters more than aesthetics. NVivo chokes on inconsistent timestamps. Dovetail gets weird if your speaker labels change format between files.

Standardize these before you call it done:

Speaker names. Pick "Interviewer / Participant" or "P1 / P2" and stick with it across every file.
Paragraph breaks. Start a new paragraph when the topic shifts, not just when someone stops talking.
Timestamps. Drop them every 30-60 seconds, or at every speaker turn if your tool requires it.

If your team works async across time zones, timestamps are the only way another researcher can pull the original clip for context.

When I store transcripts in Git for team review, I add YAML frontmatter so we can search later:

---
project: onboarding-research
session_id: 2024-06-12_p5
participant_id: P5
method: semi-structured-interview
transcript_type: clean
duration_minutes: 42
---

This turns a folder of text files into something you can actually query.
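Querying can be as plain as a short script. The sketch below deliberately avoids a YAML dependency and only understands flat key: value frontmatter like the block above. That is a simplification on my part (no nesting, lists, or quoting), and both function names are hypothetical.

```python
from pathlib import Path


def parse_frontmatter(text):
    """Parse a flat 'key: value' frontmatter block; returns {} if none is present."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}  # no closing delimiter, treat the block as missing


def find_transcripts(folder, **filters):
    """Yield paths of .md files whose frontmatter matches every filter."""
    for path in sorted(Path(folder).glob("*.md")):
        meta = parse_frontmatter(path.read_text(encoding="utf-8"))
        if all(meta.get(key) == str(value) for key, value in filters.items()):
            yield path


# Usage (folder name is an example):
# for path in find_transcripts("transcripts", project="onboarding-research", transcript_type="clean"):
#     print(path)
```

All values are compared as strings, so duration_minutes=42 matches the "42" read from the file.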

5. Export and version your files

Save two copies. Every time.

Raw automated output. This is your audit trail.
Corrected transcript with final labels and formatting.

Export depends on your pipeline:

TXT or MD for coding platforms.
DOCX if your team lives in Word comments.
JSON if you are feeding them into a custom NLP pipeline.

Keep both. Six months later, when a stakeholder asks if that brutal quote is real, you need to trace it back to the source audio without starting from zero.

Verbatim vs. clean: pick one and lock it in

I once switched formats mid-project because I got lazy. I had to re-review every file. Do not be me.

Use verbatim when:

You are studying language patterns, pauses, or interaction dynamics.
Your methodology is discourse or conversation analysis.

Use clean when:

You are hunting for themes, pain points, or feature requests.
A PM or executive will read it and only cares about the content, not the delivery.

Most of my UX work uses clean transcripts. Academic work usually needs verbatim. You can generate both. Keep the verbatim file as your source of truth, then derive a cleaned copy for reporting.

Closing the loop

Good transcription is a quality gate for your whole project. A transcript with mislabeled speakers or missing context will send your coding sideways. You will not notice until you are writing findings at 11 PM and questioning your sanity.

Record clean audio. Generate a draft. Review it line by line with the original playing. Format it consistently. Lock in your raw and edited versions.

Your future self, staring at fifty coded segments at 11 PM, will thank you.

Build a Reliable AI Transcription Pipeline: A Developer’s Field Guide

Toshius Klay — Fri, 03 Jul 2026 14:09:08 +0000

You shipped the feature last Tuesday. Upload audio, hit transcribe, display text. By Friday your users were complaining about garbled timestamps, missing speaker labels, and a bill that made your CFO flinch. Raw API output isn't enough for production. You need a pipeline.

Most speech-to-text tutorials stop at curl. They don't cover audio preprocessing, model selection, or how to clean up the mess that comes back when three people talk over each other in a Zoom recording. This guide walks through what actually works.

How the sausage gets made

AI transcription isn't a black box you lob files into. It's a chain of decisions. Audio gets normalized, chunked, fed to an acoustic model, and reconstructed into text. Then a language model guesses punctuation and paragraphs. If you want speaker labels, a separate diarization model runs alongside it.

The pipeline looks like this:

1. Audio input and format normalization
2. Chunking and resampling
3. Model inference (ASR)
4. Post-processing (punctuation, formatting)
5. Speaker diarization (optional but painful)
6. Export and storage

Skip step 1 or 2 and you'll pay for step 3 twice.
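Chunking is mostly arithmetic. A sketch of how the windows for the chunking stage might be planned, with a small overlap so a word cut at one boundary appears whole in the next chunk. The 30-second default matches the window Whisper works on internally. The overlap value and the function name are my assumptions.

```python
def plan_chunks(duration, chunk_seconds=30.0, overlap_seconds=2.0):
    """Return (start, end) windows covering [0, duration] with a fixed overlap."""
    if chunk_seconds <= overlap_seconds:
        raise ValueError("chunk must be longer than the overlap")
    windows = []
    start = 0.0
    step = chunk_seconds - overlap_seconds
    while start < duration:
        end = min(start + chunk_seconds, duration)
        windows.append((start, end))
        if end >= duration:
            break  # the last window already reaches the end of the file
        start += step
    return windows


for start, end in plan_chunks(70):
    print(f"{start:5.1f}s -> {end:5.1f}s")
```

Each window can then be cut with ffmpeg or sliced in memory. When stitching results, drop words from the overlap that appear twice.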

Step 1: Fix your audio before it hits the API

Developers love to send whatever comes out of the browser's <input type="file"> straight to the cloud. Don't. APIs have preferences, and your users upload garbage.

Standardize on these specs:

Format: Mono WAV or FLAC. Stereo confuses some models.
Sample rate: 16 kHz or 24 kHz. Resample if you have to.
Bit depth: 16-bit PCM. No 32-bit float surprises.
Preprocessing: Normalize loudness to -16 LUFS. Strip silence longer than 2 seconds if your ASR bills by duration.

Use ffmpeg. It's ugly but it's everywhere.

ffmpeg -i input.m4a -ar 16000 -ac 1 -sample_fmt s16 output.wav

That one line fixes half the accuracy issues you'll see in production. It converts variable-bitrate user uploads into something the model actually expects.
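It is worth asserting the result before spending money on inference. A small sketch with the standard-library wave module that checks a converted file against the specs listed above. The function name is hypothetical. As far as I know, wave.open raises wave.Error on non-PCM data such as 32-bit float, so that case surfaces as an exception rather than a list entry.

```python
import wave


def check_wav_spec(path, allowed_rates=(16000, 24000)):
    """Return a list of spec violations for a WAV file (empty list means it passes)."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append(f"expected mono, got {wav.getnchannels()} channels")
        if wav.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, got {wav.getsampwidth() * 8}-bit")
        if wav.getframerate() not in allowed_rates:
            problems.append(f"unexpected sample rate {wav.getframerate()} Hz")
    return problems


# Usage (path is an example):
# issues = check_wav_spec("output.wav")
# if issues:
#     raise SystemExit("; ".join(issues))
```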

Step 2: Pick the right engine for the job

Not all transcription APIs are the same. They optimize for different things. Here's the honest breakdown:

OpenAI Whisper (API or self-hosted)

Great accuracy across languages
Cheap on API, cheaper if you run base or small locally on a CPU
No native speaker diarization
Slower than cloud providers on long files

Google Cloud Speech-to-Text

Excellent real-time streaming via gRPC
Good speaker diarization (up to 8 speakers in some configs)
Pricier, especially with premium models like latest_long
Needs audio in specific encodings (LINEAR16, MULAW, etc.)

AWS Transcribe

Solid medical and call analytics variants
Speaker partitioning works but lags behind Google
Turnaround is batch-oriented; real-time exists but feels bolted on

Deepgram Nova

Fast. Like, actually fast.
Good at messy audio (background noise, accents)
Speaker diarization is decent but costs extra

For most apps, Whisper hits the sweet spot of cost and accuracy. If you need live captions during a WebRTC call, Google Cloud's streaming API is hard to beat.

Step 3: Code a resilient batch pipeline

Here's a minimal Python worker that handles the full flow. It preprocesses with ffmpeg, sends to an API, and structures the output. Swap in your provider of choice.

import os
import subprocess
import requests
from pathlib import Path

# Read the key from the environment instead of hard-coding it
API_KEY = os.environ["OPENAI_API_KEY"]

def normalize_audio(input_path: Path, output_path: Path) -> None:
    subprocess.run([
        "ffmpeg", "-y", "-i", str(input_path),
        "-ar", "16000", "-ac", "1", "-sample_fmt", "s16",
        str(output_path)
    ], check=True)

def transcribe_file(audio_path: Path) -> dict:
    with open(audio_path, "rb") as f:
        # Example using Whisper API; swap for Deepgram, AWS, etc.
        response = requests.post(
            "https://api.openai.com/v1/audio/transcriptions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={
                "model": "whisper-1",
                "response_format": "verbose_json",
                "timestamp_granularities[]": "word"
            }
        )
        response.raise_for_status()
        return response.json()

def process_upload(raw_path: Path) -> dict:
    clean_path = raw_path.with_suffix(".wav")
    normalize_audio(raw_path, clean_path)

    result = transcribe_file(clean_path)

    # Post-process: rebuild transcript with word-level timestamps
    words = result.get("words", [])
    segments = []
    current_segment = {"start": words[0]["start"], "text": ""}

    for word in words:
        current_segment["text"] += word["word"] + " "
        # New sentence heuristic
        if word["word"].endswith((".", "!", "?")):
            current_segment["end"] = word["end"]
            segments.append(current_segment)
            current_segment = {"start": None, "text": ""}

    return {
        "duration": result.get("duration"),
        "segments": segments,
        "raw": result.get("text")
    }

Here's what actually matters in that snippet. We ask for verbose_json and word timestamps. That granularity lets you rebuild sentences cleanly instead of accepting a wall of text. We also normalize audio before upload. Don't let users foot the bill for weird codecs.
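
The worker makes a single request and raises on any failure, so one dropped connection kills the job. One way to soften that is a small retry wrapper with exponential backoff. This is a minimal sketch: with_retries and its defaults are hypothetical names I picked for illustration, not part of any provider SDK.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            # Out of attempts: surface the last error to the caller
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("attempts must be at least 1")
```

Usage would look like with_retries(lambda: transcribe_file(clean_path)). In a real worker you'd probably narrow the except clause to requests.exceptions.RequestException and retry only on timeouts, 429s, and 5xx responses, since retrying a 401 just burns time.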

Step 4: Add speaker labels without losing your mind

Speaker diarization is still the hardest part of transcription. Most APIs that offer it charge more, and the accuracy drops when speakers interrupt each other.

If your provider supports it, enable it at the API level. If not, you'll need a separate model like pyannote.audio. When each speaker is recorded on their own channel, AWS Transcribe's channel identification is another option.

A cheap heuristic that works for interviews: force single-channel audio and ask the API to partition speakers. If that fails, fall back to a secondary diarization pass on the normalized file.

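# Dual-pass pipeline. run_pyannote() is a placeholder: it should return
# speaker turns as (start, end, label) tuples.
transcript = transcribe_file(clean_path)
turns = run_pyannote(clean_path)

def find_speaker_at(t):
    for start, end, label in turns:
        if start <= t < end:
            return label
    return None  # word falls in a gap between turns

# Merge word timestamps with speaker segments
for word in transcript.get("words", []):
    word["speaker"] = find_speaker_at(word["start"])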

It's not perfect. You'll still need manual review for content that ships to customers. But it gets you 90% of the way there.
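
Word-level labels are noisy, and readers want one speaker per sentence anyway. A simple way to get there is a majority vote over the words inside each segment. This is a sketch under assumptions: label_segments is a hypothetical helper, segments are dicts with start, end, and text, and each word dict carries the speaker key set by the merge loop.

```python
from collections import Counter

def label_segments(segments, words):
    """Attach the most common word-level speaker to each sentence segment."""
    labeled = []
    for seg in segments:
        # Count speakers of words that start inside this segment.
        # Bounds are inclusive so zero-length words at the edge still count.
        votes = Counter(
            w["speaker"]
            for w in words
            if w.get("speaker") is not None
            and seg["start"] <= w["start"] <= seg["end"]
        )
        speaker = votes.most_common(1)[0][0] if votes else None
        labeled.append({**seg, "speaker": speaker})
    return labeled
```

A segment with no labeled words gets speaker None, which is a handy flag for the manual review queue.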

Step 5: Format for humans, not machines

Nobody wants a raw JSON dump. Your end users want paragraphs, timestamps they can click, and speaker names.

Structure your final output like this:

{
  "segments": [
    {
      "speaker": "A",
      "start": 0.0,
      "end": 4.5,
      "text": "The API returns words, but humans read sentences."
    }
  ]
}

Export to SRT if you're building subtitles:

1
00:00:00,000 --> 00:00:04,500
The API returns words, but humans read sentences.
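
Going from the segment structure to SRT is mostly timestamp formatting. Here's a minimal sketch, assuming segments are dicts with start and end in seconds plus a text field. The helper names srt_timestamp and to_srt are mine, not from any library.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)
```

Long sentences make for unreadable subtitles, so for real captions you'd likely split segments by length before rendering.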

And always store the raw API response. When a user reports an error, you'll want to replay it without burning more credits.
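
One way to do that is to key the stored response by a hash of the audio, so the same upload always maps to the same file. This is a sketch, not a storage design: the function names and the flat-directory layout are assumptions, and in production you'd probably use object storage instead of local disk.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional

def _raw_path(audio_path: Path, out_dir: Path) -> Path:
    # Same audio bytes always produce the same file name
    digest = hashlib.sha256(audio_path.read_bytes()).hexdigest()
    return out_dir / f"{digest}.json"

def save_raw_response(audio_path: Path, response: dict, out_dir: Path) -> Path:
    """Write the untouched API response to disk and return its path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = _raw_path(audio_path, out_dir)
    out_path.write_text(json.dumps(response, ensure_ascii=False), encoding="utf-8")
    return out_path

def load_raw_response(audio_path: Path, out_dir: Path) -> Optional[dict]:
    """Return the stored response for this audio, or None if never transcribed."""
    path = _raw_path(audio_path, out_dir)
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

Checking load_raw_response before calling the API also gives you a free cache for re-uploads of the same file.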

The tradeoff framework you actually need

You'll face three knobs in production. Here's how to turn them.

Speed vs. accuracy
Fast modes use smaller models. Use them for search indexing and internal notes. Use best-quality models for customer-facing captions and compliance logs.

Cost vs. precision
Batch processing is cheaper per minute than real-time. If you don't need live captions, don't pay for streaming. Reserve premium engines (Google's latest_long, Nova-2) for your highest-value content.

Speaker labels vs. complexity
Don't enable diarization unless someone reads the labels. If it's just a giant blob of text for full-text search, skip it. You'll save money and processing time.

Common gotchas

Timestamps can drift on files longer than about 30 minutes. Chunk at 10-minute boundaries if your API allows it.
Code switching (mixing languages in one file) breaks most monolingual models. Split by language if possible.
Profanity filters in enterprise APIs will asterisk out words in medical or legal transcripts. Disable them if your provider lets you.
WebRTC audio is often sampled at 48 kHz stereo. Downsample before sending.
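
If you do chunk long files, the part that's easy to get wrong is shifting each chunk's timestamps back onto the original timeline. Here's a rough sketch of the bookkeeping, assuming you cut the audio yourself (for example with ffmpeg's -ss and -t flags) and get per-chunk segments with start and end in seconds. Both helper names are hypothetical.

```python
def chunk_bounds(duration: float, chunk_s: float = 600.0):
    """Return (start, end) pairs covering [0, duration] in chunk_s-second steps."""
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_s, duration)
        bounds.append((start, end))
        start = end
    return bounds

def merge_chunks(chunk_results):
    """Shift per-chunk segments back onto the original file's timeline.

    chunk_results is a list of (offset_seconds, segments) pairs, where each
    segment's start and end are relative to the beginning of its chunk.
    """
    merged = []
    for offset, segments in chunk_results:
        for seg in segments:
            merged.append({
                **seg,
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
            })
    return merged
```

Fixed boundaries can slice a word in half. Cutting at a nearby silence instead is worth the extra work if you have the time.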

Wrapping up

Building with AI transcription isn't hard. Building it so it doesn't break in production is. Preprocess your audio, pick an engine that matches your latency budget, and post-process the output into something readable. Treat the API like a component, not a magic wand.

Your users will thank you. Your wallet will too.