APIs that do speech-to-text — Groq Whisper, OpenAI Whisper, and friends — all have one thing in common: a file size limit. Groq's hard cap is 25MB. A typical one-hour interview at decent quality can easily be 80–150MB. If you just try to send that, you'll get a 413 or a rate-limit error before the transcription even starts.
The fix is chunking: split the audio into manageable pieces, transcribe each one, then stitch the results back together — with correct timestamps. That last part is where most implementations go wrong.
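Getting the stitching right mostly comes down to offsetting timestamps: each chunk's transcript reports times relative to that chunk's own start, so every segment has to be shifted by the combined duration of the chunks before it. A minimal sketch, assuming a Whisper-style verbose response with per-segment start/end times (the type names here are mine, not any SDK's):

```typescript
// Hypothetical shapes for a Whisper-style verbose transcription result.
interface Segment {
  start: number; // seconds, relative to the chunk this segment came from
  end: number;
  text: string;
}

interface ChunkResult {
  durationSec: number; // actual duration of the audio chunk
  segments: Segment[];
}

// Stitch per-chunk transcripts into one timeline by shifting each
// chunk's timestamps by the total duration of the chunks before it.
function stitch(chunks: ChunkResult[]): Segment[] {
  const out: Segment[] = [];
  let offset = 0;
  for (const chunk of chunks) {
    for (const seg of chunk.segments) {
      out.push({
        start: seg.start + offset,
        end: seg.end + offset,
        text: seg.text,
      });
    }
    offset += chunk.durationSec; // advance by the chunk's real length
  }
  return out;
}
```

The important detail is using each chunk's *actual* duration for the offset, not the nominal segment length — the last chunk is almost never a full 20 minutes.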
Here's the approach I landed on, built around ffmpeg and TypeScript.
The Strategy
```
if file < 24MB → send directly (fast path)
else → chunk into 20-min segments at 32kbps mono → transcribe each → stitch
```
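The dispatch itself can stay tiny. A sketch of the size check, with the threshold mirroring the fast path above (function names are mine):

```typescript
import { stat } from "node:fs/promises";

const DIRECT_LIMIT_BYTES = 24 * 1024 * 1024; // 24 MB: margin under the 25 MB cap

// Pure decision, so it's testable without touching the filesystem.
function shouldChunk(sizeBytes: number): boolean {
  return sizeBytes >= DIRECT_LIMIT_BYTES;
}

// Usage: stat the real file, then pick a path.
async function route(path: string): Promise<"direct" | "chunked"> {
  const { size } = await stat(path);
  return shouldChunk(size) ? "chunked" : "direct";
}
```

Keeping the threshold at 24MB rather than 25MB avoids edge cases where the multipart upload overhead nudges a borderline file over the API's limit.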
The 20-minute / 32kbps combination keeps each chunk well under 5MB — 1,200 s × 32 kbit/s ÷ 8 ≈ 4.8 MB — which leaves plenty of headroom below the 25MB limit regardless of source format.
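The chunking step itself is one ffmpeg invocation using its segment muxer. A sketch, assuming ffmpeg is on `PATH` (`buildChunkArgs` and `chunkAudio` are my names, not part of any library):

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

// Assemble the ffmpeg arguments: re-encode to 32 kbps mono MP3 and
// cut the output into 20-minute pieces with the segment muxer.
function buildChunkArgs(input: string, outDir: string, segmentSec = 1200): string[] {
  return [
    "-i", input,
    "-vn",                               // drop video / cover-art streams
    "-ac", "1",                          // downmix to mono
    "-b:a", "32k",                       // 32 kbps audio bitrate
    "-f", "segment",                     // split output into pieces
    "-segment_time", String(segmentSec), // 20 minutes per piece
    `${outDir}/chunk_%03d.mp3`,          // chunk_000.mp3, chunk_001.mp3, …
  ];
}

// Run it, assuming ffmpeg is installed and on PATH.
async function chunkAudio(input: string, outDir: string): Promise<void> {
  await promisify(execFile)("ffmpeg", buildChunkArgs(input, outDir));
}
```

Re-encoding (rather than stream-copying with `-c copy`) is deliberate: it normalizes whatever the source format was down to a predictable bitrate, which is what makes the per-chunk size math hold.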