APIs that do speech-to-text — Groq Whisper, OpenAI Whisper, and friends — all have one thing in common: a file size limit. Groq's hard cap is 25MB. A typical one-hour interview at decent quality can easily be 80–150MB. If you just try to send that, you'll get a 413 or a rate-limit error before the transcription even starts.
The fix is chunking: split the audio into manageable pieces, transcribe each one, then stitch the results back together — with correct timestamps. That last part is where most implementations go wrong.
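Getting the stitching right mostly comes down to offsetting timestamps: each chunk's transcript reports times relative to that chunk's own start, so every segment has to be shifted by the combined duration of the chunks before it. A minimal sketch, assuming a Whisper-style verbose response with per-segment start/end times (the type names here are mine, not any SDK's):

```typescript
// Hypothetical shapes for a Whisper-style verbose transcription result.
interface Segment {
  start: number; // seconds, relative to the chunk this segment came from
  end: number;
  text: string;
}

interface ChunkResult {
  durationSec: number; // actual duration of the audio chunk
  segments: Segment[];
}

// Stitch per-chunk transcripts into one timeline by shifting each
// chunk's timestamps by the total duration of the chunks before it.
function stitch(chunks: ChunkResult[]): Segment[] {
  const out: Segment[] = [];
  let offset = 0;
  for (const chunk of chunks) {
    for (const seg of chunk.segments) {
      out.push({
        start: seg.start + offset,
        end: seg.end + offset,
        text: seg.text,
      });
    }
    offset += chunk.durationSec; // advance by the chunk's real length
  }
  return out;
}
```

The important detail is using each chunk's *actual* duration for the offset, not the nominal segment length — the last chunk is almost never a full 20 minutes.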
Here's the approach I landed on, built around ffmpeg and TypeScript.
The Strategy
```
if file < 24MB → send directly (fast path)
else → chunk into 20-min segments at 32kbps mono → transcribe each → stitch
```
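The dispatch itself can stay tiny. A sketch of the size check, with the threshold mirroring the fast path above (function names are mine):

```typescript
import { stat } from "node:fs/promises";

const DIRECT_LIMIT_BYTES = 24 * 1024 * 1024; // 24 MB: margin under the 25 MB cap

// Pure decision, so it's testable without touching the filesystem.
function shouldChunk(sizeBytes: number): boolean {
  return sizeBytes >= DIRECT_LIMIT_BYTES;
}

// Usage: stat the real file, then pick a path.
async function route(path: string): Promise<"direct" | "chunked"> {
  const { size } = await stat(path);
  return shouldChunk(size) ? "chunked" : "direct";
}
```

Keeping the threshold at 24MB rather than 25MB avoids edge cases where the multipart upload overhead nudges a borderline file over the API's limit.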
The 20-minute / 32kbps combination keeps each chunk well under 5MB — 1,200 s × 32 kbit/s ÷ 8 ≈ 4.8 MB — which leaves plenty of headroom below the 25MB limit regardless of source format.
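The chunking step itself is one ffmpeg invocation using its segment muxer. A sketch, assuming ffmpeg is on `PATH` (`buildChunkArgs` and `chunkAudio` are my names, not part of any library):

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

// Assemble the ffmpeg arguments: re-encode to 32 kbps mono MP3 and
// cut the output into 20-minute pieces with the segment muxer.
function buildChunkArgs(input: string, outDir: string, segmentSec = 1200): string[] {
  return [
    "-i", input,
    "-vn",                               // drop video / cover-art streams
    "-ac", "1",                          // downmix to mono
    "-b:a", "32k",                       // 32 kbps audio bitrate
    "-f", "segment",                     // split output into pieces
    "-segment_time", String(segmentSec), // 20 minutes per piece
    `${outDir}/chunk_%03d.mp3`,          // chunk_000.mp3, chunk_001.mp3, …
  ];
}

// Run it, assuming ffmpeg is installed and on PATH.
async function chunkAudio(input: string, outDir: string): Promise<void> {
  await promisify(execFile)("ffmpeg", buildChunkArgs(input, outDir));
}
```

Re-encoding (rather than stream-copying with `-c copy`) is deliberate: it normalizes whatever the source format was down to a predictable bitrate, which is what makes the per-chunk size math hold.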