DEV Community

nareshipme

Audio Chunking for Long-Form Transcription: Splitting and Stitching with ffmpeg + TypeScript

APIs that do speech-to-text — Groq Whisper, OpenAI Whisper, and friends — all have one thing in common: a file size limit. Groq's hard cap is 25MB. A typical one-hour interview at decent quality can easily be 80–150MB. If you just try to send that, you'll get a 413 or a rate-limit error before the transcription even starts.

The fix is chunking: split the audio into manageable pieces, transcribe each one, then stitch the results back together — with correct timestamps. That last part is where most implementations go wrong.

Here's the approach I landed on, built around ffmpeg and TypeScript.


The Strategy

```
if file < 24MB → send directly (fast path)
else           → chunk into 20-min segments at 32kbps mono → transcribe each → stitch
```

The 20-minute / 32kbps combination keeps each chunk well under 5MB, which gives plenty of headroom below the 25MB limit regardless of source format.
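The arithmetic is worth sanity-checking: encoded size is just bitrate times duration. A quick back-of-the-envelope helper (the function name is mine, not from the pipeline):

```typescript
// Estimated encoded size in bytes: kbps × 1000 gives bits/sec; divide by 8 for bytes.
function estimateChunkBytes(bitrateKbps: number, durationSec: number): number {
  return (bitrateKbps * 1000 * durationSec) / 8;
}

const bytes = estimateChunkBytes(32, 20 * 60);
console.log((bytes / 1e6).toFixed(1)); // → "4.8" MB per chunk
```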


Getting the Duration

Before splitting, we need to know how many chunks to expect. ffprobe (comes with ffmpeg) handles this cleanly:

```typescript
import { execFile } from "child_process";
import { promisify } from "util";
import fs from "fs";     // used by the chunking code below
import path from "path"; // used by the chunking code below

const execFileAsync = promisify(execFile);

async function getAudioDurationSec(audioPath: string): Promise<number> {
  const { stdout } = await execFileAsync("ffprobe", [
    "-v", "error",
    "-show_entries", "format=duration",
    "-of", "default=noprint_wrappers=1:nokey=1",
    audioPath,
  ]);
  const duration = parseFloat(stdout.trim());
  if (Number.isNaN(duration)) {
    throw new Error(`Could not read duration from ${audioPath}`);
  }
  return duration;
}
```
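With the duration in hand, predicting how many chunk files ffmpeg will emit is a one-liner — handy for progress reporting (a usage sketch; the helper name is mine):

```typescript
// Number of segment files ffmpeg will produce for a given duration.
function expectedChunkCount(durationSec: number, chunkDurationSec: number): number {
  return Math.max(1, Math.ceil(durationSec / chunkDurationSec));
}

// e.g. a 65-minute file split into 20-minute chunks:
console.log(expectedChunkCount(65 * 60, 20 * 60)); // → 4
```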

Splitting with ffmpeg -f segment

The segment muxer is the right tool here — it splits at the specified interval, resets timestamps per chunk, and outputs numbered files:

```typescript
async function splitAudioIntoChunks(
  audioPath: string,
  chunkDurationSec: number
): Promise<string[]> {
  const dir = path.dirname(audioPath);
  // Strip whatever extension the input has — don't assume .mp3
  const base = path.basename(audioPath, path.extname(audioPath));
  const pattern = path.join(dir, `${base}-chunk-%03d.mp3`);

  await execFileAsync("ffmpeg", [
    "-i", audioPath,
    "-f", "segment",
    "-segment_time", String(chunkDurationSec),
    "-ar", "16000",   // 16kHz is enough for speech
    "-ac", "1",       // mono — halves file size vs. stereo
    "-b:a", "32k",    // 32kbps → ~4.8MB per 20min chunk
    "-reset_timestamps", "1",
    "-y",
    pattern,
  ]);

  // Collect chunks in order
  const chunkFiles: string[] = [];
  let i = 0;
  while (true) {
    const chunkPath = path.join(dir, `${base}-chunk-${String(i).padStart(3, "0")}.mp3`);
    if (!fs.existsSync(chunkPath)) break;
    chunkFiles.push(chunkPath);
    i++;
  }

  return chunkFiles;
}
```

Key flags to understand:

  • -ar 16000 — 16kHz sample rate is the standard for Whisper models; going higher wastes space without improving accuracy
  • -ac 1 — mono cuts file size in half; diarization (speaker separation) is handled by the STT API, not the audio channel count
  • -reset_timestamps 1 — each chunk's timestamps start at 0, which is what the API expects; we'll add the real offset ourselves during stitching
  • %03d — zero-padded index so glob/sort ordering is consistent

The Offset Problem (and How to Fix It)

When you transcribe chunk 3 (which starts at 40:00 in the original), the API returns segments like [{ start: 0.5, end: 4.2, text: "Hello" }] — relative to the chunk's start, not the original file.

The fix is simple: track a timeOffsetSec and add it to every segment:

```typescript
import Groq from "groq-sdk";

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

interface TranscriptSegment {
  id: number;
  start: number;
  end: number;
  text: string;
}

interface TranscriptResult {
  text: string;
  segments: TranscriptSegment[];
}

async function transcribeChunk(
  filePath: string,
  timeOffsetSec: number
): Promise<TranscriptResult> {
  const transcription = await groq.audio.transcriptions.create({
    file: fs.createReadStream(filePath),
    model: "whisper-large-v3",
    response_format: "verbose_json",
  });

  const rawSegments = (transcription as any).segments ?? [];

  return {
    text: transcription.text,
    segments: rawSegments.map((seg: any, idx: number) => ({
      id: seg.id ?? idx,
      start: (seg.start ?? 0) + timeOffsetSec,  // ← key line
      end:   (seg.end   ?? 0) + timeOffsetSec,
      text:  seg.text,
    })),
  };
}
```

Stitching It All Together

```typescript
export async function transcribeAudio(audioPath: string): Promise<TranscriptResult> {
  const stats = fs.statSync(audioPath);
  const GROQ_MAX_BYTES = 24 * 1024 * 1024;
  const CHUNK_DURATION_SEC = 20 * 60; // 20 minutes

  // Fast path
  if (stats.size <= GROQ_MAX_BYTES) {
    return transcribeChunk(audioPath, 0);
  }

  // Slow path
  const chunkFiles = await splitAudioIntoChunks(audioPath, CHUNK_DURATION_SEC);
  const results: TranscriptResult[] = [];

  for (let i = 0; i < chunkFiles.length; i++) {
    const timeOffsetSec = i * CHUNK_DURATION_SEC;
    const result = await transcribeChunk(chunkFiles[i], timeOffsetSec);
    results.push(result);

    // Clean up immediately — no point keeping finished chunks on disk
    fs.unlink(chunkFiles[i], () => undefined);
  }

  // Stitch: join text with spaces, re-index segments globally
  return {
    text: results.map(r => r.text).join(" "),
    segments: results
      .flatMap(r => r.segments)
      .map((seg, idx) => ({ ...seg, id: idx })),
  };
}
```

The timeOffsetSec = i * CHUNK_DURATION_SEC calculation works because -reset_timestamps 1 makes each chunk start exactly at 0, so chunk N's real offset is exactly N × chunkDuration.


Practical Notes

Disk space: ffmpeg writes all the chunks up front, so a 2-hour file briefly holds six ~4.8MB chunks (~29MB) on disk right after splitting. Each chunk is deleted as soon as it's transcribed, so usage shrinks as the job runs. Keep temp files in os.tmpdir() or a designated scratch directory.
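A minimal sketch of the scratch-directory approach (the directory prefix is arbitrary):

```typescript
import os from "os";
import fs from "fs";
import path from "path";

// mkdtempSync appends a random suffix, so concurrent jobs can't collide.
const scratchDir = fs.mkdtempSync(path.join(os.tmpdir(), "audio-chunks-"));
console.log(scratchDir); // e.g. /tmp/audio-chunks-Ab3xYz

// ...write chunks into scratchDir, transcribe, then remove the whole thing:
fs.rmSync(scratchDir, { recursive: true, force: true });
```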

Boundary words: Words right at a chunk boundary can occasionally be cut mid-utterance. Most STT APIs handle this gracefully, but if you need perfect boundary handling, add a 2–3 second overlap between chunks and strip the overlap region from segments before stitching.
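A sketch of the overlap-stripping step, assuming each chunk is cut with a few extra seconds appended past its nominal end (e.g. via per-chunk -ss/-t rather than the segment muxer) and segment timestamps have already been shifted to absolute time:

```typescript
interface Seg { start: number; end: number; text: string }

// Keep only segments that begin before the chunk's nominal end; anything
// starting later falls inside the overlap and will be re-transcribed at
// the beginning of the next chunk.
function stripOverlapSegments(
  segments: Seg[],
  chunkIndex: number,
  chunkDurationSec: number
): Seg[] {
  const nominalEnd = (chunkIndex + 1) * chunkDurationSec;
  return segments.filter(s => s.start < nominalEnd);
}
```

So with 20-minute chunks, a segment starting at 1201s inside chunk 0's overlap tail is dropped here and picked up again from chunk 1.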

Rate limits: If you're processing many files, add a small delay between chunk requests. Groq's error messages include "Please try again in Xm Ys" — parse that and respect it:

```typescript
function parseRetryAfterMs(message: string): number | null {
  // "Please try again in 2m59s" → minutes + seconds, plus a 5s safety buffer
  const match = message.match(/try again in (\d+)m(\d+)s/);
  if (match) return (parseInt(match[1], 10) * 60 + parseInt(match[2], 10) + 5) * 1000;
  // "Please try again in 30s"
  const secMatch = message.match(/try again in (\d+)s/);
  if (secMatch) return (parseInt(secMatch[1], 10) + 5) * 1000;
  return null;
}
```
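The parser slots naturally into a retry wrapper — a sketch (the wrapper name and retry budget are mine; it takes the parser as a parameter so it works with any provider's error format):

```typescript
type RetryParser = (message: string) => number | null;

async function withRateLimitRetry<T>(
  fn: () => Promise<T>,
  parseRetryMs: RetryParser, // e.g. the parseRetryAfterMs above
  maxRetries = 3
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const waitMs = parseRetryMs(String(err?.message ?? ""));
      // Rethrow if the error isn't a parseable rate limit, or we're out of retries.
      if (waitMs === null || attempt >= maxRetries) throw err;
      await new Promise(resolve => setTimeout(resolve, waitMs));
    }
  }
}
```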

Other providers: This pattern works with any size-limited STT API. Swap out the transcribeChunk internals for OpenAI, AssemblyAI, or Sarvam — the chunking and stitching logic stays the same.


Summary

| Step | Tool | Key detail |
|---|---|---|
| Duration probe | `ffprobe` | Parse float from stdout |
| Split | `ffmpeg -f segment` | 16kHz, mono, 32kbps, reset timestamps |
| Transcribe | Groq / OpenAI / etc. | `verbose_json` for segment data |
| Stitch timestamps | TypeScript | `seg.start + i * chunkDuration` |
| Cleanup | `fs.unlink` | Async, fire-and-forget after each chunk |

The full implementation with Groq Whisper is about 150 lines and handles the fast path (small files go direct), slow path (chunking), and rate-limit retry messaging cleanly. Works in any Node.js environment where ffmpeg is installed.
