nareshipme

Building a Robust Real-Time Transcription Pipeline in Next.js: STT, Streaming, and Error Recovery

Transcribing video and audio at scale is deceptively complex. You're not just calling an STT API and returning the result — you're managing file chunking, concurrent requests, partial failures, and real-time updates. Here's how we built a production-grade transcription pipeline in Next.js.

The Problem

When processing user-uploaded videos, we needed:

  • Streaming results as transcription completes (not waiting for the whole file)
  • Automatic retry on API failures without losing progress
  • Chunk-based processing to avoid timeout and memory limits
  • Multi-language support with fallback providers
  • Real-time UI updates as captions are generated

Architecture

We use a three-layer approach:

  1. Client: Stream audio blobs in 30-second chunks
  2. API Route: Queue chunks, handle retries, emit progress events
  3. Background Worker: Heavy lifting with Inngest (idempotent steps, automatic retries)

Layer 1: Client-Side Streaming

// Send 30-second chunks to the API
async function transcribeChunked(file: File, videoId: string) {
  const CHUNK_SIZE = 30 * 1000; // 30 seconds, in ms — must match getAudioDuration's unit
  const duration = await getAudioDuration(file);

  for (let start = 0; start < duration; start += CHUNK_SIZE) {
    const blob = await extractAudioChunk(file, start, start + CHUNK_SIZE);
    const formData = new FormData();
    formData.append('audio', blob);
    formData.append('startTime', start.toString());
    formData.append('videoId', videoId);

    const response = await fetch('/api/transcribe', {
      method: 'POST',
      body: formData,
    });

    // Stream NDJSON events back
    const reader = response.body!.getReader();
    const decoder = new TextDecoder();
    let buffer = ''; // holds a partial line between reads

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      // stream: true keeps multi-byte characters intact across reads
      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split('\n');
      buffer = lines.pop() ?? ''; // the last piece may be an incomplete line

      for (const line of lines.filter(Boolean)) {
        const event = JSON.parse(line);
        if (event.type === 'caption') {
          updateCaption(event.text, event.startTime);
        }
      }
    }
  }
}
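The chunking loop above can be factored into a pure helper that's easy to unit-test. `chunkRanges` is a hypothetical name (not part of our API), and it assumes duration and chunk size are expressed in the same unit:

```typescript
// Compute [start, end) ranges covering a media file of the given
// duration. The final chunk is clipped to the file's end.
// Hypothetical helper; units (ms vs. s) must match getAudioDuration.
function chunkRanges(
  duration: number,
  chunkSize: number
): Array<{ start: number; end: number }> {
  const ranges: Array<{ start: number; end: number }> = [];
  for (let start = 0; start < duration; start += chunkSize) {
    ranges.push({ start, end: Math.min(start + chunkSize, duration) });
  }
  return ranges;
}
```

For example, a 75-second file at 30-second chunks yields three ranges, with the last one only 15 seconds long.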

Layer 2: API Route with Queueing

// app/api/transcribe/route.ts
import { inngest } from '@/lib/inngest';

export async function POST(req: Request) {
  const formData = await req.formData();
  const blob = formData.get('audio') as Blob;
  const startTime = parseInt(formData.get('startTime') as string, 10);
  const videoId = formData.get('videoId') as string;

  // Trigger background job; send() resolves with the queued event IDs.
  // Note: event payloads are JSON-serialized, so for large chunks you'd
  // upload the audio to storage and pass a URL instead of raw bytes.
  const { ids: [runId] } = await inngest.send({
    name: 'video/transcribe-chunk',
    data: { videoId, audioBuffer: await blob.arrayBuffer(), startTime },
  });

  // Stream progress back to client as NDJSON
  return new Response(
    streamTranscriptionProgress(videoId, runId),
    { headers: { 'Content-Type': 'application/x-ndjson' } }
  );
}
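`streamTranscriptionProgress` isn't shown above, but its core is easy to sketch: turn an async iterable of progress events into the NDJSON byte stream the route returns. How events are sourced (e.g. a Redis subscription on `transcribe:${videoId}`) is an assumption; `ndjsonStream` is a hypothetical name:

```typescript
// Hypothetical sketch of the streaming half of streamTranscriptionProgress:
// encode each event as one JSON line and push it to the HTTP response body.
function ndjsonStream(events: AsyncIterable<unknown>): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  return new ReadableStream<Uint8Array>({
    async start(controller) {
      for await (const event of events) {
        // One complete JSON document per line — the client splits on '\n'
        controller.enqueue(encoder.encode(JSON.stringify(event) + '\n'));
      }
      controller.close();
    },
  });
}
```

Because each line is a complete JSON document, the client can parse events the moment they arrive instead of waiting for the response to finish.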

Layer 3: Inngest Worker with Resilience

// lib/inngest.ts
inngest.createFunction(
  { id: 'transcribe-chunk', priority: { run: 10 } },
  { event: 'video/transcribe-chunk' },
  async ({ event, step }) => {
    const { videoId, audioBuffer, startTime } = event.data;

    // Step 1: Try primary STT provider (Sarvam for Indian languages)
    let transcript = await step.run('sarvam-stt', async () => {
      try {
        return await callSarvamSTT(audioBuffer);
      } catch (err: any) {
        if (err.status === 429) throw err; // rate limit: rethrow so Inngest retries with backoff
        return null; // anything else: fall through to the fallback provider
      }
    });

    // Step 2: Fallback to Gemini if primary fails
    if (!transcript) {
      transcript = await step.run('gemini-stt', async () => {
        return await callGeminiSTT(audioBuffer);
      });
    }

    // Step 3: Store result (idempotent)
    await step.run('store-caption', async () => {
      await db.captions.upsert({
        videoId,
        startTime,
        text: transcript,
      });
    });

    // Step 4: Emit to client
    await step.run('emit-progress', async () => {
      await publishEvent(`transcribe:${videoId}`, {
        type: 'caption',
        text: transcript,
        startTime,
      });
    });
  }
);

Key Patterns

1. Retry Logic Without Infinite Loops

Inngest handles retries automatically, but we distinguish:

  • 429 (rate limit): Throw to trigger exponential backoff
  • Network error: Auto-retry (Inngest)
  • Invalid audio: Fail gracefully, skip chunk
if (err.message.includes('invalid audio format')) {
  await db.captions.create({ videoId, startTime, text: '[SILENT]' });
  return; // Don't retry
}

throw err; // Let Inngest retry
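The three branches above can be captured in one classifier function. `classifyTranscriptionError` is a hypothetical helper, and the error shape (a `status` code and a `message`) is an assumption about what the STT SDK throws:

```typescript
type ErrorAction = 'backoff' | 'retry' | 'skip';

// Decide how the worker should react to an STT failure.
// Hypothetical helper; assumes errors expose `status` and `message`.
function classifyTranscriptionError(err: { status?: number; message: string }): ErrorAction {
  if (err.status === 429) return 'backoff'; // rate limit: rethrow, let Inngest back off
  if (err.message.includes('invalid audio format')) return 'skip'; // bad chunk: store [SILENT], move on
  return 'retry'; // network/transient: rethrow for a plain retry
}
```

Centralizing the decision keeps the step bodies small and makes the retry policy itself unit-testable.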

2. Streaming Results Early

Don't wait for all chunks. Emit captions as they complete:

await publishEvent(`transcribe:${videoId}`, {
  type: 'progress',
  processed: Math.floor((startTime / duration) * 100),
});

3. Idempotent Storage

Always use upsert for captions, keyed by (videoId, startTime):

db.captions.upsert({
  videoId,
  startTime, // Natural composite key
  text: transcript,
});

This way, retries don't create duplicates.
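The effect of keying on (videoId, startTime) can be shown with an in-memory sketch — a stand-in for `db.captions.upsert`, not the real DB client:

```typescript
// In-memory stand-in for db.captions.upsert, keyed on (videoId, startTime).
const captions = new Map<string, { videoId: string; startTime: number; text: string }>();

function upsertCaption(row: { videoId: string; startTime: number; text: string }) {
  // Same composite key overwrites in place — never a duplicate row
  captions.set(`${row.videoId}:${row.startTime}`, row);
}

upsertCaption({ videoId: 'v1', startTime: 0, text: 'hello' });
upsertCaption({ videoId: 'v1', startTime: 0, text: 'hello' }); // retried chunk
```

After the retried write, the store still holds exactly one row for that chunk.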

4. Detect Language Dynamically

For multi-language content, detect before selecting STT:

const language = await step.run('detect-language', async () => {
  const sample = audioBuffer.slice(0, 10000); // first ~10 KB — actual duration depends on sample rate/encoding
  return await detectLanguage(sample);
});

if (language === 'hi') return callSarvamSTT(audioBuffer);
return callGeminiSTT(audioBuffer);
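The routing at the end can stand alone as a small function. `pickProvider` is a hypothetical name, and beyond the `'hi'` case shown above, the exact set of Indian-language codes routed to Sarvam is illustrative:

```typescript
type SttProvider = 'sarvam' | 'gemini';

// Route a detected language to an STT provider. Hypothetical helper;
// the language set ('hi', 'ta', 'te' → Sarvam) is illustrative.
function pickProvider(language: string): SttProvider {
  const sarvamLanguages = new Set(['hi', 'ta', 'te']);
  return sarvamLanguages.has(language) ? 'sarvam' : 'gemini';
}
```

Keeping the mapping in one place makes it trivial to add languages or swap the fallback later.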

Performance Tuning

  • Chunk size: 30s balances API limits vs. latency
  • Concurrency: Process up to 3 chunks in parallel, then rate-limit
  • Timeout: Set Inngest step timeout to 60s (STT can take 20-40s)
  • Storage: Cache transcript in Redis during processing, write to DB once
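The "up to 3 chunks in parallel" point can be sketched as a concurrency-limited map for the client-side uploads. `mapWithConcurrency` is a hypothetical helper, not part of Inngest — the worker side would use Inngest's own concurrency settings:

```typescript
// Run an async mapper over items with at most `limit` in flight at once.
// Hypothetical client-side helper; safe because JS is single-threaded,
// so the `next++` index handoff has no race between awaits.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  const workers = Array.from({ length: Math.min(limit, items.length) }, async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  });
  await Promise.all(workers);
  return results;
}
```

With `limit = 3`, a fourth chunk only starts uploading once one of the first three finishes, which keeps us under the STT provider's rate limits.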

Monitoring

Track:

  • Transcription latency per chunk (should be <2x audio duration)
  • Retry rate by provider (>5% means reconsider fallback)
  • Cost per minute of audio (to catch runaway bugs)
// `start` here is the wall-clock ms captured before the chunk began processing
posthog.capture('transcribe_chunk', {
  duration: 30,
  latency: Date.now() - start,
  provider: 'sarvam',
  success: true,
});

Summary

A robust transcription pipeline is:

  1. Chunked at the network boundary (30s chunks)
  2. Queued with automatic retries (Inngest)
  3. Streaming to avoid client timeouts
  4. Redundant with fallback providers
  5. Idempotent to survive failures gracefully

Once you have this foundation, adding features (speaker diarization, emotion detection, keyword highlighting) becomes straightforward — they're just additional steps in the pipeline.

Have you hit transcription limits? Drop a comment with your approach.
