nareshipme

Building a Robust Real-Time Transcription Pipeline in Next.js: STT, Streaming, and Error Recovery

Transcribing video and audio at scale is deceptively complex. You're not just calling an STT API and returning the result — you're managing file chunking, concurrent requests, partial failures, and real-time updates. Here's how we built a production-grade transcription pipeline in Next.js.

The Problem

When processing user-uploaded videos, we needed:

  • Streaming results as transcription completes (not waiting for the whole file)
  • Automatic retry on API failures without losing progress
  • Chunk-based processing to avoid timeout and memory limits
  • Multi-language support with fallback providers
  • Real-time UI updates as captions are generated

Architecture

We use a three-layer approach:

  1. Client: Stream audio blobs in 30-second chunks
  2. API Route: Queue chunks, handle retries, emit progress events
  3. Background Worker: Heavy lifting with Inngest (idempotent steps, automatic retries)

Layer 1: Client-Side Streaming

// Send 30-second chunks to the API
async function transcribeChunked(file: File, videoId: string) {
  const CHUNK_SIZE = 30 * 1000; // 30 seconds, in ms — must match getAudioDuration's unit
  const duration = await getAudioDuration(file);

  for (let start = 0; start < duration; start += CHUNK_SIZE) {
    const blob = await extractAudioChunk(file, start, start + CHUNK_SIZE);
    const formData = new FormData();
    formData.append('audio', blob);
    formData.append('startTime', start.toString());
    formData.append('videoId', videoId);

    const response = await fetch('/api/transcribe', {
      method: 'POST',
      body: formData,
    });

    // Stream NDJSON events back
    const reader = response.body!.getReader();
    const decoder = new TextDecoder();
    let buffer = ''; // holds a partial line between reads

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      // stream: true keeps multi-byte characters intact across reads
      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split('\n');
      buffer = lines.pop() ?? ''; // the last piece may be an incomplete line

      for (const line of lines.filter(Boolean)) {
        const event = JSON.parse(line);
        if (event.type === 'caption') {
          updateCaption(event.text, event.startTime);
        }
      }
    }
  }
}
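The chunking loop above can be factored into a pure helper that's easy to unit-test. `chunkRanges` is a hypothetical name (not part of our API), and it assumes duration and chunk size are expressed in the same unit:

```typescript
// Compute [start, end) ranges covering a media file of the given
// duration. The final chunk is clipped to the file's end.
// Hypothetical helper; units (ms vs. s) must match getAudioDuration.
function chunkRanges(
  duration: number,
  chunkSize: number
): Array<{ start: number; end: number }> {
  const ranges: Array<{ start: number; end: number }> = [];
  for (let start = 0; start < duration; start += chunkSize) {
    ranges.push({ start, end: Math.min(start + chunkSize, duration) });
  }
  return ranges;
}
```

For example, a 75-second file at 30-second chunks yields three ranges, with the last one only 15 seconds long.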

Layer 2: API Route with Queueing

// app/api/transcribe/route.ts
import { inngest } from '@/lib/inngest';

export async function POST(req: Request) {
  const formData = await req.formData();
  const blob = formData.get('audio') as Blob;
  const startTime = parseInt(formData.get('startTime') as string, 10);
  const videoId = formData.get('videoId') as string;

  // Trigger background job; send() resolves with the queued event IDs.
  // Note: event payloads are JSON-serialized, so for large chunks you'd
  // upload the audio to storage and pass a URL instead of raw bytes.
  const { ids: [runId] } = await inngest.send({
    name: 'video/transcribe-chunk',
    data: { videoId, audioBuffer: await blob.arrayBuffer(), startTime },
  });

  // Stream progress back to client as NDJSON
  return new Response(
    streamTranscriptionProgress(videoId, runId),
    { headers: { 'Content-Type': 'application/x-ndjson' } }
  );
}
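`streamTranscriptionProgress` isn't shown above, but its core is easy to sketch: turn an async iterable of progress events into the NDJSON byte stream the route returns. How events are sourced (e.g. a Redis subscription on `transcribe:${videoId}`) is an assumption; `ndjsonStream` is a hypothetical name:

```typescript
// Hypothetical sketch of the streaming half of streamTranscriptionProgress:
// encode each event as one JSON line and push it to the HTTP response body.
function ndjsonStream(events: AsyncIterable<unknown>): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  return new ReadableStream<Uint8Array>({
    async start(controller) {
      for await (const event of events) {
        // One complete JSON document per line — the client splits on '\n'
        controller.enqueue(encoder.encode(JSON.stringify(event) + '\n'));
      }
      controller.close();
    },
  });
}
```

Because each line is a complete JSON document, the client can parse events the moment they arrive instead of waiting for the response to finish.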

Layer 3: Inngest Worker with Resilience

// lib/inngest.ts
inngest.createFunction(
  { id: 'transcribe-chunk', priority: { run: 10 } },
  { event: 'video/transcribe-chunk' },
  async ({ event, step }) => {
    const { videoId, audioBuffer, startTime } = event.data;

    // Step 1: Try primary STT provider (Sarvam for Indian languages)
    let transcript = await step.run('sarvam-stt', async () => {
      try {
        return await callSarvamSTT(audioBuffer);
      } catch (err: any) {
        if (err.status === 429) throw err; // rate limit: rethrow so Inngest retries with backoff
        return null; // anything else: fall through to the fallback provider
      }
    });

    // Step 2: Fallback to Gemini if primary fails
    if (!transcript) {
      transcript = await step.run('gemini-stt', async () => {
        return await callGeminiSTT(audioBuffer);
      });
    }

    // Step 3: Store result (idempotent)
    await step.run('store-caption', async () => {
      await db.captions.upsert({
        videoId,
        startTime,
        text: transcript,
      });
    });

    // Step 4: Emit to client
    await step.run('emit-progress', async () => {
      await publishEvent(`transcribe:${videoId}`, {
        type: 'caption',
        text: transcript,
        startTime,
      });
    });
  }
);

Key Patterns

1. Retry Logic Without Infinite Loops

Inngest handles retries automatically, but we distinguish:

  • 429 (rate limit): Throw to trigger exponential backoff
  • Network error: Auto-retry (Inngest)
  • Invalid audio: Fail gracefully, skip chunk
if (err.message.includes('invalid audio format')) {
  await db.captions.create({ videoId, startTime, text: '[SILENT]' });
  return; // Don't retry
}

throw err; // Let Inngest retry
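The three branches above can be captured in one classifier function. `classifyTranscriptionError` is a hypothetical helper, and the error shape (a `status` code and a `message`) is an assumption about what the STT SDK throws:

```typescript
type ErrorAction = 'backoff' | 'retry' | 'skip';

// Decide how the worker should react to an STT failure.
// Hypothetical helper; assumes errors expose `status` and `message`.
function classifyTranscriptionError(err: { status?: number; message: string }): ErrorAction {
  if (err.status === 429) return 'backoff'; // rate limit: rethrow, let Inngest back off
  if (err.message.includes('invalid audio format')) return 'skip'; // bad chunk: store [SILENT], move on
  return 'retry'; // network/transient: rethrow for a plain retry
}
```

Centralizing the decision keeps the step bodies small and makes the retry policy itself unit-testable.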

2. Streaming Results Early

Don't wait for all chunks. Emit captions as they complete:

await publishEvent(`transcribe:${videoId}`, {
  type: 'progress',
  processed: Math.floor((startTime / duration) * 100),
});

3. Idempotent Storage

Always use upsert for captions, keyed by (videoId, startTime):

db.captions.upsert({
  videoId,
  startTime, // Natural composite key
  text: transcript,
});

This way, retries don't create duplicates.
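The effect of keying on (videoId, startTime) can be shown with an in-memory sketch — a stand-in for `db.captions.upsert`, not the real DB client:

```typescript
// In-memory stand-in for db.captions.upsert, keyed on (videoId, startTime).
const captions = new Map<string, { videoId: string; startTime: number; text: string }>();

function upsertCaption(row: { videoId: string; startTime: number; text: string }) {
  // Same composite key overwrites in place — never a duplicate row
  captions.set(`${row.videoId}:${row.startTime}`, row);
}

upsertCaption({ videoId: 'v1', startTime: 0, text: 'hello' });
upsertCaption({ videoId: 'v1', startTime: 0, text: 'hello' }); // retried chunk
```

After the retried write, the store still holds exactly one row for that chunk.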

4. Detect Language Dynamically

For multi-language content, detect before selecting STT:

const language = await step.run('detect-language', async () => {
  const sample = audioBuffer.slice(0, 10000); // first ~10 KB — actual duration depends on sample rate/encoding
  return await detectLanguage(sample);
});

if (language === 'hi') return callSarvamSTT(audioBuffer);
return callGeminiSTT(audioBuffer);
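The routing at the end can stand alone as a small function. `pickProvider` is a hypothetical name, and beyond the `'hi'` case shown above, the exact set of Indian-language codes routed to Sarvam is illustrative:

```typescript
type SttProvider = 'sarvam' | 'gemini';

// Route a detected language to an STT provider. Hypothetical helper;
// the language set ('hi', 'ta', 'te' → Sarvam) is illustrative.
function pickProvider(language: string): SttProvider {
  const sarvamLanguages = new Set(['hi', 'ta', 'te']);
  return sarvamLanguages.has(language) ? 'sarvam' : 'gemini';
}
```

Keeping the mapping in one place makes it trivial to add languages or swap the fallback later.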

Performance Tuning

  • Chunk size: 30s balances API limits vs. latency
  • Concurrency: Process up to 3 chunks in parallel, then rate-limit
  • Timeout: Set Inngest step timeout to 60s (STT can take 20-40s)
  • Storage: Cache transcript in Redis during processing, write to DB once
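The "up to 3 chunks in parallel" point can be sketched as a concurrency-limited map for the client-side uploads. `mapWithConcurrency` is a hypothetical helper, not part of Inngest — the worker side would use Inngest's own concurrency settings:

```typescript
// Run an async mapper over items with at most `limit` in flight at once.
// Hypothetical client-side helper; safe because JS is single-threaded,
// so the `next++` index handoff has no race between awaits.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  const workers = Array.from({ length: Math.min(limit, items.length) }, async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  });
  await Promise.all(workers);
  return results;
}
```

With `limit = 3`, a fourth chunk only starts uploading once one of the first three finishes, which keeps us under the STT provider's rate limits.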

Monitoring

Track:

  • Transcription latency per chunk (should be <2x audio duration)
  • Retry rate by provider (>5% means reconsider fallback)
  • Cost per minute of audio (to catch runaway bugs)
// `start` here is the wall-clock ms captured before the chunk began processing
posthog.capture('transcribe_chunk', {
  duration: 30,
  latency: Date.now() - start,
  provider: 'sarvam',
  success: true,
});

Summary

A robust transcription pipeline is:

  1. Chunked at the network boundary (30s chunks)
  2. Queued with automatic retries (Inngest)
  3. Streaming to avoid client timeouts
  4. Redundant with fallback providers
  5. Idempotent to survive failures gracefully

Once you have this foundation, adding features (speaker diarization, emotion detection, keyword highlighting) becomes straightforward — they're just additional steps in the pipeline.

Have you hit transcription limits? Drop a comment with your approach.
