Building a Robust Real-Time Transcription Pipeline in Next.js: STT, Streaming, and Error Recovery
Transcribing video and audio at scale is deceptively complex. You're not just calling an STT API and returning the result — you're managing file chunking, concurrent requests, partial failures, and real-time updates. Here's how we built a production-grade transcription pipeline in Next.js.
The Problem
When processing user-uploaded videos, we needed:
- Streaming results as transcription completes (not waiting for the whole file)
- Automatic retry on API failures without losing progress
- Chunk-based processing to avoid timeout and memory limits
- Multi-language support with fallback providers
- Real-time UI updates as captions are generated
Architecture
We use a three-layer approach:
- Client: Stream audio blobs in 30-second chunks
- API Route: Queue chunks, handle retries, emit progress events
- Background Worker: Heavy lifting with Inngest (idempotent steps, automatic retries)
Layer 1: Client-Side Streaming
```typescript
// Send 30-second chunks to the API
async function transcribeChunked(file: File, videoId: string) {
  const CHUNK_MS = 30_000; // 30 seconds, in milliseconds
  const duration = await getAudioDuration(file); // also in milliseconds

  for (let start = 0; start < duration; start += CHUNK_MS) {
    const end = Math.min(start + CHUNK_MS, duration);
    const blob = await extractAudioChunk(file, start, end);

    const formData = new FormData();
    formData.append('audio', blob);
    formData.append('startTime', start.toString());
    formData.append('videoId', videoId);

    const response = await fetch('/api/transcribe', {
      method: 'POST',
      body: formData,
    });

    // Stream NDJSON events back. A read() can end mid-line, so buffer
    // and only parse up to the last complete newline.
    const reader = response.body!.getReader();
    const decoder = new TextDecoder();
    let buffer = '';

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      buffer += decoder.decode(value, { stream: true });

      const lines = buffer.split('\n');
      buffer = lines.pop() ?? ''; // keep the trailing partial line

      for (const line of lines.filter(Boolean)) {
        const event = JSON.parse(line);
        if (event.type === 'caption') {
          updateCaption(event.text, event.startTime);
        }
      }
    }
  }
}
```
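The boundary math in that loop is easy to get subtly wrong (the last chunk is usually shorter than 30 seconds). It can be isolated into a small pure helper; `chunkRanges` is a hypothetical name, not part of the pipeline above:

```typescript
// Compute [start, end) ranges in milliseconds for fixed-size chunks,
// clamping the final chunk to the media duration.
function chunkRanges(durationMs: number, chunkMs = 30_000): Array<[number, number]> {
  const ranges: Array<[number, number]> = [];
  for (let start = 0; start < durationMs; start += chunkMs) {
    ranges.push([start, Math.min(start + chunkMs, durationMs)]);
  }
  return ranges;
}
```

A 75-second file, for example, yields three chunks, with the last one clamped to 15 seconds.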
Layer 2: API Route with Queueing
```typescript
// app/api/transcribe/route.ts
import { inngest } from '@/lib/inngest';

export async function POST(req: Request) {
  const formData = await req.formData();
  const blob = formData.get('audio') as Blob;
  const startTime = parseInt(formData.get('startTime') as string, 10);
  const videoId = formData.get('videoId') as string;

  // Trigger the background job. Note: inngest.send() returns event IDs,
  // not a run ID, so we correlate progress on the event ID. Also, event
  // payloads are JSON — in production you'd typically upload the chunk to
  // object storage and pass a URL rather than a raw buffer.
  const { ids } = await inngest.send({
    name: 'video/transcribe-chunk',
    data: { videoId, audioBuffer: await blob.arrayBuffer(), startTime },
  });

  // Stream progress back to the client as NDJSON
  return new Response(
    streamTranscriptionProgress(videoId, ids[0]),
    { headers: { 'Content-Type': 'application/x-ndjson' } }
  );
}
```
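`streamTranscriptionProgress` is left undefined above. One way to sketch it is as a generic NDJSON wrapper around an async iterable of pub/sub events — the event source is injected as a parameter here (an assumption about how the subscription side looks), which also keeps the wrapper testable:

```typescript
// Wrap an async iterable of events in a ReadableStream that emits
// one JSON object per line (NDJSON), matching the client's line parser.
function ndjsonStream(events: AsyncIterable<unknown>): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  return new ReadableStream({
    async start(controller) {
      for await (const event of events) {
        controller.enqueue(encoder.encode(JSON.stringify(event) + '\n'));
      }
      controller.close();
    },
  });
}
```

The real function would pass in something like a subscription to the `transcribe:${videoId}` channel and close the stream once the chunk's final event arrives.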
Layer 3: Inngest Worker with Resilience
```typescript
// lib/inngest.ts
export const transcribeChunk = inngest.createFunction(
  { id: 'transcribe-chunk', priority: { run: 10 } },
  { event: 'video/transcribe-chunk' },
  async ({ event, step }) => {
    const { videoId, audioBuffer, startTime } = event.data;

    // Step 1: Try the primary STT provider (Sarvam for Indian languages)
    let transcript = await step.run('sarvam-stt', async () => {
      try {
        return await callSarvamSTT(audioBuffer);
      } catch (err: any) {
        if (err.status === 429) throw err; // rate limit: let Inngest retry
        return null; // anything else: fall through to the fallback provider
      }
    });

    // Step 2: Fall back to Gemini if the primary provider failed
    if (!transcript) {
      transcript = await step.run('gemini-stt', async () => {
        return await callGeminiSTT(audioBuffer);
      });
    }

    // Step 3: Store the result (idempotent upsert on videoId + startTime)
    await step.run('store-caption', async () => {
      await db.captions.upsert({ videoId, startTime, text: transcript });
    });

    // Step 4: Emit the caption to the client
    await step.run('emit-progress', async () => {
      await publishEvent(`transcribe:${videoId}`, {
        type: 'caption',
        text: transcript,
        startTime,
      });
    });
  }
);
```
Key Patterns
1. Retry Logic Without Infinite Loops
Inngest handles retries automatically, but we distinguish:
- 429 (rate limit): Throw to trigger exponential backoff
- Network error: Auto-retry (Inngest)
- Invalid audio: Fail gracefully, skip chunk
```typescript
if (err.message.includes('invalid audio format')) {
  await db.captions.create({ videoId, startTime, text: '[SILENT]' });
  return; // don't retry
}
throw err; // let Inngest retry
```
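This branching can be centralized in a small classifier so every step treats errors consistently. The error shapes below (a `status` code and a message substring) are assumptions about what the STT SDKs throw, not a documented contract:

```typescript
type ErrorAction = 'retry' | 'skip' | 'fallback';

// Map a provider error to a pipeline action.
function classifySttError(err: { status?: number; message: string }): ErrorAction {
  if (err.status === 429) return 'retry';                           // rate limit: back off and retry
  if (err.message.includes('invalid audio format')) return 'skip';  // bad input: never retry
  if (err.status !== undefined && err.status >= 500) return 'retry'; // provider outage: retry
  return 'fallback';                                                 // anything else: next provider
}
```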
2. Streaming Results Early
Don't wait for all chunks. Emit each caption as soon as it completes, plus a progress event so the UI can show percent done:

```typescript
await publishEvent(`transcribe:${videoId}`, {
  type: 'progress',
  processed: Math.floor((startTime / duration) * 100),
});
```
3. Idempotent Storage
Always use upsert for captions, keyed by (videoId, startTime):
```typescript
db.captions.upsert({
  videoId,
  startTime, // natural composite key
  text: transcript,
});
```
This way, retries don't create duplicates.
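The guarantee reduces to "last write wins on (videoId, startTime)". An in-memory model of that behavior (a sketch for illustration, not the real database layer) makes the idempotency property easy to see and test:

```typescript
// In-memory model of idempotent caption storage: the composite key
// means a retried write overwrites its earlier attempt instead of
// inserting a duplicate row.
class CaptionStore {
  private rows = new Map<string, { videoId: string; startTime: number; text: string }>();

  upsert(row: { videoId: string; startTime: number; text: string }): void {
    this.rows.set(`${row.videoId}:${row.startTime}`, row);
  }

  count(): number {
    return this.rows.size;
  }

  get(videoId: string, startTime: number) {
    return this.rows.get(`${videoId}:${startTime}`);
  }
}
```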
4. Detect Language Dynamically
For multi-language content, detect before selecting STT:
```typescript
const language = await step.run('detect-language', async () => {
  const sample = audioBuffer.slice(0, 10_000); // first ~10 KB is enough for language ID
  return await detectLanguage(sample);
});

if (language === 'hi') return callSarvamSTT(audioBuffer);
return callGeminiSTT(audioBuffer);
```
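As more languages are added, routing is cleaner as a table than a chain of ifs. The language-to-provider mapping below is illustrative, not an exhaustive list of what each provider supports:

```typescript
type Provider = 'sarvam' | 'gemini';

// Illustrative routing table: Sarvam for Indian languages, Gemini otherwise.
const PROVIDER_BY_LANGUAGE: Record<string, Provider> = {
  hi: 'sarvam',
  ta: 'sarvam',
  te: 'sarvam',
  bn: 'sarvam',
};

function pickProvider(language: string): Provider {
  return PROVIDER_BY_LANGUAGE[language] ?? 'gemini';
}
```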
Performance Tuning
- Chunk size: 30s balances API limits vs. latency
- Concurrency: Process up to 3 chunks in parallel, then rate-limit
- Timeout: Set Inngest step timeout to 60s (STT can take 20-40s)
- Storage: Cache transcript in Redis during processing, write to DB once
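The "up to 3 chunks in parallel" rule can be enforced with a small worker-pool helper. This is a sketch; in production a library like p-limit does the same job:

```typescript
// Run async tasks over `items` with at most `limit` in flight at once.
// Workers pull the next index cooperatively; results keep input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // safe: no await between the check and the increment
      results[i] = await fn(items[i]);
    }
  }

  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

With `limit = 3`, chunk four stays queued until one of the first three transcriptions resolves.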
Monitoring
Track:
- Transcription latency per chunk (should be <2x audio duration)
- Retry rate by provider (>5% means reconsider fallback)
- Cost per minute of audio (to catch runaway bugs)
```typescript
posthog.capture('transcribe_chunk', {
  duration: 30,
  latency: Date.now() - start,
  provider: 'sarvam',
  success: true,
});
```
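Cost per minute of audio is simple arithmetic, but tracking it per event is what catches runaway bugs (a retry loop hammering the paid provider shows up immediately). The rates below are made-up placeholders, not real provider prices:

```typescript
// Placeholder per-minute rates — substitute your providers' actual pricing.
const RATE_PER_MINUTE: Record<string, number> = { sarvam: 0.01, gemini: 0.02 };

// Cost of transcribing one chunk with the given provider.
function chunkCost(provider: string, chunkSeconds: number): number {
  const rate = RATE_PER_MINUTE[provider] ?? 0;
  return (chunkSeconds / 60) * rate;
}
```

Attach `chunkCost(provider, 30)` to the analytics event above and alert when the daily sum diverges from the minutes of audio actually ingested.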
Summary
A robust transcription pipeline is:
- Chunked at the network boundary (30s chunks)
- Queued with automatic retries (Inngest)
- Streaming to avoid client timeouts
- Redundant with fallback providers
- Idempotent to survive failures gracefully
Once you have this foundation, adding features (speaker diarization, emotion detection, keyword highlighting) becomes straightforward — they're just additional steps in the pipeline.
Have you hit transcription limits? Drop a comment with your approach.