Building ClipSpeedAI meant going from "I can process one video" to "I can process hundreds of videos reliably without breaking anything." That transition is where most of the real engineering work happens. This post is a direct account of the lessons learned — what broke, what we changed, and what the architecture looks like now.
Lesson 1: FFmpeg Is Not Stateless
The first version of the pipeline ran FFmpeg jobs inline in API request handlers. It worked for demos. It failed immediately under real load because:
- FFmpeg is CPU-bound. Two concurrent jobs on a 2-core server saturate both cores (200% CPU in top's terms), leaving nothing for incoming requests — the server becomes unresponsive.
- FFmpeg can produce partial output files if killed mid-run (process crash, timeout). Partial MP4s are corrupt and unplayable.
- There's no backpressure — the API accepts requests faster than FFmpeg can process them.
Fix: Move all FFmpeg work to background workers. Use a job queue (BullMQ + Redis) with concurrency=2 on encode workers. The API just enqueues jobs and returns a job ID immediately.
// Before (bad)
app.post('/process', async (req, res) => {
  const clip = await processVideo(req.body.url); // blocks for 60s
  res.json({ clip });
});

// After (correct)
app.post('/process', async (req, res) => {
  const job = await encodeQueue.add('encode', req.body);
  res.json({ jobId: job.id }); // returns immediately
});
Lesson 2: Temp Files Accumulate Silently
Video files are large. A 1080p 30-minute YouTube video is 2-4GB. During processing, you might have: the original download, an extracted segment, a face-detected version, a captioned version, and an output clip — all on disk simultaneously. That's 10-15GB for one job.
With 10 concurrent jobs, you'll fill a 100GB disk in an hour and bring the server down.
Fix: A cron job that deletes temp files older than 30 minutes:
// cleanup.js
import fs from 'fs/promises';
import path from 'path';

export async function cleanTmpFiles(directory = '/tmp/clips', maxAgeMs = 30 * 60 * 1000) {
  const now = Date.now();
  const entries = await fs.readdir(directory, { withFileTypes: true });
  for (const entry of entries) {
    if (!entry.isFile()) continue;
    const filePath = path.join(directory, entry.name);
    const stat = await fs.stat(filePath);
    if (now - stat.mtimeMs > maxAgeMs) {
      await fs.unlink(filePath);
    }
  }
}
Run this every 15 minutes via setInterval or a cron job. Don't rely on inline cleanup after jobs — retry logic means a job might re-use a file, and a crash can skip cleanup entirely.
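A minimal setInterval scheduling sketch, with a flag so a slow run doesn't overlap the next tick (pass in `cleanTmpFiles` from above; the function name here is otherwise an assumption):

```javascript
// Run a cleanup function every intervalMs, skipping a tick if the
// previous run is still in flight.
const FIFTEEN_MINUTES = 15 * 60 * 1000;
let cleaning = false;

export function startCleanupLoop(cleanFn, intervalMs = FIFTEEN_MINUTES) {
  const timer = setInterval(async () => {
    if (cleaning) return; // don't stack overlapping runs
    cleaning = true;
    try {
      await cleanFn();
    } catch (err) {
      console.error('tmp cleanup failed', err);
    } finally {
      cleaning = false;
    }
  }, intervalMs);
  timer.unref?.(); // don't keep the process alive just for cleanup
  return timer;
}
```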
Lesson 3: Whisper API Has Rate Limits
The Whisper API has per-minute and per-day rate limits. Early on, batching several jobs simultaneously meant multiple Whisper calls firing at once, hitting the rate limit, and causing a cascade of failures.
Fix: Rate-limit the transcription worker:
import { Worker } from 'bullmq';

const transcribeWorker = new Worker('video:transcribe', transcribeProcessor, {
  connection: redisConnection,
  concurrency: 2,
  limiter: {
    max: 5,          // 5 transcription jobs
    duration: 60_000 // per minute
  }
});
Also: cache transcription results. If two users process the same YouTube video, the audio is identical — transcribe once and cache the result keyed by video ID.
const CACHE_TTL = 7 * 24 * 60 * 60; // 7 days in seconds

// redis: a connected client (e.g. ioredis); setex takes the TTL in seconds
export async function transcribeWithCache(videoId, audioPath) {
  const cacheKey = `transcript:${videoId}`;
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);
  const result = await transcribeAudio(audioPath);
  await redis.setex(cacheKey, CACHE_TTL, JSON.stringify(result));
  return result;
}
Lesson 4: Python Subprocess Crashes Are Silent
When a Python subprocess fails with a segfault or pthread_create error (common with MediaPipe in constrained containers), it dies from SIGSEGV (signal 11) — surfaced as return code -11 in Python's subprocess convention, exit status 139 in a shell, or a null exit code plus a signal name in Node.js. If your Node.js wrapper just checks for exit code 0, you'll get a generic "command failed" error with no useful information.
Fix: Capture and log stderr explicitly:
import { execa } from 'execa';

const { exitCode, stdout, stderr, signal } = await execa('python3', [...args], {
  reject: false
});
if (exitCode !== 0) {
  logger.error('Python subprocess error', {
    exitCode,
    signal, // e.g. 'SIGSEGV' when the process was killed
    stderr: stderr.slice(-2000) // last 2000 chars (tracebacks can be long)
  });
}
And always have a fallback for ML failures:
let cropX;
try {
  const tracking = await runFaceDetection(videoPath);
  cropX = computeCropX(tracking);
} catch (err) {
  logger.warn('Face detection failed, using center crop', { err: err.message });
  cropX = Math.floor((sourceWidth - cropWidth) / 2);
}
Lesson 5: GPT-4o Clip Scoring Is Expensive If Misused
Initially, we scored every possible 60-second window in a video. For a 30-minute video, that's ~30 windows × ~800 tokens each × GPT-4o pricing. The cost adds up fast.
Fix: Use YouTube's auto-captions (yt-dlp --write-auto-subs) for a free first pass. Filter out segments where the transcript density is too low (music, silence, ambient sound). Only send the remaining top candidates to GPT-4o for scoring.
function filterCandidateSegments(segments) {
  return segments.filter(seg => {
    const duration = seg.end - seg.start;
    if (duration <= 0) return false; // guard against zero-length segments
    const wordCount = seg.text.split(/\s+/).length;
    const wordsPerSecond = wordCount / duration;
    // Filter out music/silence (< 1 word/sec) and very fast speech (> 5 words/sec, often poor quality)
    return wordsPerSecond >= 1 && wordsPerSecond <= 5;
  });
}
This cut GPT-4o API spend by ~60% without meaningfully affecting clip quality.
Lesson 6: User-Facing Progress Updates Need WebSockets
Polling /status/:jobId every 2 seconds from the client is wasteful and creates unnecessary load. WebSockets push progress updates in real-time with no polling overhead.
// In the worker
await job.updateProgress({ stage: 'transcribing', pct: 30 });

// In the API, relay to WebSocket
queueEvents.on('progress', ({ jobId, data }) => {
  const ws = activeConnections.get(jobId);
  if (ws) ws.send(JSON.stringify({ type: 'progress', ...data }));
});
What the Architecture Looks Like Now
API (Express) → BullMQ Queue (Redis) → Workers (Node.js)
  ├── Download worker (concurrency: 6)
  ├── Transcribe worker (concurrency: 2, rate-limited)
  ├── Score worker (concurrency: 3)
  └── Encode worker (concurrency: 2)
        └── Python subprocess (face detection)
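The stages hand off to each other: when one worker completes, it enqueues the next stage for the same job. A minimal sketch of that handoff, assuming queue names that mirror the diagram (only `video:transcribe` appears in the code above; the rest are illustrative):

```javascript
// Pipeline order; each completed stage enqueues the next.
const PIPELINE = ['video:download', 'video:transcribe', 'video:score', 'video:encode'];

// Returns the queue name of the next stage, or null for the final
// stage or an unknown queue name.
export function nextStage(current) {
  const i = PIPELINE.indexOf(current);
  if (i === -1 || i === PIPELINE.length - 1) return null;
  return PIPELINE[i + 1];
}

// In a BullMQ completion handler (sketch; `queues` maps names to Queue instances):
// worker.on('completed', async (job) => {
//   const next = nextStage(job.queueName);
//   if (next) await queues[next].add(next, job.returnvalue);
// });
```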
Everything async, everything retryable, everything monitored.
ClipSpeedAI runs on this architecture today. The system processes videos from submission to downloadable clips without manual intervention. The lessons above represent the gap between a working prototype and a production service.
The single most impactful change was the first one: moving FFmpeg to background workers. Everything else is optimization and hardening. If you want to skip the hard-won lessons and go straight to working clips, ClipSpeedAI has the full pipeline running in production today.