
Kyle White

The Architecture Behind an AI Video Processing Pipeline

Building a video processing service that handles everything from YouTube download to AI-scored, captioned, face-tracked vertical clips involves a lot of moving parts. This post is a straight-up architecture breakdown — the components, how they talk to each other, and the design decisions that actually matter at scale.

This is the architecture running ClipSpeedAI.

System Overview

At the highest level, the pipeline is:

User submits YouTube URL
    → Download job queued
    → Video downloaded (yt-dlp)
    → Audio extracted (FFmpeg)
    → Transcription (Whisper API)
    → Clip scoring (GPT-4o)
    → Face detection (MediaPipe, Python child process)
    → Clip extraction + crop (FFmpeg)
    → Caption generation + burn (Whisper segments → FFmpeg drawtext)
    → Output clips uploaded to storage
    → User notified via WebSocket

Each stage is a discrete job in a BullMQ queue. Stages are not chained synchronously; each one emits a completion event that triggers the next, so any stage can fail and retry independently.
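
The event-driven chaining can be sketched as a small helper. This is illustrative, not the production code: `chainStage`, `buildJob`, and the wiring in the comment are assumptions, and it relies on BullMQ delivering a completed job's return value as a JSON string.

```javascript
// chain.js — subscribe to one stage's completion and enqueue the next stage.
// `queueEvents` is a BullMQ QueueEvents instance for the upstream queue,
// `nextQueue` is the downstream BullMQ Queue; `buildJob` maps the finished
// job's parsed return value to the next job's data.
export function chainStage(queueEvents, nextQueue, jobName, buildJob) {
  queueEvents.on('completed', async ({ jobId, returnvalue }) => {
    // BullMQ hands the worker's return value over as a JSON string
    const result = JSON.parse(returnvalue);
    await nextQueue.add(jobName, buildJob(jobId, result));
  });
}

// Hypothetical wiring: a finished download kicks off transcription.
// chainStage(
//   new QueueEvents('video-download', { connection: redisConnection }),
//   transcribeQueue,
//   'transcribe-video',
//   (jobId, r) => ({ parentJobId: jobId, videoPath: r.videoPath })
// );
```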

Component Map

┌─────────────────────────────────────┐
│           Node.js API               │
│  Express + Bull Queue + WebSocket   │
└──────────┬──────────────────────────┘
           │
    ┌──────▼───────┐       ┌───────────────────┐
    │  Bull Queue  │       │  Redis (BullMQ)   │
    │  Workers x4  │◄─────►│  Job state store  │
    └──────┬───────┘       └───────────────────┘
           │
    ┌──────▼──────────────────────────────────┐
    │           Worker Processes              │
    │  download | transcribe | score | encode │
    └──────┬──────────────────────────────────┘
           │
    ┌──────▼───────┐       ┌──────────────────────┐
    │  Python IPC  │       │  External APIs       │
    │  (face det.) │       │  OpenAI, Whisper     │
    └──────┬───────┘       └──────────────────────┘
           │
    ┌──────▼──────────────────────┐
    │   Object Storage (S3/R2)    │
    │   Input videos + Output     │
    └─────────────────────────────┘

The Queue Structure

Four separate BullMQ queues, not one monolithic queue:

// queues.js
import { Queue } from 'bullmq';
import { redisConnection } from './redis.js';

export const downloadQueue = new Queue('video-download', { connection: redisConnection });
export const transcribeQueue = new Queue('video-transcribe', { connection: redisConnection });
export const scoreQueue = new Queue('clip-score', { connection: redisConnection });
export const encodeQueue = new Queue('clip-encode', { connection: redisConnection });

Why separate queues? Different concurrency limits. Download jobs are I/O-bound and can run 8 concurrent. Encode jobs are CPU-bound — max 2 concurrent on a 4-core server, or you thrash. Transcription jobs are API-rate-limited and run at 4 concurrent.

// workers.js
import { Worker } from 'bullmq';
import { redisConnection } from './redis.js';
// Processor implementations live elsewhere (not shown)
import { downloadProcessor, encodeProcessor } from './processors.js';

const downloadWorker = new Worker('video-download', downloadProcessor, {
  connection: redisConnection,
  concurrency: 8  // I/O-bound, safe to run wide
});

const encodeWorker = new Worker('clip-encode', encodeProcessor, {
  connection: redisConnection,
  concurrency: 2  // CPU-bound, keep low
});

Job State and Progress

Job state lives in Redis via BullMQ. Each job carries its full context as the job data:

await encodeQueue.add('encode-clip', {
  jobId: uuid,
  userId: user.id,
  videoPath: '/tmp/videos/abc123.mp4',
  segments: [
    { start: 45.2, end: 89.7, cropX: 710, score: 8.4 }
  ],
  options: {
    addCaptions: true,
    outputFormat: 'vertical_1080'
  }
}, {
  attempts: 3,
  backoff: { type: 'exponential', delay: 5000 },
  removeOnComplete: { count: 100 },
  removeOnFail: { count: 50 }
});

Progress updates flow through BullMQ's progress API back to WebSocket listeners:

// Inside a worker processor
async function encodeProcessor(job) {
  await job.updateProgress(10);  // downloaded
  // ... encode ...
  await job.updateProgress(70);  // encoded
  // ... upload ...
  await job.updateProgress(100); // done
}
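
On the receiving end, a QueueEvents listener can pick those updates up and push them out. A minimal sketch, assuming `sockets` is a Map of live WebSocket connections; the message shape and the broadcast-to-everyone policy are illustrative, not the production code:

```javascript
// progress_bridge.js — forward BullMQ progress events to WebSocket clients.
// `queueEvents` is a BullMQ QueueEvents instance for the queue being watched.
export function progressMessage(jobId, progress) {
  return JSON.stringify({ type: 'job_progress', jobId, progress });
}

export function bridgeProgress(queueEvents, sockets) {
  // `data` is whatever the worker passed to job.updateProgress()
  queueEvents.on('progress', ({ jobId, data }) => {
    const payload = progressMessage(jobId, data);
    for (const ws of sockets.values()) ws.send(payload);
  });
}
```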

Python IPC Pattern

The face detection step runs as a Python child process. The IPC contract is simple: Node passes file paths as CLI arguments, Python writes JSON to a temp file, Node reads it back.

// python_runner.js
import { execa } from 'execa';
import { randomUUID } from 'crypto';
import fs from 'fs/promises';
import path from 'path';

export async function runPythonDetector(videoPath) {
  const outputPath = path.join('/tmp', `detections_${randomUUID()}.json`);

  await execa('python3', ['./python/face_detector.py',
    '--input', videoPath,
    '--output', outputPath
  ], { timeout: 90_000 });

  const data = JSON.parse(await fs.readFile(outputPath, 'utf8'));
  await fs.unlink(outputPath);
  return data;
}

File-based IPC is less elegant than stdout/stdin, but it's more reliable for large payloads (thousands of frame detection results) and easier to debug — you can cat the output file if something goes wrong.

Storage Architecture

All intermediate files use /tmp on the server. Final output clips go to Cloudflare R2 (S3-compatible). The pattern:

// storage.js
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
import fs from 'fs';

const r2 = new S3Client({
  region: 'auto',
  endpoint: process.env.R2_ENDPOINT,
  credentials: {
    accessKeyId: process.env.R2_ACCESS_KEY,
    secretAccessKey: process.env.R2_SECRET_KEY
  }
});

export async function uploadClip(localPath, key) {
  await r2.send(new PutObjectCommand({
    Bucket: process.env.R2_BUCKET,
    Key: key,
    Body: fs.createReadStream(localPath),
    ContentType: 'video/mp4'
  }));

  return `${process.env.R2_PUBLIC_URL}/${key}`;
}

Temp files are cleaned up by a cron job every 30 minutes rather than inline after each job. This prevents race conditions when jobs retry.
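The sweep itself is just an mtime comparison against a cutoff. A minimal sketch of what that cron job might run; `cleanTempDir` and the injected `now` parameter are illustrative:

```javascript
// cleanup.js — remove stale temp files; intended to run from a scheduler.
import fs from 'fs/promises';
import path from 'path';

const MAX_AGE_MS = 30 * 60 * 1000; // 30 minutes, matching the cron interval

export async function cleanTempDir(dir, maxAgeMs = MAX_AGE_MS, now = Date.now()) {
  const removed = [];
  for (const name of await fs.readdir(dir)) {
    const filePath = path.join(dir, name);
    const stats = await fs.stat(filePath);
    // Skip anything recent enough to belong to an in-flight or retrying job
    if (!stats.isFile() || now - stats.mtimeMs < maxAgeMs) continue;
    await fs.unlink(filePath);
    removed.push(filePath);
  }
  return removed;
}
```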

Error Handling Philosophy

The pipeline is designed around "degrade gracefully, never fail silently":

  1. Face detection fails → fall back to center crop
  2. GPT-4o scoring fails → fall back to heuristic scoring (longest speech segments)
  3. Caption generation fails → output clip without captions
  4. Upload fails → retry 3 times with exponential backoff, then alert

No stage should cause the entire job to fail unless it's the core encode step itself.
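Fallback #1 can be sketched as a small wrapper: try the detector, and on any failure return the centered crop offset. `pickCropX`, the injected `detectFaces` callback, and the `{ x, width }` detection shape are assumptions for illustration, not the production API:

```javascript
// crop_fallback.js — face-centered crop with a center-crop fallback.
// `detectFaces` is any async detector (e.g. a wrapper around the Python
// child process); injecting it keeps the fallback path easy to test.
export async function pickCropX(detectFaces, videoWidth, cropWidth) {
  const centerX = Math.round((videoWidth - cropWidth) / 2);
  try {
    const faces = await detectFaces();
    if (!faces || faces.length === 0) return centerX;
    // Center the crop window on the first (dominant) face, clamped to frame
    const faceCenter = faces[0].x + faces[0].width / 2;
    const x = Math.round(faceCenter - cropWidth / 2);
    return Math.min(Math.max(x, 0), videoWidth - cropWidth);
  } catch {
    // Degrade gracefully: detector crashed or timed out → center crop
    return centerX;
  }
}
```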

Observability

Every worker emits structured logs:

logger.info('encode_complete', {
  jobId: job.id,
  userId: job.data.userId,
  duration_ms: Date.now() - startTime,
  clip_count: outputClips.length,
  total_size_bytes: totalSize
});

These feed into a Datadog dashboard. The three metrics that matter most: job queue depth, p95 encode time, and failure rate by stage.
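Queue depth, for instance, can be read straight off BullMQ's `getJobCounts()`. A sketch: the `statsd` client and metric names here are placeholders for whatever your Datadog agent expects, not the real instrumentation code:

```javascript
// metrics.js — periodic queue-depth gauge for a BullMQ queue.
export function queueDepth(counts) {
  // Depth = everything not yet finished: waiting + active + delayed
  return (counts.waiting ?? 0) + (counts.active ?? 0) + (counts.delayed ?? 0);
}

export function pollQueueDepth(queue, statsd, intervalMs = 15_000) {
  return setInterval(async () => {
    const counts = await queue.getJobCounts('waiting', 'active', 'delayed', 'failed');
    statsd.gauge(`queue.depth.${queue.name}`, queueDepth(counts));
    statsd.gauge(`queue.failed.${queue.name}`, counts.failed ?? 0);
  }, intervalMs);
}
```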

ClipSpeedAI has processed hundreds of videos on this architecture. The key insight is that each stage should be independently retryable and independently scalable. That's what makes it robust without being complicated.

For the WebSocket notification layer and how clients receive real-time progress updates, see the dedicated article on WebSocket integration in this series. The complete pipeline described here is available as a hosted tool at ClipSpeedAI for teams who want the output without building the infrastructure.
