Building an AI Video Clipping Pipeline: Architecture, Tradeoffs, and What We Learned
Processing video at scale is one of those problems that looks straightforward until you're in production at 3 AM watching a queue pile up. This is the actual architecture behind ClipSpeedAI — what we built, why we made the tradeoffs we did, and what we'd do differently.
The Core Challenge
The product is simple to describe: take a YouTube URL, return 8-12 vertically reformatted short-form clips with captions and virality scores, in under 15 minutes.
The technical reality is a multi-stage async pipeline with three fundamentally different processing environments — JavaScript/Node.js for orchestration and API serving, Python for machine learning inference, and FFmpeg for video encoding — all needing to coordinate without stepping on each other.
Stage 1: Job Ingestion and Queue Management
Every video job enters through a REST API endpoint that validates the URL, creates a job record in Supabase, and pushes it to a Bull queue backed by Redis.
```javascript
// Job creation
const job = await videoQueue.add('process-video', {
  jobId: supabaseRecord.id,
  url: youtubeUrl,
  userId: user.id,
  options: { maxClips: 12, minDuration: 15, maxDuration: 90 }
}, {
  attempts: 3,
  backoff: { type: 'exponential', delay: 5000 },
  removeOnComplete: 50,
  removeOnFail: 20
});
```
Bull handles retry logic automatically, which matters for YouTube download failures — bot detection, rate limiting, and network timeouts are the most common failure modes in production.
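Not every failure deserves all three attempts, though. A small classifier (a sketch; the matched strings are illustrative patterns, not exact yt-dlp messages) lets the worker skip retries for videos that will never succeed:

```javascript
// Transient errors (rate limits, timeouts, bot checks) are worth retrying;
// permanent ones (private or removed videos) are not. Patterns illustrative.
const TRANSIENT_PATTERNS = [/429/, /timed? ?out/i, /sign in to confirm/i, /temporar/i];
const PERMANENT_PATTERNS = [/private video/i, /video unavailable/i, /removed/i];

function isRetryable(errorMessage) {
  if (PERMANENT_PATTERNS.some(p => p.test(errorMessage))) return false;
  if (TRANSIENT_PATTERNS.some(p => p.test(errorMessage))) return true;
  return true; // default: let Bull's backoff handle unknown errors
}
```

In the worker, a non-retryable error can call `job.discard()` before rethrowing, so Bull skips the remaining attempts instead of burning them on a private video.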
Stage 2: Video Download
We use yt-dlp via Node.js child_process.spawn with a streaming pipe directly into the processing stage. Early versions downloaded the full file first, which added 3-5 minutes of latency for long videos. Streaming cut that to near-zero.
One critical lesson: never proxy the video download itself. Proxy the metadata API calls to avoid bot detection on info fetching, but stream the actual video bytes directly. Routing gigabytes of video through a proxy burns bandwidth, slows processing, and has a high failure rate. Residential proxies are for tiny API calls only.
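Concretely, that splits into two yt-dlp invocations: a proxied metadata fetch, then a direct download. A minimal sketch (flag choices illustrative):

```javascript
// Metadata goes through the residential proxy; proxyUrl is a placeholder.
function metadataArgs(url, proxyUrl) {
  return ['--dump-json', '--skip-download', '--proxy', proxyUrl, url];
}

// The actual video bytes stream directly: note the absence of --proxy.
function downloadArgs(url) {
  return ['-o', '-', '--no-part', url];
}
```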
Stage 3: Transcript Extraction and LLM Analysis
This is where the virality scoring happens. We pull the transcript via YouTube's caption API, normalize the timing data, and pass it to GPT-4o with a structured evaluation prompt.
The prompt is the core IP. It evaluates each 30-120 second segment for:
- Hook strength (first sentence)
- Narrative completeness (does the clip land without external context?)
- Information density (signal-to-noise ratio)
- Emotional content (based on word patterns and sentence structure)
The output is structured JSON with clip candidates ranked by composite virality score. This takes 8-15 seconds per video depending on transcript length.
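Before those scores drive encoding, the response is worth validating defensively. A sketch, assuming a hypothetical response shape with a `clips` array of `start`/`end` seconds and a `viralityScore` field (field names illustrative, not the production schema):

```javascript
// Parse the model's JSON, drop candidates outside the 15-120s window,
// and re-rank by score so a misordered response can't slip through.
function rankClipCandidates(raw) {
  const parsed = JSON.parse(raw);
  return parsed.clips
    .filter(c => c.end - c.start >= 15 && c.end - c.start <= 120)
    .sort((a, b) => b.viralityScore - a.viralityScore);
}
```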
Stage 4: The Python Bridge — MediaPipe Face Detection
This is where architectures get messy. MediaPipe runs in Python. The rest of the stack is Node.js. The cleanest solution we found: a persistent Python subprocess with a simple JSON-over-stdin/stdout IPC protocol.
```python
# Python side — runs as a persistent child process
import sys, json
import mediapipe as mp

detector = mp.solutions.face_detection.FaceDetection(
    model_selection=1, min_detection_confidence=0.5
)

for line in sys.stdin:
    request = json.loads(line)
    # process_frame decodes the frame and runs the detector (omitted here)
    result = process_frame(request['frame_data'])
    sys.stdout.write(json.dumps(result) + '\n')
    sys.stdout.flush()
```
The alternative — spawning a new Python process per clip — was 40% slower due to model loading overhead. Persistent process with IPC is the right pattern here.
Railway threading gotcha: MediaPipe's default configuration tries to spawn multiple pthread workers. Railway's container environment restricts this, causing silent crashes. The fix is forcing single-threaded mode via environment variable before any MediaPipe import:
```python
import os
# Must be set before the first mediapipe import
os.environ['MEDIAPIPE_DISABLE_GPU'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'
```
Stage 5: FFmpeg Encoding
With face position data in hand, FFmpeg applies a dynamic crop filter. The crop window follows the face with exponential smoothing to prevent jitter:
```javascript
function buildCropFilter(facePositions, inputWidth, outputWidth, outputHeight) {
  // Clamp each face-centered crop origin so the window stays in frame.
  const keyframes = facePositions.map(({ time, cx }) => ({
    time,
    x: Math.round(Math.max(0, Math.min(
      cx * inputWidth - outputWidth / 2,
      inputWidth - outputWidth
    )))
  }));

  // Fold the keyframes into a piecewise x expression: hold each crop
  // position from its timestamp until the next keyframe.
  const xExpr = keyframes.reduceRight(
    (acc, { time, x }) => `if(gte(t,${time}),${x},${acc})`,
    '0'
  );

  return `crop=${outputWidth}:${outputHeight}:x='${xExpr}':y=0`;
}
```
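The smoothing itself is a standard exponential moving average over the raw face-center samples, applied before the positions become crop keyframes. A sketch, with `alpha` as a hypothetical tuning knob (lower values follow the face more slowly but produce a steadier crop):

```javascript
// Exponentially smooth normalized face-center x positions (0..1).
// The first sample passes through; each later sample blends toward
// the running average, damping frame-to-frame jitter.
function smoothFacePositions(facePositions, alpha = 0.2) {
  let smoothed = null;
  return facePositions.map(({ time, cx }) => {
    smoothed = smoothed === null ? cx : alpha * cx + (1 - alpha) * smoothed;
    return { time, cx: smoothed };
  });
}
```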
Caption burn-in happens in the same FFmpeg pass using drawtext filters — combining crop + caption in one encode rather than two separate passes reduces processing time by roughly 35%.
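A sketch of how the two filters join into one chain; the drawtext styling values are placeholders, not our production settings:

```javascript
// Append a drawtext caption to an existing crop filter so both run in a
// single decode/encode pass. Escapes the characters that break
// FFmpeg filtergraph parsing inside text values.
function buildFilterChain(cropExpr, captionText) {
  const escaped = captionText.replace(/'/g, "\\'").replace(/:/g, '\\:');
  return [
    cropExpr,
    `drawtext=text='${escaped}':fontsize=48:fontcolor=white:` +
      `borderw=3:x=(w-text_w)/2:y=h-200`
  ].join(',');
}
```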
What We'd Do Differently
Separate the Python ML service from the start. Running a sidecar process is workable but fragile under load. A proper FastAPI microservice with a job-specific endpoint would be cleaner and more independently scalable.
Use S3 presigned URLs for client delivery from day one. Early versions served processed video through the API server. Wrong call. Let S3 handle delivery, let your API handle orchestration.
Build the monitoring dashboard before the product. We flew blind for too long. Knowing your queue depth, average job duration, and failure rates by stage is essential for production.
ClipSpeedAI is in production, processing hundreds of videos weekly. If you're building something similar or want to dig into any of these stages further, drop a comment.