
Kyle White

The Tech Stack Behind an AI Content Automation Tool

When people ask what stack powers ClipSpeedAI, the answer is less interesting than the reasoning behind each choice. This post goes through every layer of the stack — frontend, API, processing pipeline, AI integrations, storage, and infrastructure — with the rationale for each technology decision.

The Problem Domain

ClipSpeedAI takes YouTube videos and automatically produces short-form clips (YouTube Shorts, TikTok, Reels). Each job involves:

  • Downloading a multi-hundred-MB video file
  • Transcribing audio (30 min video = ~5MB audio file)
  • AI scoring of candidate clips
  • Face detection and crop calculation
  • Re-encoding to vertical format
  • Generating and burning captions

This is not a simple CRUD app. The system is compute-bound, latency-tolerant (users expect async processing), and requires multiple AI/ML subsystems to cooperate.

Frontend: Next.js

Next.js for the web client. The reasons are straightforward:

  • SSR for the landing/marketing pages (SEO matters)
  • React for the interactive editor UI
  • API routes for lightweight server-side operations
  • Easy deployment to Vercel

The clip editor UI is React-heavy: a video player with timeline scrubbing, caption overlay preview, and crop position controls. The WebSocket connection for real-time progress updates lives here.

// hooks/useJobProgress.js
import { useEffect, useState } from 'react';

export function useJobProgress(jobId) {
  const [progress, setProgress] = useState({ stage: 'queued', pct: 0 });

  useEffect(() => {
    if (!jobId) return;

    const ws = new WebSocket(`${process.env.NEXT_PUBLIC_WS_URL}/jobs/${jobId}`);

    ws.onmessage = (event) => {
      const data = JSON.parse(event.data);
      setProgress(data);
    };

    return () => ws.close();
  }, [jobId]);

  return progress;
}

Backend: Node.js + Express

The API layer is Node.js with Express. This was an easy call — the team was already on JavaScript, and for an I/O-heavy API (mostly coordinating between Redis, storage, and external APIs), Node's event loop is well-suited.

Route structure:

POST   /api/jobs              → submit new video job
GET    /api/jobs/:id          → get job status + results
GET    /api/jobs/:id/clips    → get processed clip URLs
DELETE /api/jobs/:id          → cancel/delete job
GET    /health                → health check endpoint
WS     /jobs/:id              → WebSocket for progress updates
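A minimal sketch of what the POST /api/jobs handler might look like (names and payload shape are assumptions, not ClipSpeedAI's actual code). The handler is a plain (req, res) function with the download queue injected, so it can be exercised without spinning up Express:

```javascript
// Hypothetical sketch of the job submission handler; the payload shape
// ({ videoUrl }) and the queue job name are illustrative assumptions.

function isValidYouTubeUrl(url) {
  try {
    const { hostname } = new URL(url);
    return ['www.youtube.com', 'youtube.com', 'youtu.be'].includes(hostname);
  } catch {
    return false;
  }
}

function makeSubmitJobHandler(downloadQueue) {
  return async (req, res) => {
    const { videoUrl } = req.body ?? {};
    if (!isValidYouTubeUrl(videoUrl)) {
      return res.status(400).json({ error: 'A valid YouTube URL is required' });
    }
    // Enqueue only the first pipeline stage; later stages are chained
    // by the workers themselves as each stage completes.
    const job = await downloadQueue.add('download', { videoUrl });
    return res.status(202).json({ jobId: job.id, status: 'queued' });
  };
}
```

Returning 202 Accepted (rather than 200) signals the async contract up front: the client gets a job ID immediately and follows progress over the WebSocket.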

Queue Layer: BullMQ + Redis

Every processing job goes through a BullMQ queue backed by Redis. Four queues for the four pipeline stages:

import { Queue } from 'bullmq';

const REDIS = { host: process.env.REDIS_HOST, port: 6379 };

export const queues = {
  download:   new Queue('video:download',   { connection: REDIS }),
  transcribe: new Queue('video:transcribe', { connection: REDIS }),
  score:      new Queue('video:score',      { connection: REDIS }),
  encode:     new Queue('video:encode',     { connection: REDIS }),
};

Redis is also used for: API response caching, transcript result caching (avoid re-transcribing the same YouTube video), and session storage.
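The transcript cache is worth a sketch, because the cache key matters: keying on the YouTube video ID rather than the raw URL means the same video submitted via youtu.be, youtube.com/watch, or with extra query params hits the same entry. This is an illustrative sketch, not ClipSpeedAI's code; the Redis client is injected (anything with get/set, e.g. ioredis), and the 7-day TTL is an assumption:

```javascript
// Normalize any YouTube URL form down to the video ID.
function extractVideoId(url) {
  const u = new URL(url);
  if (u.hostname === 'youtu.be') return u.pathname.slice(1);
  return u.searchParams.get('v');
}

function transcriptCacheKey(url) {
  return `transcript:${extractVideoId(url)}`;
}

// `redis` is any client exposing get/set; `transcribeFn` does the actual
// Whisper call. Cache hit skips transcription entirely.
async function getOrTranscribe(redis, url, transcribeFn) {
  const key = transcriptCacheKey(url);
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  const transcript = await transcribeFn(url);
  // 7-day TTL is an assumed value, long enough to cover retries and re-edits.
  await redis.set(key, JSON.stringify(transcript), 'EX', 60 * 60 * 24 * 7);
  return transcript;
}
```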

Video Processing: FFmpeg

FFmpeg is the backbone. Every video operation runs through it:

  • Segment extraction
  • Crop and scale to 9:16
  • Caption burning (ASS subtitle files)
  • Audio extraction for Whisper
  • Thumbnail generation

FFmpeg is called via subprocess — fluent-ffmpeg for complex filter chains, execa for simpler invocations.
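The crop-and-scale step can be illustrated by how the FFmpeg arguments are assembled. This is a dependency-free sketch, not the production filter chain: the output size (1080x1920), CRF 23, and the function names are assumptions, and cropCenterX stands in for the horizontal crop centre produced by the face-detection stage:

```javascript
// Build a crop+scale filter for a 9:16 vertical window centred on the face.
function vertical916Filter(srcW, srcH, cropCenterX) {
  const cropW = Math.round(srcH * 9 / 16);          // widest 9:16 crop at full height
  // Clamp so the crop window never leaves the frame.
  const x = Math.min(Math.max(cropCenterX - cropW / 2, 0), srcW - cropW);
  return `crop=${cropW}:${srcH}:${Math.round(x)}:0,scale=1080:1920`;
}

// Assemble argv for one clip: segment extraction + crop/scale + re-encode.
function buildEncodeArgs(input, output, startSec, durationSec, srcW, srcH, cropCenterX) {
  return [
    '-ss', String(startSec), '-t', String(durationSec), // segment extraction
    '-i', input,
    '-vf', vertical916Filter(srcW, srcH, cropCenterX),
    '-c:v', 'libx264', '-crf', '23', '-c:a', 'aac',
    output,
  ];
}
// These args would then be handed to ffmpeg via execa or child_process.spawn.
```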

AI Layer

Three AI services in the pipeline:

OpenAI Whisper for transcription. The whisper-1 model via API. Fast, accurate, and the word-level timestamp mode is essential for caption chunk generation.
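The word-level timestamps feed caption chunking directly. Here is a plausible sketch of that step (the word objects mirror the { word, start, end } shape in Whisper's verbose responses; the maxWords and maxGapSec tuning values are assumptions, not ClipSpeedAI's settings):

```javascript
// Group word-level timestamps into short caption chunks: start a new chunk
// when the current one is full or when there's a noticeable pause.
function chunkWords(words, maxWords = 4, maxGapSec = 0.6) {
  const chunks = [];
  let current = [];
  for (const w of words) {
    const prev = current[current.length - 1];
    if (current.length >= maxWords || (prev && w.start - prev.end > maxGapSec)) {
      chunks.push(current);
      current = [];
    }
    current.push(w);
  }
  if (current.length) chunks.push(current);
  return chunks.map((ws) => ({
    text: ws.map((w) => w.word).join(' '),
    start: ws[0].start,
    end: ws[ws.length - 1].end,
  }));
}
```

Each chunk then becomes one ASS subtitle event with its own start/end time, which is what makes the burned captions track the speech word-for-word.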

GPT-4o for clip scoring. Given a transcript window, GPT-4o evaluates virality potential across five dimensions and returns a composite score. Temperature 0.3, JSON response format, model gpt-4o (not mini — the quality difference is meaningful for creative scoring tasks).

MediaPipe (Python) for face detection. Runs as a Python child process from the Node.js worker. MediaPipe's face detection model, sampled at 1fps, with median smoothing and dead zone stabilization.
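The Node side consumes the Python detector's per-frame face positions and stabilizes them before computing crops. A sketch of that post-processing, under the assumption that the child process emits one face-centre x coordinate per sampled frame (window size and dead-zone width are illustrative values):

```javascript
// Median smoothing over a sliding window: kills single-frame detection
// spikes without lagging behind genuine movement the way a mean would.
function medianSmooth(values, window = 5) {
  const half = Math.floor(window / 2);
  return values.map((_, i) => {
    const slice = values.slice(Math.max(0, i - half), i + half + 1);
    const sorted = [...slice].sort((a, b) => a - b);
    return sorted[Math.floor(sorted.length / 2)];
  });
}

// Dead zone: hold the crop position until the face drifts far enough,
// so small head movements don't make the frame jitter.
function applyDeadZone(positions, deadZonePx = 40) {
  const out = [];
  let anchor = positions[0];
  for (const p of positions) {
    if (Math.abs(p - anchor) > deadZonePx) anchor = p;
    out.push(anchor);
  }
  return out;
}
```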

// ai/scoring.js - abbreviated interface
export async function scoreClipCandidates(transcriptSegments) {
  const windows = buildScoringWindows(transcriptSegments, 90, 15);
  const scores = await Promise.all(windows.map(scoreWindow));
  return scores.sort((a, b) => b.composite_score - a.composite_score);
}
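A plausible implementation of buildScoringWindows, assuming its second and third arguments are window length and stride in seconds: slide a 90-second window across the transcript in 15-second steps, collecting whichever segments overlap each window. The segment shape ({ start, end, text }) is an assumption based on the surrounding code:

```javascript
// Slide a fixed-length window across the transcript; each window becomes
// one candidate clip sent to the scoring model. Overlapping strides mean
// a strong moment near a window boundary still gets scored in context.
function buildScoringWindows(segments, windowSec = 90, strideSec = 15) {
  const end = Math.max(...segments.map((s) => s.end));
  const windows = [];
  for (let start = 0; start < end; start += strideSec) {
    const inWindow = segments.filter(
      (s) => s.start < start + windowSec && s.end > start
    );
    if (inWindow.length) {
      windows.push({
        start,
        end: Math.min(start + windowSec, end),
        text: inWindow.map((s) => s.text).join(' '),
      });
    }
  }
  return windows;
}
```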

Storage: Cloudflare R2

All processed clips and intermediate files that need to persist go to Cloudflare R2 (S3-compatible, ~10x cheaper egress than S3). Temp processing files go to /tmp on the server.

import { S3Client, PutObjectCommand, GetObjectCommand } from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';

const r2 = new S3Client({
  region: 'auto',
  endpoint: process.env.R2_ENDPOINT,
  credentials: {
    accessKeyId: process.env.R2_ACCESS_KEY_ID,
    secretAccessKey: process.env.R2_SECRET_ACCESS_KEY,
  },
});

export async function getPresignedDownloadUrl(key, expiresIn = 3600) {
  return getSignedUrl(r2, new GetObjectCommand({
    Bucket: process.env.R2_BUCKET,
    Key: key
  }), { expiresIn });
}

Database: Supabase (PostgreSQL)

Supabase for the primary database. PostgreSQL under the hood, with Supabase's client library for auth, real-time subscriptions, and row-level security.

Job state (queued, processing, complete, failed) lives in a jobs table. Clip metadata (URLs, scores, timestamps) in a clips table. User subscriptions and usage tracking in their respective tables.
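A sketch of how a worker might record job state against that jobs table. This is illustrative, not ClipSpeedAI's code: the supabase-js client is injected, and the transition guard is an assumed safeguard that keeps a late or retried worker from overwriting a terminal state:

```javascript
// Legal job state transitions; terminal states accept none.
const NEXT_STATES = {
  queued: ['processing'],
  processing: ['complete', 'failed'],
  complete: [],
  failed: [],
};

function canTransition(from, to) {
  return (NEXT_STATES[from] ?? []).includes(to);
}

// `supabase` is an injected supabase-js client.
async function setJobStatus(supabase, jobId, from, to) {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal job transition: ${from} -> ${to}`);
  }
  return supabase
    .from('jobs')
    .update({ status: to, updated_at: new Date().toISOString() })
    .eq('id', jobId);
}
```

Supabase's real-time subscriptions mean the frontend can also listen for these row updates directly, as a fallback if the WebSocket drops.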

Infrastructure: Railway

Everything runs on Railway — the Node.js API, the worker processes, and Redis. Single Dockerfile that installs both Node.js and Python (for MediaPipe).

# railway.toml
[build]
builder = "DOCKERFILE"

[deploy]
healthcheckPath = "/health"
restartPolicyType = "ON_FAILURE"

Payments: Stripe

Stripe for subscription billing. Three tiers: free (3 clips/month), pro ($29/month, 100 clips), and agency ($99/month, unlimited). Stripe webhooks update the subscriptions table in Supabase when plans change.
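The webhook-to-Supabase flow can be sketched as a small dispatcher. Assumptions are labeled: the event is taken as already verified upstream (Stripe's real verification is stripe.webhooks.constructEvent against the raw request body), the plan is assumed to live in the Stripe price's metadata, and updateSubscription is a hypothetical injected function that writes to the subscriptions table:

```javascript
// Clip limits mirror the tiers described above.
const PLAN_LIMITS = { free: 3, pro: 100, agency: Infinity };

// `event` is a verified Stripe event object; `updateSubscription` is an
// injected writer for the Supabase subscriptions table.
async function handleStripeEvent(event, updateSubscription) {
  switch (event.type) {
    case 'customer.subscription.created':
    case 'customer.subscription.updated': {
      const sub = event.data.object;
      // Assumes each Stripe price carries a `plan` metadata key.
      const plan = sub.items.data[0].price.metadata.plan ?? 'free';
      await updateSubscription(sub.customer, { plan, clipLimit: PLAN_LIMITS[plan] });
      break;
    }
    case 'customer.subscription.deleted':
      // Cancellation drops the customer back to the free tier.
      await updateSubscription(event.data.object.customer, {
        plan: 'free',
        clipLimit: PLAN_LIMITS.free,
      });
      break;
    default:
      // Other event types are acknowledged but ignored.
      break;
  }
}
```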

The Stack Summary

Layer           Technology            Why
Frontend        Next.js               SSR + React
API             Node.js + Express     I/O-bound workloads
Queue           BullMQ + Redis        Async job processing
Video           FFmpeg                Industry standard
Transcription   OpenAI Whisper        Accuracy + timestamps
Scoring         GPT-4o                Best creative reasoning
Face detection  MediaPipe (Python)    Fast CPU inference
Storage         Cloudflare R2         Cheap egress
Database        Supabase/PostgreSQL   Managed, with auth
Infrastructure  Railway               Simple PaaS
Payments        Stripe                Standard

The full product — ClipSpeedAI — runs on this stack today. No Kubernetes, no microservices mesh, no infrastructure team. Just a well-organized monolith with a sensible queue architecture.

The lesson from building this: boring technology choices made correctly outperform exciting technology choices made prematurely. The hosted version of this stack is available at ClipSpeedAI for teams who want the output — AI-scored, captioned, vertical clips — without assembling each layer themselves.
