
Kyle White

The Tech Stack Behind an AI Content Automation Tool

When people ask what stack powers ClipSpeedAI, the answer is less interesting than the reasoning behind each choice. This post goes through every layer of the stack — frontend, API, processing pipeline, AI integrations, storage, and infrastructure — with the rationale for each technology decision.

The Problem Domain

ClipSpeedAI takes YouTube videos and automatically produces short-form clips (YouTube Shorts, TikTok, Reels). Each job involves:

  • Downloading a multi-hundred-MB video file
  • Transcribing audio (30 min video = ~5MB audio file)
  • AI scoring of candidate clips
  • Face detection and crop calculation
  • Re-encoding to vertical format
  • Generating and burning captions

This is not a simple CRUD app. The system is compute-bound, latency-tolerant (users expect async processing), and requires multiple AI/ML subsystems to cooperate.

Frontend: Next.js

Next.js for the web client. The reasons are straightforward:

  • SSR for the landing/marketing pages (SEO matters)
  • React for the interactive editor UI
  • API routes for lightweight server-side operations
  • Easy deployment to Vercel

The clip editor UI is React-heavy: a video player with timeline scrubbing, caption overlay preview, and crop position controls. The WebSocket connection for real-time progress updates lives here.

// hooks/useJobProgress.js
import { useEffect, useState } from 'react';

export function useJobProgress(jobId) {
  const [progress, setProgress] = useState({ stage: 'queued', pct: 0 });

  useEffect(() => {
    if (!jobId) return;

    const ws = new WebSocket(`${process.env.NEXT_PUBLIC_WS_URL}/jobs/${jobId}`);

    ws.onmessage = (event) => {
      const data = JSON.parse(event.data);
      setProgress(data);
    };

    return () => ws.close();
  }, [jobId]);

  return progress;
}

Backend: Node.js + Express

The API layer is Node.js with Express. This was an easy call — the team was already on JavaScript, and for an I/O-heavy API (mostly coordinating between Redis, storage, and external APIs), Node's event loop is well-suited.

Route structure:

POST   /api/jobs              → submit new video job
GET    /api/jobs/:id          → get job status + results
GET    /api/jobs/:id/clips    → get processed clip URLs
DELETE /api/jobs/:id          → cancel/delete job
GET    /health                → health check endpoint
WS     /jobs/:id              → WebSocket for progress updates
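A minimal sketch of what the POST /api/jobs handler might look like (names and payload shape are assumptions, not ClipSpeedAI's actual code). The handler is a plain (req, res) function with the download queue injected, so it can be exercised without spinning up Express:

```javascript
// Hypothetical sketch of the job submission handler; the payload shape
// ({ videoUrl }) and the queue job name are illustrative assumptions.

function isValidYouTubeUrl(url) {
  try {
    const { hostname } = new URL(url);
    return ['www.youtube.com', 'youtube.com', 'youtu.be'].includes(hostname);
  } catch {
    return false;
  }
}

function makeSubmitJobHandler(downloadQueue) {
  return async (req, res) => {
    const { videoUrl } = req.body ?? {};
    if (!isValidYouTubeUrl(videoUrl)) {
      return res.status(400).json({ error: 'A valid YouTube URL is required' });
    }
    // Enqueue only the first pipeline stage; later stages are chained
    // by the workers themselves as each stage completes.
    const job = await downloadQueue.add('download', { videoUrl });
    return res.status(202).json({ jobId: job.id, status: 'queued' });
  };
}
```

Returning 202 Accepted (rather than 200) signals the async contract up front: the client gets a job ID immediately and follows progress over the WebSocket.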

Queue Layer: BullMQ + Redis

Every processing job goes through a BullMQ queue backed by Redis. Four queues for the four pipeline stages:

import { Queue } from 'bullmq';

const REDIS = { host: process.env.REDIS_HOST, port: 6379 };

export const queues = {
  download:   new Queue('video:download',   { connection: REDIS }),
  transcribe: new Queue('video:transcribe', { connection: REDIS }),
  score:      new Queue('video:score',      { connection: REDIS }),
  encode:     new Queue('video:encode',     { connection: REDIS }),
};

Redis is also used for: API response caching, transcript result caching (avoid re-transcribing the same YouTube video), and session storage.
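The transcript cache is worth a sketch, because the cache key matters: keying on the YouTube video ID rather than the raw URL means the same video submitted via youtu.be, youtube.com/watch, or with extra query params hits the same entry. This is an illustrative sketch, not ClipSpeedAI's code; the Redis client is injected (anything with get/set, e.g. ioredis), and the 7-day TTL is an assumption:

```javascript
// Normalize any YouTube URL form down to the video ID.
function extractVideoId(url) {
  const u = new URL(url);
  if (u.hostname === 'youtu.be') return u.pathname.slice(1);
  return u.searchParams.get('v');
}

function transcriptCacheKey(url) {
  return `transcript:${extractVideoId(url)}`;
}

// `redis` is any client exposing get/set; `transcribeFn` does the actual
// Whisper call. Cache hit skips transcription entirely.
async function getOrTranscribe(redis, url, transcribeFn) {
  const key = transcriptCacheKey(url);
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  const transcript = await transcribeFn(url);
  // 7-day TTL is an assumed value, long enough to cover retries and re-edits.
  await redis.set(key, JSON.stringify(transcript), 'EX', 60 * 60 * 24 * 7);
  return transcript;
}
```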

Video Processing: FFmpeg

FFmpeg is the backbone. Every video operation runs through it:

  • Segment extraction
  • Crop and scale to 9:16
  • Caption burning (ASS subtitle files)
  • Audio extraction for Whisper
  • Thumbnail generation

FFmpeg is called via subprocess — fluent-ffmpeg for complex filter chains, execa for simpler invocations.
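The crop-and-scale step can be illustrated by how the FFmpeg arguments are assembled. This is a dependency-free sketch, not the production filter chain: the output size (1080x1920), CRF 23, and the function names are assumptions, and cropCenterX stands in for the horizontal crop centre produced by the face-detection stage:

```javascript
// Build a crop+scale filter for a 9:16 vertical window centred on the face.
function vertical916Filter(srcW, srcH, cropCenterX) {
  const cropW = Math.round(srcH * 9 / 16);          // widest 9:16 crop at full height
  // Clamp so the crop window never leaves the frame.
  const x = Math.min(Math.max(cropCenterX - cropW / 2, 0), srcW - cropW);
  return `crop=${cropW}:${srcH}:${Math.round(x)}:0,scale=1080:1920`;
}

// Assemble argv for one clip: segment extraction + crop/scale + re-encode.
function buildEncodeArgs(input, output, startSec, durationSec, srcW, srcH, cropCenterX) {
  return [
    '-ss', String(startSec), '-t', String(durationSec), // segment extraction
    '-i', input,
    '-vf', vertical916Filter(srcW, srcH, cropCenterX),
    '-c:v', 'libx264', '-crf', '23', '-c:a', 'aac',
    output,
  ];
}
// These args would then be handed to ffmpeg via execa or child_process.spawn.
```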

AI Layer

Three AI services in the pipeline:

OpenAI Whisper for transcription. The whisper-1 model via API. Fast, accurate, and the word-level timestamp mode is essential for caption chunk generation.
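The word-level timestamps feed caption chunking directly. Here is a plausible sketch of that step (the word objects mirror the { word, start, end } shape in Whisper's verbose responses; the maxWords and maxGapSec tuning values are assumptions, not ClipSpeedAI's settings):

```javascript
// Group word-level timestamps into short caption chunks: start a new chunk
// when the current one is full or when there's a noticeable pause.
function chunkWords(words, maxWords = 4, maxGapSec = 0.6) {
  const chunks = [];
  let current = [];
  for (const w of words) {
    const prev = current[current.length - 1];
    if (current.length >= maxWords || (prev && w.start - prev.end > maxGapSec)) {
      chunks.push(current);
      current = [];
    }
    current.push(w);
  }
  if (current.length) chunks.push(current);
  return chunks.map((ws) => ({
    text: ws.map((w) => w.word).join(' '),
    start: ws[0].start,
    end: ws[ws.length - 1].end,
  }));
}
```

Each chunk then becomes one ASS subtitle event with its own start/end time, which is what makes the burned captions track the speech word-for-word.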

GPT-4o for clip scoring. Given a transcript window, GPT-4o evaluates virality potential across five dimensions and returns a composite score. Temperature 0.3, JSON response format, model gpt-4o (not mini — the quality difference is meaningful for creative scoring tasks).

MediaPipe (Python) for face detection. Runs as a Python child process from the Node.js worker. MediaPipe's face detection model, sampled at 1fps, with median smoothing and dead zone stabilization.
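The Node side consumes the Python detector's per-frame face positions and stabilizes them before computing crops. A sketch of that post-processing, under the assumption that the child process emits one face-centre x coordinate per sampled frame (window size and dead-zone width are illustrative values):

```javascript
// Median smoothing over a sliding window: kills single-frame detection
// spikes without lagging behind genuine movement the way a mean would.
function medianSmooth(values, window = 5) {
  const half = Math.floor(window / 2);
  return values.map((_, i) => {
    const slice = values.slice(Math.max(0, i - half), i + half + 1);
    const sorted = [...slice].sort((a, b) => a - b);
    return sorted[Math.floor(sorted.length / 2)];
  });
}

// Dead zone: hold the crop position until the face drifts far enough,
// so small head movements don't make the frame jitter.
function applyDeadZone(positions, deadZonePx = 40) {
  const out = [];
  let anchor = positions[0];
  for (const p of positions) {
    if (Math.abs(p - anchor) > deadZonePx) anchor = p;
    out.push(anchor);
  }
  return out;
}
```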

// ai/scoring.js - abbreviated interface
export async function scoreClipCandidates(transcriptSegments) {
  const windows = buildScoringWindows(transcriptSegments, 90, 15);
  const scores = await Promise.all(windows.map(scoreWindow));
  return scores.sort((a, b) => b.composite_score - a.composite_score);
}
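A plausible implementation of buildScoringWindows, assuming its second and third arguments are window length and stride in seconds: slide a 90-second window across the transcript in 15-second steps, collecting whichever segments overlap each window. The segment shape ({ start, end, text }) is an assumption based on the surrounding code:

```javascript
// Slide a fixed-length window across the transcript; each window becomes
// one candidate clip sent to the scoring model. Overlapping strides mean
// a strong moment near a window boundary still gets scored in context.
function buildScoringWindows(segments, windowSec = 90, strideSec = 15) {
  const end = Math.max(...segments.map((s) => s.end));
  const windows = [];
  for (let start = 0; start < end; start += strideSec) {
    const inWindow = segments.filter(
      (s) => s.start < start + windowSec && s.end > start
    );
    if (inWindow.length) {
      windows.push({
        start,
        end: Math.min(start + windowSec, end),
        text: inWindow.map((s) => s.text).join(' '),
      });
    }
  }
  return windows;
}
```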

Storage: Cloudflare R2

All processed clips and intermediate files that need to persist go to Cloudflare R2 (S3-compatible, ~10x cheaper egress than S3). Temp processing files go to /tmp on the server.

import { S3Client, PutObjectCommand, GetObjectCommand } from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';

const r2 = new S3Client({
  region: 'auto',
  endpoint: process.env.R2_ENDPOINT,
  credentials: {
    accessKeyId: process.env.R2_ACCESS_KEY_ID,
    secretAccessKey: process.env.R2_SECRET_ACCESS_KEY,
  },
});

export async function getPresignedDownloadUrl(key, expiresIn = 3600) {
  return getSignedUrl(r2, new GetObjectCommand({
    Bucket: process.env.R2_BUCKET,
    Key: key
  }), { expiresIn });
}

Database: Supabase (PostgreSQL)

Supabase for the primary database. PostgreSQL under the hood, with Supabase's client library for auth, real-time subscriptions, and row-level security.

Job state (queued, processing, complete, failed) lives in a jobs table. Clip metadata (URLs, scores, timestamps) in a clips table. User subscriptions and usage tracking in their respective tables.
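A sketch of how a worker might record job state against that jobs table. This is illustrative, not ClipSpeedAI's code: the supabase-js client is injected, and the transition guard is an assumed safeguard that keeps a late or retried worker from overwriting a terminal state:

```javascript
// Legal job state transitions; terminal states accept none.
const NEXT_STATES = {
  queued: ['processing'],
  processing: ['complete', 'failed'],
  complete: [],
  failed: [],
};

function canTransition(from, to) {
  return (NEXT_STATES[from] ?? []).includes(to);
}

// `supabase` is an injected supabase-js client.
async function setJobStatus(supabase, jobId, from, to) {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal job transition: ${from} -> ${to}`);
  }
  return supabase
    .from('jobs')
    .update({ status: to, updated_at: new Date().toISOString() })
    .eq('id', jobId);
}
```

Supabase's real-time subscriptions mean the frontend can also listen for these row updates directly, as a fallback if the WebSocket drops.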

Infrastructure: Railway

Everything runs on Railway — the Node.js API, the worker processes, and Redis. Single Dockerfile that installs both Node.js and Python (for MediaPipe).

# railway.toml
[build]
builder = "DOCKERFILE"

[deploy]
healthcheckPath = "/health"
restartPolicyType = "ON_FAILURE"

Payments: Stripe

Stripe for subscription billing. Three tiers: free (3 clips/month), pro ($29/month, 100 clips), and agency ($99/month, unlimited). Stripe webhooks update the subscriptions table in Supabase when plans change.
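The webhook-to-Supabase flow can be sketched as a small dispatcher. Assumptions are labeled: the event is taken as already verified upstream (Stripe's real verification is stripe.webhooks.constructEvent against the raw request body), the plan is assumed to live in the Stripe price's metadata, and updateSubscription is a hypothetical injected function that writes to the subscriptions table:

```javascript
// Clip limits mirror the tiers described above.
const PLAN_LIMITS = { free: 3, pro: 100, agency: Infinity };

// `event` is a verified Stripe event object; `updateSubscription` is an
// injected writer for the Supabase subscriptions table.
async function handleStripeEvent(event, updateSubscription) {
  switch (event.type) {
    case 'customer.subscription.created':
    case 'customer.subscription.updated': {
      const sub = event.data.object;
      // Assumes each Stripe price carries a `plan` metadata key.
      const plan = sub.items.data[0].price.metadata.plan ?? 'free';
      await updateSubscription(sub.customer, { plan, clipLimit: PLAN_LIMITS[plan] });
      break;
    }
    case 'customer.subscription.deleted':
      // Cancellation drops the customer back to the free tier.
      await updateSubscription(event.data.object.customer, {
        plan: 'free',
        clipLimit: PLAN_LIMITS.free,
      });
      break;
    default:
      // Other event types are acknowledged but ignored.
      break;
  }
}
```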

The Stack Summary

Layer           Technology            Why
Frontend        Next.js               SSR + React
API             Node.js + Express     I/O-bound workloads
Queue           BullMQ + Redis        Async job processing
Video           FFmpeg                Industry standard
Transcription   OpenAI Whisper        Accuracy + timestamps
Scoring         GPT-4o                Best creative reasoning
Face detection  MediaPipe (Python)    Fast CPU inference
Storage         Cloudflare R2         Cheap egress
Database        Supabase/PostgreSQL   Managed, with auth
Infrastructure  Railway               Simple PaaS
Payments        Stripe                Standard

The full product — ClipSpeedAI — runs on this stack today. No Kubernetes, no microservices mesh, no infrastructure team. Just a well-organized monolith with a sensible queue architecture.

The lesson from building this: boring technology choices made correctly outperform exciting technology choices made prematurely. The hosted version of this stack is available at ClipSpeedAI for teams who want the output — AI-scored, captioned, vertical clips — without assembling each layer themselves.
