
Carlos Chao (El Frontend)


I Built My Own AI-Powered Video Clipping Tool Because the Paid Ones Were Too Expensive (And the AI Can Do It for Me)

Video Wizard

The problem?

Most tools that turn long-form video into short clips are paid. And not just a little — quite expensive.

Now, I'm not being cheap. But I thought:

"An engineer built this tool. I'm an engineer too. So I can build my own."

And then I reflected:

"Well... not just me. Me + Claude."

And that's how Video Wizard was born — an open-source, AI-powered tool for turning long-form video into short-form content with subtitles, smart cropping, and professional templates.

In this post, I'll walk you through the architecture, the tech decisions, and what I learned building a production-grade video processing pipeline as a side project.


What Does It Do?

Here's the full pipeline:

  1. Upload a video (or paste a YouTube URL)
  2. AI transcribes the audio using OpenAI Whisper
  3. GPT-4o analyzes the transcript and detects the most viral moments (scored 0-100)
  4. Smart cropping with face detection (MediaPipe + OpenCV) reframes for vertical
  5. Edit subtitles in a visual editor — merge, split, clean filler words
  6. Pick a template — 9 professional caption styles built as React components
  7. Render the final video with Remotion
  8. Download — ready for TikTok, Reels, or YouTube Shorts

But it's not just a clip extractor. There's also a standalone Subtitle Generator that skips the AI analysis and goes straight from upload to rendered subtitles, and a Content Intelligence tool for analyzing transcripts without video.


The Architecture: Three Services, One Goal

I chose a microservices monorepo approach with Turborepo. Three independent services, each doing what it does best:

graph TB
    subgraph Frontend["Next.js 16 App — Port 3000"]
        UI["React 19 UI<br/>Tailwind + shadcn/ui"]
        API["API Routes<br/>(thin HTTP layer)"]
        Services["Service Layer<br/>(business logic)"]
        UI --> API
        API --> Services
    end

    subgraph Python["Python Engine — Port 8000"]
        Whisper["OpenAI Whisper<br/>Transcription"]
        Face["MediaPipe<br/>Face Detection"]
        FFmpeg["FFmpeg<br/>Video Processing"]
        YT["yt-dlp<br/>YouTube Download"]
    end

    subgraph Remotion["Remotion Server — Port 3001"]
        Templates["9 React Templates<br/>viral · minimal · hormozi<br/>mrbeast · modern · highlight..."]
        Queue["Job Queue<br/>Async Rendering"]
        Renderer["Remotion Renderer<br/>React → MP4"]
        Templates --> Renderer
        Queue --> Renderer
    end

    subgraph External["External APIs"]
        OpenAI["OpenAI<br/>GPT-4o + Whisper API"]
        YouTube["YouTube"]
    end

    Services -->|"transcribe / upload<br/>face detection"| Python
    Services -->|"render job<br/>poll status"| Remotion
    Services -->|"viral analysis<br/>structured output"| OpenAI
    YT -->|"download"| YouTube

    style Frontend fill:#0a0a0a,stroke:#3b82f6,stroke-width:2px,color:#e2e8f0
    style Python fill:#0a0a0a,stroke:#22c55e,stroke-width:2px,color:#e2e8f0
    style Remotion fill:#0a0a0a,stroke:#a855f7,stroke-width:2px,color:#e2e8f0
    style External fill:#0a0a0a,stroke:#f59e0b,stroke-width:2px,color:#e2e8f0

And here's the full video processing pipeline — from upload to download:

flowchart LR
    A["Upload Video<br/>or YouTube URL"] --> B["Python Engine"]
    B --> C{"Whisper<br/>Transcription"}
    C --> D["Subtitles<br/>(seconds)"]
    D --> E["GPT-4o<br/>Viral Analysis"]
    E --> F["Viral Clips<br/>scored 0-100"]
    D --> G["Subtitle Editor<br/>(milliseconds)"]
    G --> H["Select Template<br/>+ Aspect Ratio"]
    H --> I["Remotion Server"]
    I --> J["React Components<br/>→ Video Frames"]
    J --> K["FFmpeg Encoding<br/>→ MP4"]
    K --> L["Download"]

    style A fill:#1e293b,stroke:#3b82f6,color:#e2e8f0
    style B fill:#1e293b,stroke:#22c55e,color:#e2e8f0
    style C fill:#1e293b,stroke:#22c55e,color:#e2e8f0
    style D fill:#1e293b,stroke:#f59e0b,color:#e2e8f0
    style E fill:#1e293b,stroke:#f59e0b,color:#e2e8f0
    style F fill:#1e293b,stroke:#f59e0b,color:#e2e8f0
    style G fill:#1e293b,stroke:#3b82f6,color:#e2e8f0
    style H fill:#1e293b,stroke:#3b82f6,color:#e2e8f0
    style I fill:#1e293b,stroke:#a855f7,color:#e2e8f0
    style J fill:#1e293b,stroke:#a855f7,color:#e2e8f0
    style K fill:#1e293b,stroke:#a855f7,color:#e2e8f0
    style L fill:#1e293b,stroke:#10b981,color:#e2e8f0

Why three services?

  • Python has the best video/ML libraries (FFmpeg, MediaPipe, Whisper). There's no good alternative in the JS ecosystem for face detection + smart cropping.
  • Remotion needs its own server because it bundles React components into video frames — it's resource-intensive and benefits from isolated rendering.
  • Next.js handles the UI, API routing, and orchestration. It's the glue.

The Key Insight: Video as React Components

This was the "aha" moment for me.

I chose Remotion because it lets you treat video like a UI. Components, props, composition — but applied to audiovisual content.

Here's what a caption template looks like. It's just a React component:

import { AbsoluteFill, interpolate, useCurrentFrame } from 'remotion';

export function HormoziTemplate({ currentSegment, isActive, brandKit }: CaptionTemplateProps) {
  const frame = useCurrentFrame();

  if (!isActive || !currentSegment) return null;

  const words = currentSegment.text.split(' ');

  return (
    <AbsoluteFill style={{ justifyContent: 'flex-end', alignItems: 'center' }}>
      <div style={{ display: 'flex', gap: 12 }}>
        {words.map((word, i) => {
          // Stagger each word by 4 frames; animate over 8 frames, then hold.
          const wordDelay = i * 4;
          const clamp = { extrapolateLeft: 'clamp', extrapolateRight: 'clamp' } as const;
          const scale = interpolate(frame, [wordDelay, wordDelay + 8], [0.5, 1.15], clamp);
          const translateY = interpolate(frame, [wordDelay, wordDelay + 8], [40, 0], clamp);

          return (
            <span key={i} style={{
              transform: `translateY(${translateY}px) scale(${scale})`,
              color: brandKit?.textColor ?? '#FFFFFF',
              fontFamily: brandKit?.fontFamily ?? 'Montserrat',
            }}>
              {word}
            </span>
          );
        })}
      </div>
    </AbsoluteFill>
  );
}

Every frame is a function of time. useCurrentFrame() gives you the current frame number, and interpolate() maps it to any CSS property. No imperative video APIs, no timelines — just declarative React.
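Remotion's actual interpolate() supports easing and extrapolation options, but the core idea is a clamped linear map between two ranges. A minimal sketch of that idea (my own simplified version, not Remotion's implementation):

```typescript
// Linear map from an input range to an output range, clamped at both ends.
// This is roughly what a clamped interpolate() call does.
function lerpRange(
  value: number,
  [in0, in1]: [number, number],
  [out0, out1]: [number, number],
): number {
  const t = Math.min(1, Math.max(0, (value - in0) / (in1 - in0)));
  return out0 + t * (out1 - out0);
}

// frame 0 → scale 0.5, frame 8 → scale 1.15, later frames hold at 1.15
const scale = lerpRange(12, [0, 8], [0.5, 1.15]);
```

With that mental model, every animation in a template is just choosing input frames and output style values.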

I built 9 templates this way: from clean minimal to high-energy mrbeast style, each with different animations, typography, and color schemes. Switching templates is just changing a prop.

Here's how the Remotion composition layers work:

graph TB
    subgraph Remotion["Remotion Render Pipeline"]
        Frame["useCurrentFrame()<br/>frame = 450, fps = 30"]
        Time["currentTime = frame / fps<br/>= 15.0 seconds"]
        Frame --> Time

        subgraph Layers["AbsoluteFill Layers"]
            L1["Layer 1: OffthreadVideo<br/>Source video stream"]
            L2["Layer 2: CaptionOverlay<br/>Template + active subtitle"]
            L3["Layer 3: Brand Logo<br/>(optional)"]
        end

        Time --> L2

        subgraph Hook["useActiveSubtitle Hook"]
            Find["Find segment where<br/>start ≤ time < end"]
            Offset["Apply 200ms offset"]
            Word["Return active word<br/>+ segment"]
            Find --> Offset --> Word
        end

        L2 --> Hook

        subgraph Switch["Template Router"]
            T1["viral"]
            T2["minimal"]
            T3["hormozi"]
            T4["mrbeast"]
            T5["+ 5 more..."]
        end

        Word --> Switch
    end

    style Remotion fill:#0a0a0a,stroke:#a855f7,stroke-width:2px,color:#e2e8f0
    style Layers fill:#1a1a2e,stroke:#6366f1,color:#e2e8f0
    style Hook fill:#1a1a2e,stroke:#f59e0b,color:#e2e8f0
    style Switch fill:#1a1a2e,stroke:#ec4899,color:#e2e8f0
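Stripped of React, the subtitle lookup in the diagram above comes down to a few lines. A sketch with illustrative types (not the project's exact API):

```typescript
interface SubtitleSegment {
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

const SUBTITLE_OFFSET = 0.2; // seconds — the 200ms perceptual offset

// Find the segment active at the (offset-adjusted) playback time.
function findActiveSegment(
  segments: SubtitleSegment[],
  currentTime: number,
): SubtitleSegment | null {
  const t = currentTime - SUBTITLE_OFFSET;
  return segments.find((s) => s.start <= t && t < s.end) ?? null;
}
```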

Smart Face Tracking: The Python Side

The most interesting algorithmic challenge was the smart cropping. When you convert a 16:9 video to 9:16, you need to decide where to crop — and ideally, you follow the speaker's face.

The processing engine uses MediaPipe's BlazeFace model for detection, then applies a weighted scoring algorithm when multiple faces appear:

score = (size_score * 0.5) + (center_proximity * 0.3) + (confidence * 0.2)
  • 50% weight on face size — the speaker is usually closest to camera
  • 30% on center proximity — main subjects tend to be centered
  • 20% on detection confidence — trust the model
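The real scoring runs in the Python engine; to keep this post in one language, here is the same idea sketched in TypeScript, with hypothetical field names and a heuristic size normalization of my own:

```typescript
interface DetectedFace {
  width: number;      // bounding box, normalized 0..1
  height: number;
  centerX: number;    // normalized 0..1
  centerY: number;
  confidence: number; // model confidence 0..1
}

// Weighted score: bigger faces, closer to frame center, higher confidence win.
function scoreFace(face: DetectedFace): number {
  const sizeScore = Math.min(1, face.width * face.height * 4); // heuristic normalization
  const dist = Math.hypot(face.centerX - 0.5, face.centerY - 0.5);
  const centerScore = 1 - Math.min(1, dist / Math.hypot(0.5, 0.5)); // 1 at center, 0 at corners
  return sizeScore * 0.5 + centerScore * 0.3 + face.confidence * 0.2;
}

function pickBestFace(faces: DetectedFace[]): DetectedFace | null {
  return faces.reduce<DetectedFace | null>(
    (best, f) => (best === null || scoreFace(f) > scoreFace(best) ? f : best),
    null,
  );
}
```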

After selecting the face per frame, a moving average smoothing filter eliminates jitter:

import numpy as np

# Centered moving average over per-frame face positions removes jitter
smoothed_positions = np.convolve(
    raw_positions,
    np.ones(window_size) / window_size,
    mode='same'
)
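For readers who don't speak NumPy, the same centered moving average can be written out by hand. This sketch matches the np.convolve call above for odd window sizes, zero-padded edges included:

```typescript
// Centered moving average with zero-padding at the edges,
// equivalent to np.convolve(xs, ones(w)/w, mode='same') for odd w.
function movingAverage(xs: number[], windowSize: number): number[] {
  const half = Math.floor(windowSize / 2);
  return xs.map((_, i) => {
    let sum = 0;
    for (let k = -half; k <= half; k++) {
      const j = i + k;
      if (j >= 0 && j < xs.length) sum += xs[j];
    }
    return sum / windowSize; // out-of-range neighbors count as zero
  });
}
```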
flowchart LR
    A["Input Video<br/>16:9"] --> B{"Aspect ratio<br/>matches target?"}
    B -->|Yes| C["Skip detection<br/>Center crop"]
    B -->|No| D["MediaPipe<br/>BlazeFace"]
    D --> E["Score faces<br/>size 50% + center 30%<br/>+ confidence 20%"]
    E --> F["Select best face<br/>per frame"]
    F --> G["Moving average<br/>smoothing filter"]
    G --> H["FFmpeg crop<br/>with tracking data"]
    C --> I["Output Video<br/>9:16 / 1:1 / 4:5"]
    H --> I

    style A fill:#1e293b,stroke:#22c55e,color:#e2e8f0
    style D fill:#1e293b,stroke:#22c55e,color:#e2e8f0
    style E fill:#1e293b,stroke:#f59e0b,color:#e2e8f0
    style G fill:#1e293b,stroke:#a855f7,color:#e2e8f0
    style I fill:#1e293b,stroke:#10b981,color:#e2e8f0

The result? Cinematic-quality camera movement that tracks the speaker without the "security camera" feel. And it's smart enough to skip face detection entirely when the source video already matches the target aspect ratio.


Subtitle Synchronization: The Hard Part Nobody Talks About

Getting subtitles to sync properly across three services with different time formats was the trickiest part:

Stage               Format         Example
------------------  -------------  --------------------------
Whisper output      Seconds        { start: 0.5, end: 2.3 }
Frontend editor     Milliseconds   { start: 500, end: 2300 }
Remotion renderer   Seconds        { start: 0.5, end: 2.3 }

One early bug had 60-second videos rendering as 0.06 seconds because the times were getting divided by 1000 twice. Fun times.
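One pattern that can catch that class of bug at compile time is branding the number types, so seconds and milliseconds can't be silently mixed. A sketch of the idea (not necessarily what the project does today):

```typescript
// Tag each number with its unit at the type level so a value
// can't be converted twice without the compiler complaining.
type Seconds = number & { readonly __unit: 'seconds' };
type Milliseconds = number & { readonly __unit: 'ms' };

const toMs = (s: Seconds): Milliseconds => (s * 1000) as Milliseconds;
const toSeconds = (ms: Milliseconds): Seconds => (ms / 1000) as Seconds;

const whisperStart = 0.5 as Seconds;
const editorStart = toMs(whisperStart); // 500
// toSeconds(whisperStart);             // compile error: wrong unit
```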

I also added a configurable 200ms subtitle offset to account for the perceptual delay between hearing a word and reading it:

const SUBTITLE_OFFSET = 0.2; // seconds
const adjustedTime = currentTime - SUBTITLE_OFFSET;

Small detail, but it makes the subtitles feel perfectly synced.


The Subtitle Cleanup Toolkit

Beyond basic editing, I built automated detection for common subtitle issues:

  • Silence detection — gaps > 1 second between segments
  • Filler word detection — "um", "uh", "like", "you know" (13 defaults)
  • Short segment detection — segments under 300ms (usually noise)

These are pure functions — no side effects, easily testable:

const result = detectIssues(subtitles, config);
const cleaned = removeDetectedIssues(subtitles, result.issues);
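A minimal version of such a detector might look like this, with illustrative names, a handful of the filler defaults, and the thresholds from the bullets above:

```typescript
interface Segment { start: number; end: number; text: string } // times in ms

interface SubtitleIssue { index: number; type: 'gap' | 'filler' | 'short' }

const FILLERS = new Set(['um', 'uh', 'like', 'you know']); // subset of the 13 defaults

// Pure function: scans segments and reports issues without mutating input.
function detectIssues(segments: Segment[]): SubtitleIssue[] {
  const issues: SubtitleIssue[] = [];
  segments.forEach((seg, i) => {
    if (seg.end - seg.start < 300) issues.push({ index: i, type: 'short' });
    if (FILLERS.has(seg.text.trim().toLowerCase())) issues.push({ index: i, type: 'filler' });
    if (i > 0 && seg.start - segments[i - 1].end > 1000) issues.push({ index: i, type: 'gap' });
  });
  return issues;
}
```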

One click and your subtitles go from raw Whisper output to clean, professional captions.


Architecture Decisions I'm Proud Of

Screaming Architecture

The folder structure tells you what the app does before you read a single line of code:

features/
└── video/
    ├── components/     # "I render video UI!"
    ├── hooks/          # "I manage video state!"
    ├── containers/     # "I orchestrate video workflows!"
    ├── types/          # "I define video data shapes!"
    └── lib/            # "I provide video utilities!"

Strict Separation of Concerns

API routes are thin — they only handle HTTP and delegate to services:

import { NextRequest, NextResponse } from 'next/server';

// API route: ~5 lines of logic
export async function POST(request: NextRequest) {
  const body = await request.json();
  const data = await subtitleGenerationService.generateSubtitles(body);
  return NextResponse.json({ success: true, data });
}

All business logic lives in service classes — reusable, testable, no HTTP concerns:

// Service: all the real work
export class SubtitleGenerationService {
  async generateSubtitles(input) {
    // 1. Call Python transcription
    // 2. Convert time formats
    // 3. Structure response
  }

  async renderWithSubtitles(input) {
    // 1. Send job to Remotion
    // 2. Poll until complete
    // 3. Return video URL
  }
}
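The "poll until complete" step can be one small generic helper. A sketch under assumed names (the real service may differ):

```typescript
// Polls checkStatus until it reports done, with a fixed interval and a timeout.
async function pollUntilDone<T>(
  checkStatus: () => Promise<{ done: boolean; result?: T }>,
  intervalMs = 1000,
  timeoutMs = 5 * 60 * 1000,
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const { done, result } = await checkStatus();
    if (done) return result as T;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error('Render job timed out');
}
```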

Zod Everywhere

Every external boundary is validated with Zod. Types are inferred, not duplicated:

import { z } from 'zod';

const BrandKitSchema = z.object({
  logoUrl: z.string().optional(),
  logoPosition: z.enum(['top-left', 'top-right', 'bottom-left', 'bottom-right']),
  logoScale: z.number().min(0.1).max(2),
  primaryColor: z.string().regex(/^#[0-9A-Fa-f]{6}$/).optional(),
});

type BrandKit = z.infer<typeof BrandKitSchema>; // Types from schemas, not the other way around

The Tech Stack

Layer             Technology              Why
----------------  ----------------------  -------------------------------------------
Frontend          Next.js 16 + React 19   App Router, server components, API routes
Styling           Tailwind + shadcn/ui    Fast, accessible, consistent
AI Analysis       Vercel AI SDK + GPT-4o  Structured output for viral clip detection
Transcription     OpenAI Whisper          Best multilingual accuracy
Face Detection    MediaPipe (BlazeFace)   Lightweight, real-time, no GPU required
Video Processing  FFmpeg + OpenCV         Industry standard, battle-tested
Video Rendering   Remotion                React-based, programmatic, template-friendly
Validation        Zod                     Runtime + TypeScript safety
Monorepo          Turborepo + pnpm        Fast builds, shared packages

What I Learned

1. Engineer + AI = Multiplied Output

This project would have taken me months working solo. With Claude as a pair programmer, I went from idea to working prototype significantly faster. Not because the AI wrote everything — but because it accelerated the tedious parts (boilerplate, FFmpeg flags, Remotion configuration) so I could focus on architecture and product decisions.

2. Build Your Own Tools

We increasingly depend on external SaaS for everything. Sometimes the best way to learn is to build the tool yourself. You'll understand video processing, ML pipelines, and rendering engines at a depth that no tutorial can give you.

3. Microservices Aren't Just for Big Teams

Even for a solo project, separating Python (video/ML), Remotion (rendering), and Next.js (UI/orchestration) kept each piece simple and focused. When I needed to change the face detection algorithm, I didn't touch a single line of frontend code.

4. Time Formats Will Haunt You

If you're building anything with subtitles: pick one time format (seconds or milliseconds) and stick with it. Document your conversions. Test the boundaries. Your future self will thank you.


What's Next

This is still a personal project, but its feature set is already quite powerful:

  • 9 professional caption templates
  • Multi-aspect ratio support (9:16, 1:1, 4:5, 16:9)
  • Brand kit customization (logo, colors, fonts)
  • Silence and filler word auto-cleanup
  • SRT/VTT subtitle export
  • YouTube URL support
  • Multi-language transcription

Some ideas on the roadmap:

  • Batch processing for multiple clips
  • Custom template builder (visual editor)
  • Cloud deployment with render queue scaling
  • Speaker diarization (who said what)

I'd love your feedback:

  • What feature would you want to see?
  • Would you use this for your own content?
  • What would you improve?

PRs, issues, and stars are all welcome.

GitHub: github.com/el-frontend/video-wizard


Built with Next.js, Python, Remotion, and a lot of help from Claude.
