DEV Community

QAYS KADHIM

I Built an AI Video Ad Generator with Claude + MCP — Here's the Architecture

I wanted to see what happens when you give Claude real tools — not a weather API, not a todo app — but image generation, voice synthesis, video composition, and quality grading. Could it orchestrate a full creative pipeline from a single prompt?

The result is AdVideo Creator: an open-source CLI where you type "create a 15-second TikTok ad for artisan coffee" and get back a finished .mp4 file. Script, images, voiceover, music, transitions, subtitles — all generated and composed automatically.

Here's how it works under the hood.


The Problem: Claude Has No Hands

Claude can write an excellent marketing script. Give it a product, a target audience, and a tone — it'll produce a hook, emotional beats, and a call to action that actually works.

Then what?

You still need images. A voiceover. Background music. Video editing. Platform-specific export. And if the script doesn't fit the timing after you lay it over the visuals, you go back to Claude, ask for a rewrite, and start the cycle again.

This is the gap between "AI chatbot" and "AI application." Claude can think about your ad, but it can't make it. It has no hands.

Giving Claude Hands with MCP

The Model Context Protocol solves this. MCP is an open protocol that defines how AI models discover and use external tools. Think of it like HTTP but for AI capabilities — a standardized way for a client (the AI) and a server (the tools) to talk to each other.

The architecture is simple:

┌─────────────────────────────────────┐
│           CLIENT (Python)           │
│  User ←→ Claude API ←→ Tool Router │
└──────────────┬──────────────────────┘
               │ stdio (JSON-RPC)
┌──────────────┴──────────────────────┐
│           MCP SERVER (Python)       │
│  Image │ Voice │ Video │ Grading   │
│  Stock │ Brand │ Cache │ System    │
└─────────────────────────────────────┘

The client handles conversation with Claude. The server handles doing things — generating images, producing voiceover, composing video. They communicate through stdio using the MCP protocol.

Claude never talks to the server directly. The client is always the intermediary: Claude decides what tools to call, the client routes those calls to the server, and the server executes them.
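The routing step is conceptually tiny. Here is a minimal sketch of the client-side dispatch, assuming hypothetical handler names (these are stand-ins, not the project's actual identifiers): Claude emits a tool-use block, the client looks the tool up, and forwards the call.

```python
# Sketch of the client's tool router. Claude emits a tool call as
# {"name": ..., "input": {...}}; the client dispatches it to the
# matching server-side handler and returns the result to Claude.

def route_tool_call(handlers, tool_call):
    """Dispatch one of Claude's tool calls to its handler."""
    name = tool_call["name"]
    if name not in handlers:
        return {"error": f"unknown tool: {name}"}
    return handlers[name](**tool_call["input"])

# Illustrative stand-ins for real MCP server tools:
handlers = {
    "create_project": lambda name, platform, duration: {"project_id": f"{name}-001"},
    "export_video": lambda project_id, platform: {"path": f"{project_id}.mp4"},
}

result = route_tool_call(handlers, {
    "name": "create_project",
    "input": {"name": "coffee-ad", "platform": "tiktok", "duration": 15},
})
```

In the real system the handler side is a `session.call_tool(...)` over stdio rather than a local lambda, but the dispatch shape is the same.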

The 15-Step Pipeline

When you ask for an ad, Claude doesn't just run one tool. It orchestrates a 15-step pipeline, calling different tools at each stage:

Brief → Template → Project → Script → Grade → Iterate → Save
  → Images (Gate 1) → Voiceover (Gate 2) → Music (Gate 3)
  → Consistency (Gate 4) → Compose (Gate 5)
  → Subtitles → Export → Deliver

The critical thing: Claude decides the order, not the code. There's no hardcoded workflow. Claude sees all 45 tools and their descriptions, and it figures out which ones to call and when. The system prompt gives it a recommended pipeline, but Claude adapts — if the user imports their own images, it skips image generation. If they want stock footage instead of AI images, it searches Pexels.

Here's what a real session looks like behind the scenes:

User: "Create a 15s TikTok ad for artisan coffee beans"

Claude → create_project("coffee-ad", "tiktok", 15)
Claude → get_ad_template("problem-agitate-solve")
Claude → save_script(project_id, script_json)
Claude → search_stock_video("tired person morning")
Claude → use_stock_video(project_id, scene_0, video_id)
Claude → generate_scene_image(project_id, 1, "coffee bag close-up...")
Claude → evaluate_scene_image(project_id, 1)        ← Quality Gate
Claude → generate_scene_image(project_id, 2, "person smiling...")
Claude → evaluate_scene_image(project_id, 2)        ← Quality Gate
Claude → generate_voiceover(project_id, script_text)
Claude → evaluate_voiceover(project_id)              ← Quality Gate
Claude → generate_background_music(project_id, "energetic")
Claude → evaluate_background_music(project_id)       ← Quality Gate
Claude → evaluate_asset_consistency(project_id)       ← Quality Gate
Claude → compose_video(project_id, timeline)
Claude → evaluate_composition(project_id)             ← Quality Gate
Claude → add_subtitles(project_id, "word_highlight")
Claude → export_video(project_id, "tiktok")

That's ~18 tool calls from a single user message. Each one goes through the MCP protocol: Claude emits a tool call → client routes to server → server executes → result goes back to Claude → Claude decides what's next.

The Part I'm Most Proud Of: Quality Gates

Here's where it gets interesting. Most AI pipelines generate output and hope for the best. AdVideo Creator has 5 quality gates that grade every generated asset automatically:

  • Scene Image: CLIP similarity to prompt, safe-zone compliance, framing (pass ≥ 7.0/10)
  • Voiceover: Whisper transcription vs. script, WPM pacing, duration fit (pass ≥ 7.5/10)
  • Background Music: BPM, duration match, loop quality, mix compatibility (pass ≥ 7.0/10)
  • Cross-Asset Consistency: color palette coherence, pacing alignment, energy match (pass ≥ 6.5/10)
  • Final Composition: duration accuracy, audio balance, platform spec compliance (pass ≥ 7.5/10)

When an asset fails a gate, Claude retries — but not randomly. The system follows a drift prevention rule: always retry from the original parameters with a targeted fix, never modify the previous retry's parameters. This prevents the common problem where each retry drifts further from the creative direction.

For images, the fix is additive — append a composition hint like "leave center space for text." For voiceover, it's subtractive — shorten the text if pacing is too fast. For music, it's a swap — try a different mood keyword. For consistency, it's surgical — only regenerate the outlier assets.
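The drift-prevention rule is easy to state in code. A minimal sketch, assuming illustrative function and field names: every retry copies the original parameters and applies exactly one targeted fix, so fixes never compound across attempts.

```python
# Drift prevention: retries always start from the ORIGINAL parameters
# plus one targeted fix, never from the previous retry's parameters.

def retry_with_fix(original_params, fix):
    """Build fresh retry parameters: original + one additive fix."""
    params = dict(original_params)                     # copy, never mutate
    params["prompt"] = params["prompt"] + ", " + fix   # additive fix (image case)
    return params

original = {"prompt": "coffee bag close-up", "seed": 42}
attempt_1 = retry_with_fix(original, "leave center space for text")
attempt_2 = retry_with_fix(original, "wider framing")  # again from ORIGINAL
```

Note that `attempt_2` does not carry `attempt_1`'s fix; that is exactly what keeps retry 3 from looking nothing like the brief.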

The graders themselves use real signal processing:

  • Image grading: CLIP similarity score between the prompt and generated image, plus safe-zone compliance checking that important content isn't cut off at platform edges
  • Voiceover grading: Whisper transcription compared against the original script text, words-per-minute checking against language-specific ranges (English: 130-170 WPM, Arabic: 100-140 WPM)
  • Music grading: librosa for BPM extraction, pydub for loudness analysis and loop-point detection
  • Consistency grading: K-means clustering on color palettes across all scene images, BPM-to-pacing correlation
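The WPM check in particular is plain arithmetic. A sketch using the thresholds from the article (the function name is illustrative):

```python
# Words-per-minute pacing check against language-specific ranges.
WPM_RANGES = {"en": (130, 170), "ar": (100, 140)}

def check_pacing(transcript: str, duration_s: float, lang: str = "en"):
    """Return (wpm, verdict) for a voiceover transcript."""
    wpm = len(transcript.split()) / duration_s * 60
    lo, hi = WPM_RANGES[lang]
    if wpm < lo:
        return wpm, "too slow"
    if wpm > hi:
        return wpm, "too fast"
    return wpm, "ok"

# 36 words over 15 seconds -> 144 WPM, inside the English range
wpm, verdict = check_pacing("rich artisan beans " * 12, 15.0, "en")
```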

Script Self-Grading

Before any assets are generated, Claude grades its own script on 6 marketing criteria:

  • Hook Strength: 25%
  • Emotional Appeal: 20%
  • CTA Clarity: 20%
  • Audience Targeting: 15%
  • Pacing & Flow: 10%
  • Memorability: 10%

Scripts must score 8.0/10 or higher. If they don't, Claude identifies the weakest criterion and rewrites targeting that specific weakness — up to 3 iterations. This means the script is already strong before the expensive image and voice generation starts.
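The scoring itself is a weighted average over the six criteria, and "identify the weakest criterion" is just an argmin. A sketch with the weights from the rubric above (key names are mine, not the repo's):

```python
# Weighted script score plus weakest-criterion selection for the rewrite.
WEIGHTS = {
    "hook_strength": 0.25,
    "emotional_appeal": 0.20,
    "cta_clarity": 0.20,
    "audience_targeting": 0.15,
    "pacing_flow": 0.10,
    "memorability": 0.10,
}

def script_score(scores: dict) -> float:
    return sum(scores[k] * w for k, w in WEIGHTS.items())

def weakest_criterion(scores: dict) -> str:
    # The rewrite targets the lowest-scoring criterion
    return min(WEIGHTS, key=lambda k: scores[k])

scores = {"hook_strength": 9, "emotional_appeal": 8, "cta_clarity": 7,
          "audience_targeting": 8, "pacing_flow": 9, "memorability": 8}
total = script_score(scores)  # 2.25 + 1.6 + 1.4 + 1.2 + 0.9 + 0.8 = 8.15
```

This script clears the 8.0 bar, but only barely, and a rewrite pass would target the CTA.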

The grading rubric lives in the MCP server as a resource (config://grading-rubric), not hardcoded in the prompt. Claude reads it at runtime. This means you can modify the rubric without touching any code.

8 Ad Templates

Claude doesn't write scripts from scratch — it uses proven frameworks:

  • Problem-Agitate-Solve — Hook with pain point, amplify the problem, reveal the solution
  • Before/After — Show the transformation
  • Testimonial — Social proof format
  • Product Demo — Feature showcase
  • Trend Hijack — Ride a current trend
  • Countdown/Urgency — Limited time offers
  • Storytelling — Mini narrative arc
  • UGC Style — Raw, authentic feel

Each template defines a scene structure — how many scenes, what each scene should contain, where the hook goes, where the CTA lands. Claude selects the best template for the product type and follows its structure while adapting the content.
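As a toy illustration, a template's scene structure might look like this (the field names are assumptions on my part, not the repo's actual schema):

```python
# Hypothetical shape of one ad template: ordered scenes with a role,
# a content beat, and a share of the total runtime.
PROBLEM_AGITATE_SOLVE = {
    "scenes": [
        {"role": "hook",    "beat": "show the pain point",  "share": 0.25},
        {"role": "agitate", "beat": "amplify the problem",  "share": 0.35},
        {"role": "solve",   "beat": "reveal product + CTA", "share": 0.40},
    ],
}

def scene_durations(template, total_s: float):
    """Split the ad's total duration across scenes by share."""
    return [round(s["share"] * total_s, 1) for s in template["scenes"]]

durations = scene_durations(PROBLEM_AGITATE_SOLVE, 15.0)
```

Claude fills in the creative content per scene while the structure pins down where the hook and CTA land.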

Multi-Backend Architecture

The tool has tiered fallbacks for each capability:

Image generation: Replicate (Flux Schnell, ~1-2s, ~$0.003/image) → HuggingFace (free, ~3-5s) → Local SDXL (free, requires GPU)

Voice synthesis: ElevenLabs (ultra-natural, ~$0.06/ad) → OpenAI TTS (~$0.003/ad)

Stock video: Pexels API (free, 200 req/hour)

The factory pattern makes this transparent — create_image_engine() checks which API keys are available and returns the best backend. Add a new key to .env and the entire pipeline upgrades automatically. Remove it and it gracefully falls back.
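The factory is essentially a cascade of key checks. A minimal sketch, assuming conventional env var names and using strings as stand-ins for the real engine objects:

```python
import os

# Tiered-fallback factory: return the best backend available given
# which API keys are present. Real code would return engine instances.
def create_image_engine(env=os.environ):
    if env.get("REPLICATE_API_TOKEN"):
        return "replicate-flux-schnell"   # fastest paid tier
    if env.get("HF_TOKEN"):
        return "huggingface"              # free hosted tier
    return "local-sdxl"                   # free, needs a GPU

engine = create_image_engine({"HF_TOKEN": "hf_xxx"})
```

Because callers only ever see the factory, adding a key to `.env` upgrades the whole pipeline without any call-site changes.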

The minimum setup is just an Anthropic API key. Everything else is optional. You can generate a complete ad for as little as $0.01 (Anthropic only, no images/voice) or $0.10-$0.15 with all premium backends.

Multilingual: Arabic RTL Support

This was the hardest engineering challenge. The pipeline supports full Arabic ads with:

  • RTL text rendering — Pillow with the Raqm layout engine (which wraps HarfBuzz and FriBiDi) and the Noto Sans Arabic font, plus automatic text reshaping
  • Per-language voice defaults — Arabic uses ElevenLabs eleven_multilingual_v2 with stability tuned to 0.50 (vs 0.35 default) for more consistent Arabic pronunciation
  • Language-aware grading — Arabic has different WPM ranges (100-140 vs English's 130-170), and the voiceover grader normalizes Arabic text (strips tashkeel, normalizes hamza) before comparing against Whisper transcription
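The normalization step can be sketched with two regex passes; this is a simplification of what a real normalizer does (tashkeel marks live at U+064B–U+0652, and hamza-carrying alef forms are folded into bare alef):

```python
import re

# Strip Arabic diacritics (tashkeel) and normalize hamza-carrying
# alef variants before comparing script text to a Whisper transcript.
TASHKEEL = re.compile(r"[\u064B-\u0652]")

def normalize_arabic(text: str) -> str:
    text = TASHKEEL.sub("", text)                            # drop diacritics
    text = re.sub(r"[\u0622\u0623\u0625]", "\u0627", text)   # آ/أ/إ -> ا
    return text
```

Without this, a perfectly read voiceover fails the gate simply because Whisper omits diacritics the script included.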

Not many AI tools handle RTL correctly. Getting Arabic subtitles to render properly over video, with the right font and correct text direction, required diving deep into Pillow's text rendering internals.

The MCP Server: 45 Tools, 12 Resources

The server exposes everything through MCP's three primitives:

Tools (45) — actions Claude can take. Project management, image generation, voice synthesis, video composition, quality grading, brand profiles, stock video search, asset import, cache management.

Resources (12) — read-only data Claude can access. Platform specs, style presets, grading rubrics, pricing info, voice catalogs, ad templates.

Prompts (8) — reusable instruction templates. The main system prompt with the 15-step workflow, the script grader, the asset grader with drift prevention rules.

The key design decision: everything is discoverable at runtime. When the client connects, it calls tools/list and gets back all 45 tools with their schemas. It calls resources/list and gets all 12 resources. Claude sees everything and decides what to use. Add a new tool to the server? Claude picks it up on the next connection.
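The discovery handshake is ordinary JSON-RPC 2.0 over stdio. A sketch of the wire messages, with the response abbreviated to one tool (real entries carry full JSON Schemas):

```python
# MCP tool discovery as raw JSON-RPC 2.0 messages.
request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

response = {
    "jsonrpc": "2.0", "id": 1,
    "result": {"tools": [
        {"name": "create_project",
         "description": "Create a new ad project for a target platform",
         "inputSchema": {"type": "object",
                         "properties": {"name": {"type": "string"}}}},
        # ...44 more tools in the real server
    ]},
}

# The client hands these schemas straight to Claude as its tool list
tool_names = [t["name"] for t in response["result"]["tools"]]
```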

What I Learned

Building this taught me patterns that apply to any AI application:

Claude is better at orchestration than you'd expect. Given clear tool descriptions and a recommended workflow, Claude makes remarkably good decisions about which tools to call and in what order. The key is writing descriptive tool descriptions — Claude reads them carefully.

Quality gates change everything. Without them, you get "generate and pray." With them, you get consistent, predictable output. The cost overhead is small (~5-10% of total pipeline cost for grading) and the quality improvement is significant.

Drift prevention matters. When retrying failed generations, always go back to the original parameters and apply a targeted fix. Never modify the previous retry's output. This single rule eliminated most of our "the 3rd retry looks nothing like what was requested" problems.

MCP's separation of concerns pays off. Building the server independently from the client made development much faster. I could test every tool with MCP Inspector (a web UI) without making a single Claude API call. And the same server works with Claude Desktop, no modifications needed.

Try It Yourself

AdVideo Creator is MIT licensed and open source:

GitHub: github.com/UrNas/advideo-creator

Minimum setup: Python 3.12+, FFmpeg, and an Anthropic API key. That's it.

git clone https://github.com/UrNas/advideo-creator.git
cd advideo-creator
uv sync
cp .env.example .env    # add your ANTHROPIC_API_KEY
uv run python main.py

Add more API keys (Replicate, ElevenLabs, OpenAI, Pexels) to unlock premium features.


If you're interested in learning how to build this kind of AI application from scratch — tool design, agentic loops, quality gates, engine abstractions — I'm working on a full course covering every module in detail. Star the repo and follow for updates.

Top comments (4)

Vic Chen

Using MCP as the glue between Claude and external tools is a smart architectural choice — it keeps the AI layer clean and makes the pipeline more composable. The separation of script generation (Claude) from media execution (FFmpeg/ElevenLabs via MCP tools) is the right call for maintainability. Curious how you handle prompt engineering for the script phase: do you give Claude explicit constraints on duration/pacing, or let it be loose and clip in post? Also — for ad-quality output, have you experimented with ControlNet-style keyframe consistency, or is the current approach fully sequential? The architecture is solid. Would love to see a follow-up post benchmarking output quality across different ad formats (product demo vs. brand awareness vs. CTA-heavy).

QAYS KADHIM

For the script phase, Claude gets explicit constraints: duration target, pacing rules, and 6 marketing criteria (hook strength, emotional arc, CTA clarity, etc.). The script actually self-grades against these criteria and rewrites until it hits an 8.0/10 threshold. So it's structured, not loose.

For visual consistency, the current approach is sequential, with a shared style prompt that gets locked in early and passed to every image generation call. Haven't explored ControlNet-style keyframe consistency yet; that's a solid idea worth experimenting with, especially for longer formats.

And a benchmark post across ad formats is a great suggestion. The template system already supports product demos, brand awareness, and CTA-heavy formats with different pacing profiles, so there's real data to compare. Might be the next write-up. Thanks for the thoughtful feedback.

klement Gunndu

The MCP architecture giving Claude direct tool access for image gen, voice synthesis, and video composition is a clever way to avoid the usual multi-API orchestration mess. Smart design choice.

QAYS KADHIM

Thanks! That was exactly the thinking: instead of juggling API clients and custom orchestration code, MCP lets Claude call tools directly through a standardized protocol. One server, 45 tools, and Claude handles the sequencing. It made the whole pipeline surprisingly simple to extend.