DEV Community

QAYS KADHIM

I Built an AI Video Ad Generator with Claude + MCP — Here's the Architecture

I wanted to see what happens when you give Claude real tools — not a weather API, not a todo app — but image generation, voice synthesis, video composition, and quality grading. Could it orchestrate a full creative pipeline from a single prompt?

The result is AdVideo Creator: an open-source CLI where you type "create a 15-second TikTok ad for artisan coffee" and get back a finished .mp4 file. Script, images, voiceover, music, transitions, subtitles — all generated and composed automatically.

Here's how it works under the hood.


The Problem: Claude Has No Hands

Claude can write an excellent marketing script. Give it a product, a target audience, and a tone — it'll produce a hook, emotional beats, and a call to action that actually works.

Then what?

You still need images. A voiceover. Background music. Video editing. Platform-specific export. And if the script doesn't fit the timing after you lay it over the visuals, you go back to Claude, ask for a rewrite, and start the cycle again.

This is the gap between "AI chatbot" and "AI application." Claude can think about your ad, but it can't make it. It has no hands.

Giving Claude Hands with MCP

The Model Context Protocol solves this. MCP is an open protocol that defines how AI models discover and use external tools. Think of it like HTTP but for AI capabilities — a standardized way for a client (the AI) and a server (the tools) to talk to each other.

The architecture is simple:

┌─────────────────────────────────────┐
│           CLIENT (Python)           │
│  User ←→ Claude API ←→ Tool Router │
└──────────────┬──────────────────────┘
               │ stdio (JSON-RPC)
┌──────────────┴──────────────────────┐
│           MCP SERVER (Python)       │
│  Image │ Voice │ Video │ Grading   │
│  Stock │ Brand │ Cache │ System    │
└─────────────────────────────────────┘

The client handles conversation with Claude. The server handles doing things — generating images, producing voiceover, composing video. They communicate through stdio using the MCP protocol.

Claude never talks to the server directly. The client is always the intermediary: Claude decides what tools to call, the client routes those calls to the server, and the server executes them.
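The routing step is conceptually tiny. Here is a minimal sketch of the client-side dispatch, assuming hypothetical handler names (these are stand-ins, not the project's actual identifiers): Claude emits a tool-use block, the client looks the tool up, and forwards the call.

```python
# Sketch of the client's tool router. Claude emits a tool call as
# {"name": ..., "input": {...}}; the client dispatches it to the
# matching server-side handler and returns the result to Claude.

def route_tool_call(handlers, tool_call):
    """Dispatch one of Claude's tool calls to its handler."""
    name = tool_call["name"]
    if name not in handlers:
        return {"error": f"unknown tool: {name}"}
    return handlers[name](**tool_call["input"])

# Illustrative stand-ins for real MCP server tools:
handlers = {
    "create_project": lambda name, platform, duration: {"project_id": f"{name}-001"},
    "export_video": lambda project_id, platform: {"path": f"{project_id}.mp4"},
}

result = route_tool_call(handlers, {
    "name": "create_project",
    "input": {"name": "coffee-ad", "platform": "tiktok", "duration": 15},
})
```

In the real system the handler side is a `session.call_tool(...)` over stdio rather than a local lambda, but the dispatch shape is the same.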

The 15-Step Pipeline

When you ask for an ad, Claude doesn't just run one tool. It orchestrates a 15-step pipeline, calling different tools at each stage:

Brief → Template → Project → Script → Grade → Iterate → Save
  → Images (Gate 1) → Voiceover (Gate 2) → Music (Gate 3)
  → Consistency (Gate 4) → Compose (Gate 5)
  → Subtitles → Export → Deliver

The critical thing: Claude decides the order, not the code. There's no hardcoded workflow. Claude sees all 45 tools and their descriptions, and it figures out which ones to call and when. The system prompt gives it a recommended pipeline, but Claude adapts — if the user imports their own images, it skips image generation. If they want stock footage instead of AI images, it searches Pexels.

Here's what a real session looks like behind the scenes:

User: "Create a 15s TikTok ad for artisan coffee beans"

Claude → create_project("coffee-ad", "tiktok", 15)
Claude → get_ad_template("problem-agitate-solve")
Claude → save_script(project_id, script_json)
Claude → search_stock_video("tired person morning")
Claude → use_stock_video(project_id, scene_0, video_id)
Claude → generate_scene_image(project_id, 1, "coffee bag close-up...")
Claude → evaluate_scene_image(project_id, 1)        ← Quality Gate
Claude → generate_scene_image(project_id, 2, "person smiling...")
Claude → evaluate_scene_image(project_id, 2)        ← Quality Gate
Claude → generate_voiceover(project_id, script_text)
Claude → evaluate_voiceover(project_id)              ← Quality Gate
Claude → generate_background_music(project_id, "energetic")
Claude → evaluate_background_music(project_id)       ← Quality Gate
Claude → evaluate_asset_consistency(project_id)       ← Quality Gate
Claude → compose_video(project_id, timeline)
Claude → evaluate_composition(project_id)             ← Quality Gate
Claude → add_subtitles(project_id, "word_highlight")
Claude → export_video(project_id, "tiktok")

That's ~18 tool calls from a single user message. Each one goes through the MCP protocol: Claude emits a tool call → client routes to server → server executes → result goes back to Claude → Claude decides what's next.

The Part I'm Most Proud Of: Quality Gates

Here's where it gets interesting. Most AI pipelines generate output and hope for the best. AdVideo Creator has 5 quality gates that grade every generated asset automatically:

  • Scene Image: CLIP similarity to prompt, safe-zone compliance, framing (pass ≥ 7.0/10)
  • Voiceover: Whisper transcription vs. script, WPM pacing, duration fit (pass ≥ 7.5/10)
  • Background Music: BPM, duration match, loop quality, mix compatibility (pass ≥ 7.0/10)
  • Cross-Asset Consistency: color palette coherence, pacing alignment, energy match (pass ≥ 6.5/10)
  • Final Composition: duration accuracy, audio balance, platform spec compliance (pass ≥ 7.5/10)

When an asset fails a gate, Claude retries — but not randomly. The system follows a drift prevention rule: always retry from the original parameters with a targeted fix, never modify the previous retry's parameters. This prevents the common problem where each retry drifts further from the creative direction.

For images, the fix is additive — append a composition hint like "leave center space for text." For voiceover, it's subtractive — shorten the text if pacing is too fast. For music, it's a swap — try a different mood keyword. For consistency, it's surgical — only regenerate the outlier assets.
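The drift-prevention rule is easy to state in code. A minimal sketch, assuming illustrative function and field names: every retry copies the original parameters and applies exactly one targeted fix, so fixes never compound across attempts.

```python
# Drift prevention: retries always start from the ORIGINAL parameters
# plus one targeted fix, never from the previous retry's parameters.

def retry_with_fix(original_params, fix):
    """Build fresh retry parameters: original + one additive fix."""
    params = dict(original_params)                     # copy, never mutate
    params["prompt"] = params["prompt"] + ", " + fix   # additive fix (image case)
    return params

original = {"prompt": "coffee bag close-up", "seed": 42}
attempt_1 = retry_with_fix(original, "leave center space for text")
attempt_2 = retry_with_fix(original, "wider framing")  # again from ORIGINAL
```

Note that `attempt_2` does not carry `attempt_1`'s fix; that is exactly what keeps retry 3 from looking nothing like the brief.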

The graders themselves use real signal processing:

  • Image grading: CLIP similarity score between the prompt and generated image, plus safe-zone compliance checking that important content isn't cut off at platform edges
  • Voiceover grading: Whisper transcription compared against the original script text, words-per-minute checking against language-specific ranges (English: 130-170 WPM, Arabic: 100-140 WPM)
  • Music grading: librosa for BPM extraction, pydub for loudness analysis and loop-point detection
  • Consistency grading: K-means clustering on color palettes across all scene images, BPM-to-pacing correlation
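The WPM check in particular is plain arithmetic. A sketch using the thresholds from the article (the function name is illustrative):

```python
# Words-per-minute pacing check against language-specific ranges.
WPM_RANGES = {"en": (130, 170), "ar": (100, 140)}

def check_pacing(transcript: str, duration_s: float, lang: str = "en"):
    """Return (wpm, verdict) for a voiceover transcript."""
    wpm = len(transcript.split()) / duration_s * 60
    lo, hi = WPM_RANGES[lang]
    if wpm < lo:
        return wpm, "too slow"
    if wpm > hi:
        return wpm, "too fast"
    return wpm, "ok"

# 36 words over 15 seconds -> 144 WPM, inside the English range
wpm, verdict = check_pacing("rich artisan beans " * 12, 15.0, "en")
```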

Script Self-Grading

Before any assets are generated, Claude grades its own script on 6 marketing criteria:

  • Hook Strength: 25%
  • Emotional Appeal: 20%
  • CTA Clarity: 20%
  • Audience Targeting: 15%
  • Pacing & Flow: 10%
  • Memorability: 10%

Scripts must score 8.0/10 or higher. If they don't, Claude identifies the weakest criterion and rewrites targeting that specific weakness — up to 3 iterations. This means the script is already strong before the expensive image and voice generation starts.
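The scoring itself is a weighted average over the six criteria, and "identify the weakest criterion" is just an argmin. A sketch with the weights from the rubric above (key names are mine, not the repo's):

```python
# Weighted script score plus weakest-criterion selection for the rewrite.
WEIGHTS = {
    "hook_strength": 0.25,
    "emotional_appeal": 0.20,
    "cta_clarity": 0.20,
    "audience_targeting": 0.15,
    "pacing_flow": 0.10,
    "memorability": 0.10,
}

def script_score(scores: dict) -> float:
    return sum(scores[k] * w for k, w in WEIGHTS.items())

def weakest_criterion(scores: dict) -> str:
    # The rewrite targets the lowest-scoring criterion
    return min(WEIGHTS, key=lambda k: scores[k])

scores = {"hook_strength": 9, "emotional_appeal": 8, "cta_clarity": 7,
          "audience_targeting": 8, "pacing_flow": 9, "memorability": 8}
total = script_score(scores)  # 2.25 + 1.6 + 1.4 + 1.2 + 0.9 + 0.8 = 8.15
```

This script clears the 8.0 bar, but only barely, and a rewrite pass would target the CTA.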

The grading rubric lives in the MCP server as a resource (config://grading-rubric), not hardcoded in the prompt. Claude reads it at runtime. This means you can modify the rubric without touching any code.

8 Ad Templates

Claude doesn't write scripts from scratch — it uses proven frameworks:

  • Problem-Agitate-Solve — Hook with pain point, amplify the problem, reveal the solution
  • Before/After — Show the transformation
  • Testimonial — Social proof format
  • Product Demo — Feature showcase
  • Trend Hijack — Ride a current trend
  • Countdown/Urgency — Limited time offers
  • Storytelling — Mini narrative arc
  • UGC Style — Raw, authentic feel

Each template defines a scene structure — how many scenes, what each scene should contain, where the hook goes, where the CTA lands. Claude selects the best template for the product type and follows its structure while adapting the content.
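As a toy illustration, a template's scene structure might look like this (the field names are assumptions on my part, not the repo's actual schema):

```python
# Hypothetical shape of one ad template: ordered scenes with a role,
# a content beat, and a share of the total runtime.
PROBLEM_AGITATE_SOLVE = {
    "scenes": [
        {"role": "hook",    "beat": "show the pain point",  "share": 0.25},
        {"role": "agitate", "beat": "amplify the problem",  "share": 0.35},
        {"role": "solve",   "beat": "reveal product + CTA", "share": 0.40},
    ],
}

def scene_durations(template, total_s: float):
    """Split the ad's total duration across scenes by share."""
    return [round(s["share"] * total_s, 1) for s in template["scenes"]]

durations = scene_durations(PROBLEM_AGITATE_SOLVE, 15.0)
```

Claude fills in the creative content per scene while the structure pins down where the hook and CTA land.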

Multi-Backend Architecture

The tool has tiered fallbacks for each capability:

Image generation: Replicate (Flux Schnell, ~1-2s, ~$0.003/image) → HuggingFace (free, ~3-5s) → Local SDXL (free, requires GPU)

Voice synthesis: ElevenLabs (ultra-natural, ~$0.06/ad) → OpenAI TTS (~$0.003/ad)

Stock video: Pexels API (free, 200 req/hour)

The factory pattern makes this transparent — create_image_engine() checks which API keys are available and returns the best backend. Add a new key to .env and the entire pipeline upgrades automatically. Remove it and it gracefully falls back.
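The factory is essentially a cascade of key checks. A minimal sketch, assuming conventional env var names and using strings as stand-ins for the real engine objects:

```python
import os

# Tiered-fallback factory: return the best backend available given
# which API keys are present. Real code would return engine instances.
def create_image_engine(env=os.environ):
    if env.get("REPLICATE_API_TOKEN"):
        return "replicate-flux-schnell"   # fastest paid tier
    if env.get("HF_TOKEN"):
        return "huggingface"              # free hosted tier
    return "local-sdxl"                   # free, needs a GPU

engine = create_image_engine({"HF_TOKEN": "hf_xxx"})
```

Because callers only ever see the factory, adding a key to `.env` upgrades the whole pipeline without any call-site changes.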

The minimum setup is just an Anthropic API key. Everything else is optional. You can generate a complete ad for as little as $0.01 (Anthropic only, no images/voice) or $0.10-$0.15 with all premium backends.

Multilingual: Arabic RTL Support

This was the hardest engineering challenge. The pipeline supports full Arabic ads with:

  • RTL text rendering — Pillow with the Raqm layout engine (which wraps HarfBuzz and FriBiDi) and the Noto Sans Arabic font, plus automatic text reshaping
  • Per-language voice defaults — Arabic uses ElevenLabs eleven_multilingual_v2 with stability tuned to 0.50 (vs 0.35 default) for more consistent Arabic pronunciation
  • Language-aware grading — Arabic has different WPM ranges (100-140 vs English's 130-170), and the voiceover grader normalizes Arabic text (strips tashkeel, normalizes hamza) before comparing against Whisper transcription
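The normalization step can be sketched with two regex passes; this is a simplification of what a real normalizer does (tashkeel marks live at U+064B–U+0652, and hamza-carrying alef forms are folded into bare alef):

```python
import re

# Strip Arabic diacritics (tashkeel) and normalize hamza-carrying
# alef variants before comparing script text to a Whisper transcript.
TASHKEEL = re.compile(r"[\u064B-\u0652]")

def normalize_arabic(text: str) -> str:
    text = TASHKEEL.sub("", text)                            # drop diacritics
    text = re.sub(r"[\u0622\u0623\u0625]", "\u0627", text)   # آ/أ/إ -> ا
    return text
```

Without this, a perfectly read voiceover fails the gate simply because Whisper omits diacritics the script included.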

Not many AI tools handle RTL correctly. Getting Arabic subtitles to render properly over video, with the right font and correct text direction, required diving deep into Pillow's text rendering internals.

The MCP Server: 45 Tools, 12 Resources

The server exposes everything through MCP's three primitives:

Tools (45) — actions Claude can take. Project management, image generation, voice synthesis, video composition, quality grading, brand profiles, stock video search, asset import, cache management.

Resources (12) — read-only data Claude can access. Platform specs, style presets, grading rubrics, pricing info, voice catalogs, ad templates.

Prompts (8) — reusable instruction templates. The main system prompt with the 15-step workflow, the script grader, the asset grader with drift prevention rules.

The key design decision: everything is discoverable at runtime. When the client connects, it calls tools/list and gets back all 45 tools with their schemas. It calls resources/list and gets all 12 resources. Claude sees everything and decides what to use. Add a new tool to the server? Claude picks it up on the next connection.
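The discovery handshake is ordinary JSON-RPC 2.0 over stdio. A sketch of the wire messages, with the response abbreviated to one tool (real entries carry full JSON Schemas):

```python
# MCP tool discovery as raw JSON-RPC 2.0 messages.
request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

response = {
    "jsonrpc": "2.0", "id": 1,
    "result": {"tools": [
        {"name": "create_project",
         "description": "Create a new ad project for a target platform",
         "inputSchema": {"type": "object",
                         "properties": {"name": {"type": "string"}}}},
        # ...44 more tools in the real server
    ]},
}

# The client hands these schemas straight to Claude as its tool list
tool_names = [t["name"] for t in response["result"]["tools"]]
```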

What I Learned

Building this taught me patterns that apply to any AI application:

Claude is better at orchestration than you'd expect. Given clear tool descriptions and a recommended workflow, Claude makes remarkably good decisions about which tools to call and in what order. The key is writing descriptive tool descriptions — Claude reads them carefully.

Quality gates change everything. Without them, you get "generate and pray." With them, you get consistent, predictable output. The cost overhead is small (~5-10% of total pipeline cost for grading) and the quality improvement is significant.

Drift prevention matters. When retrying failed generations, always go back to the original parameters and apply a targeted fix. Never modify the previous retry's output. This single rule eliminated most of our "the 3rd retry looks nothing like what was requested" problems.

MCP's separation of concerns pays off. Building the server independently from the client made development much faster. I could test every tool with MCP Inspector (a web UI) without making a single Claude API call. And the same server works with Claude Desktop, no modifications needed.

Try It Yourself

AdVideo Creator is MIT licensed and open source:

GitHub: github.com/UrNas/advideo-creator

Minimum setup: Python 3.12+, FFmpeg, and an Anthropic API key. That's it.

git clone https://github.com/UrNas/advideo-creator.git
cd advideo-creator
uv sync
cp .env.example .env    # add your ANTHROPIC_API_KEY
uv run python main.py

Add more API keys (Replicate, ElevenLabs, OpenAI, Pexels) to unlock premium features.


If you're interested in learning how to build this kind of AI application from scratch — tool design, agentic loops, quality gates, engine abstractions — I'm working on a full course covering every module in detail. Star the repo and follow for updates.

Top comments (4)

Vic Chen

Using MCP as the glue between Claude and external tools is a smart architectural choice — it keeps the AI layer clean and makes the pipeline more composable. The separation of script generation (Claude) from media execution (FFmpeg/ElevenLabs via MCP tools) is the right call for maintainability. Curious how you handle prompt engineering for the script phase: do you give Claude explicit constraints on duration/pacing, or let it be loose and clip in post? Also — for ad-quality output, have you experimented with ControlNet-style keyframe consistency, or is the current approach fully sequential? The architecture is solid. Would love to see a follow-up post benchmarking output quality across different ad formats (product demo vs. brand awareness vs. CTA-heavy).

QAYS KADHIM

For the script phase, Claude gets explicit constraints: duration target, pacing rules, and 6 marketing criteria (hook strength, emotional arc, CTA clarity, etc.). The script actually self-grades against these criteria and rewrites until it hits an 8.0/10 threshold. So it's structured, not loose.

For visual consistency, the current approach is sequential, with a shared style prompt that gets locked in early and passed to every image generation call. Haven't explored ControlNet-style keyframe consistency yet; that's a solid idea worth experimenting with, especially for longer formats.

And a benchmark post across ad formats is a great suggestion. The template system already supports product demos, brand awareness, and CTA-heavy formats with different pacing profiles, so there's real data to compare. Might be the next write-up. Thanks for the thoughtful feedback.

klement Gunndu

The MCP architecture giving Claude direct tool access for image gen, voice synthesis, and video composition is a clever way to avoid the usual multi-API orchestration mess. Smart design choice.

QAYS KADHIM

Thanks! That was exactly the thinking: instead of juggling API clients and custom orchestration code, MCP lets Claude call tools directly through a standardized protocol. One server, 45 tools, and Claude handles the sequencing. It made the whole pipeline surprisingly simple to extend.