QAYS KADHIM

I Tested My AI Ad Generator on 3 Completely Different Ad Formats — Here's What Actually Happened

I recently open-sourced AdVideo Creator, a CLI tool that lets Claude generate complete video ads — script, images, voiceover, music, and final video — through a single prompt. In my first post, I walked through the architecture: 45 MCP tools, 5 quality gates, and a 15-step pipeline.

The response was great. But one comment stuck with me:

"Would love to see a follow-up post benchmarking output quality across different ad formats."

Fair point. Architecture posts are nice, but what actually comes out the other end? So I picked 3 very different ad scenarios, ran them through the full pipeline, and recorded everything — scores, retries, failures, and the final videos.

Here's what happened.


The 3 Tests

I deliberately chose formats that stress different parts of the pipeline:

|             | Product Demo                   | Storytelling                 | CTA / Urgency               |
|-------------|--------------------------------|------------------------------|-----------------------------|
| Product     | HydroSync (smart water bottle) | Ember & Oak (coffee roastery)| SkillSprint (online courses)|
| Template    | Product Demo (5 scenes)        | Storytelling (5 scenes)      | Countdown/Urgency (4 scenes)|
| Platform    | TikTok 1080×1920               | Instagram Reel 1080×1920     | Instagram Feed 1080×1080    |
| Duration    | 15 seconds                     | 30 seconds                   | 15 seconds                  |
| Image style | Photorealistic                 | Watercolor                   | Flat-design                 |
| Language    | English                        | English                      | Arabic (RTL)                |
| Voice       | ElevenLabs Elli                | ElevenLabs Rachel            | ElevenLabs Adam             |

Each test uses a different template, platform, aspect ratio, image style, and voice. The Arabic test also throws RTL text rendering into the mix.


Test 1: Product Demo — HydroSync Smart Water Bottle

The prompt:

Create a 15 second TikTok product demo ad for HydroSync — a smart water bottle that tracks your daily hydration and syncs with your phone app. Target audience is fitness-conscious millennials. Tone: energetic and modern.

What happened:

This was the smoothest run. The script passed on the first attempt at 8.05/10. Claude wrote a tight 5-scene structure: bold product reveal, two feature highlights (hydration tracking, phone sync), a lifestyle benefit shot, and a CTA.

Image generation was fast — all 5 scenes generated via Replicate Flux Schnell in about 2 seconds each. The photorealistic style produced clean, product-shot-style images that scored 9.88/10 average. Voiceover landed at 9.67/10 on the first try.

The final video exported at 14.4 seconds, 1080×1920, 12.9 MB. Hardware acceleration kicked in via Apple VideoToolbox.

The catch: The pipeline hit the 20-tool round limit before it could add subtitles or run the final composition scoring. The video still exported fine — it just skipped those last two steps.
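To make the trade-off concrete, here's a minimal sketch of how a tool-call budget could decide which remaining steps fit. The step names, per-step costs, and the greedy keep-what-fits strategy are my assumptions for illustration, not the pipeline's actual logic:

```python
# Hypothetical sketch of a tool-call budget. Step names, costs, and the
# greedy priority order are assumptions, not the actual pipeline API.
MAX_TOOL_ROUNDS = 20

def plan_remaining_steps(rounds_used, pending_steps):
    """Keep only the pending steps (in priority order) that still fit."""
    remaining = MAX_TOOL_ROUNDS - rounds_used
    kept = []
    for step, cost in pending_steps:
        if cost <= remaining:
            kept.append(step)
            remaining -= cost
    return kept

# With 17 of 20 rounds spent, export (2 calls) and a 1-call scoring pass
# fit, but 2-call subtitle generation gets dropped.
steps = [("export", 2), ("subtitles", 2), ("composition_score", 1)]
print(plan_remaining_steps(17, steps))  # ['export', 'composition_score']
```

Listing steps in priority order is what makes the greedy cut sensible: whatever matters most gets first claim on the remaining rounds.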

Verdict: Product demos are the tool's sweet spot. Clear features, simple structure, photorealistic images — everything lines up.


Test 2: Storytelling — Ember & Oak Coffee Roastery

The prompt:

Create a 30 second Instagram Reel storytelling ad for Ember & Oak, a small-batch coffee roastery that partners directly with farmers in Colombia. The story should follow a farmer's journey from harvest to your cup. Tone: warm, authentic, emotional.

What happened:

This is where the self-grading system proved its value.

The first script scored 7.7/10. The grading system flagged two specific problems: the hook was generic (7/10) and the CTA was weak (6/10). Claude rewrote the script. The new hook — a pattern interrupt about coffee traveling 3,000 miles — scored 9/10. The CTA got specific. Version 2 passed at 8.4/10.
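The write-grade-rewrite loop above can be sketched in a few lines. Everything here is illustrative: the 8.0 threshold, the component names, and the stub write/grade functions are assumptions, not the tool's actual API — the point is the shape of the feedback loop:

```python
# Minimal sketch of a self-grading loop. The threshold, component names,
# and stub functions are assumptions, not the tool's actual API.
def run_script_gate(write, grade, threshold=8.0, max_iters=3):
    """Write a script, grade it per component, rewrite until it passes."""
    feedback = None
    for attempt in range(1, max_iters + 1):
        script = write(feedback)
        grades = grade(script)                      # e.g. {"hook": 7.0, ...}
        score = sum(grades.values()) / len(grades)
        if score >= threshold:
            break
        # Feed the two weakest components back into the rewrite prompt.
        feedback = sorted(grades, key=grades.get)[:2]
    return script, round(score, 2), attempt

# Stub drafts mirroring the coffee test: v1 fails on hook and CTA, v2 passes.
drafts = iter([
    {"hook": 7.0, "scenes": 8.6, "cta": 6.0},
    {"hook": 9.0, "scenes": 8.6, "cta": 8.0},
])
script, score, attempts = run_script_gate(
    write=lambda feedback: next(drafts),
    grade=lambda s: s,   # stub grader: the "script" here is its own grades
)
print(score, attempts)   # 8.53 2
```

The key design choice is that the retry isn't blind: the rewrite prompt is seeded with the specific weak components, which is why a generic hook becomes a targeted fix rather than a random second draft.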

The watercolor image style was interesting. Four of the five scenes looked cohesive and atmospheric. Scene 2 (the discovery scene) scored lowest at 7.91 — slightly less watercolor consistency than the others. The average still held strong at 9.36/10.

Voiceover had a hiccup. The first attempt ran 32.79 seconds — almost 3 seconds over the 30-second target. The quality gate caught it, auto-shortened the text, and the retry came in at 29.95 seconds with a 9.0/10 score.
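A duration gate like that reduces to a short loop. This is a sketch under stated assumptions — the ±5% tolerance and the proportional word trim are mine, and `render` is a stub standing in for the real TTS call:

```python
# Sketch of a duration gate with auto-shorten. The tolerance and the
# proportional trim are assumptions; render is a stub for the TTS call.
def fit_to_duration(text, render, target_s, tolerance=0.05, max_tries=3):
    """Re-render progressively shorter text until it fits the target."""
    for _ in range(max_tries):
        duration = render(text)
        if duration <= target_s * (1 + tolerance):
            return text, duration
        # Trim words in proportion to the overshoot, then try again.
        keep = int(len(text.split()) * target_s / duration)
        text = " ".join(text.split()[:keep])
    return text, duration

# Stub TTS at 0.5 s per word: 66 words renders at 33.0 s, over the gate,
# so the text is trimmed to 60 words and re-renders at 30.0 s.
stub_tts = lambda t: 0.5 * len(t.split())
text, duration = fit_to_duration("word " * 66, stub_tts, target_s=30)
print(duration)  # 30.0
```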

This was the only test where the full pipeline completed — including subtitles and composition scoring (8.35/10). The final video landed at exactly 30.0 seconds, 27.5 MB.

Verdict: Storytelling ads need more iteration, but the quality gates handle it. The self-grading loop catching the weak hook is exactly what you want from an automated system.


Test 3: CTA / Urgency — SkillSprint Flash Sale (Arabic)

The prompt:

Create a 15 second Instagram Feed ad in Arabic for SkillSprint — an online learning platform running a 48-hour flash sale with 60% off all courses. Target audience: Arabic-speaking young professionals. Tone: urgent and exciting.

What happened:

This was the hardest test by design — Arabic RTL, urgency template, square format, flat-design style. I wanted to push the tool.

The script passed first try at 8.4/10. Urgency ads have a clear structure (limited offer → value → scarcity → CTA), and Claude wrote strong Arabic copy with the right energy.

Then the voiceover became a challenge. Attempt 1 came back at 21.69 seconds — over 6 seconds too long for a 15-second ad. The quality gate caught it and auto-shortened. Attempt 2 scored 7.24/10 — below the 7.5 threshold due to pacing issues. Attempt 3 finally passed at 7.55/10 with 14.35 seconds duration.

Three attempts for voiceover. That's the most retries across all tests.

The cross-asset consistency check scored 6.45/10, just below the 6.5 threshold, flagging color palette variations between the flat-design scenes. The pipeline marked the video as needing review but continued the export.
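A non-blocking gate like this might look something like the following sketch. The 6.5 threshold comes from the run above; the palette-drift metric (mean per-channel RGB distance from scene 1) is entirely my assumption about how such a score could be computed:

```python
# Hypothetical sketch of a non-blocking consistency gate: score how far
# each scene's average color drifts from scene 1, warn below threshold,
# but let the export continue. The distance metric is an assumption.
def consistency_score(scene_colors):
    """0-10 score from mean per-channel RGB drift against the first scene."""
    ref = scene_colors[0]
    drifts = [
        sum(abs(a - b) for a, b in zip(ref, c)) / (3 * 255)
        for c in scene_colors[1:]
    ]
    return round(10 * (1 - sum(drifts) / len(drifts)), 2)

def check_consistency(scene_colors, threshold=6.5):
    score = consistency_score(scene_colors)
    return score, "pass" if score >= threshold else "needs review"

# Flat-design scenes whose average colors drift apart fall below the gate.
score, status = check_consistency([(30, 60, 200), (200, 200, 40), (230, 60, 30)])
print(score, status)  # 4.51 needs review
```

Making the gate warn instead of block matches the behavior in this test: a borderline score gets surfaced to a human without killing an otherwise exportable video.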

The final video: 14.4 seconds, 1080×1080, 6.4 MB. RTL text overlays rendered correctly with lang: ar. Arabic metadata and hashtags were generated automatically.

Verdict: Arabic ads work, but they're the hardest path. Voice generation needs more attempts, and flat-design consistency across scenes is trickier than photorealistic or watercolor. The pipeline handles it — it just works harder.


The Numbers Side by Side

| Metric              | Product Demo | Storytelling | CTA/Urgency (Arabic) |
|---------------------|--------------|--------------|----------------------|
| Script grade        | 8.05/10      | 8.4/10 (v2)  | 8.4/10               |
| Script iterations   | 1            | 2            | 1                    |
| Image quality (avg) | 9.88/10      | 9.36/10      | 8.99/10              |
| Voice quality       | 9.67/10      | 9.0/10       | 7.55/10              |
| Voice retries       | 0            | 1            | 2                    |
| Music quality       | 8.1/10       | 8.1/10       | 7.98/10              |
| Consistency score   | 9.25/10      | 7.65/10      | 6.45/10              |
| Pipeline time       | ~5 min       | ~6 min       | ~3 min               |
| File size           | 12.9 MB      | 27.5 MB      | 6.4 MB               |

A clear pattern: simpler formats score higher, but the quality gates keep complex formats in check.


5 Things I Learned

1. Self-grading is the most valuable feature.
The storytelling test proved it. A 7.7 script became an 8.4 script because the system knew the hook was weak. Without that feedback loop, the first draft would have gone straight to production.

2. Voice generation is the bottleneck for non-English.
English voiceovers passed on the first try in both tests. Arabic needed 3 attempts. The issue is duration estimation — Arabic speech pacing differs from English, and the first-pass text is often too long. This is a clear area for improvement.
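The fix is to pre-size the text per language before the first TTS call. A rough sketch, where the words-per-minute figures are ballpark assumptions rather than rates measured from the tool:

```python
# Sketch of pre-sizing voiceover text with language-specific speech rates.
# The WPM values are rough ballpark assumptions; the point is that a
# single English rate over-predicts how many Arabic words fit.
WPM = {"en": 150, "ar": 110}  # assumed conversational TTS rates

def max_words(target_seconds, lang):
    """Roughly how many words fit the target duration in this language."""
    return int(target_seconds / 60 * WPM.get(lang, 130))

print(max_words(15, "en"))  # 37
print(max_words(15, "ar"))  # 27
```

Ten fewer Arabic words in a 15-second slot is exactly the gap that turned one TTS call into three in Test 3.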

3. Photorealistic is the easiest style for consistency.
The product demo scored 9.25 on consistency. Watercolor dropped to 7.65. Flat-design hit 6.45. Stylized images have more variance between scenes, which makes cross-scene consistency harder. A style-locking mechanism could help here.
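One cheap version of style locking is prompt-level: generate scene 1, pull out its dominant colors, then pin that palette and style into every later prompt. This is a speculative sketch — the hex palette, prompt wording, and function are all made up (a real version might quantize the scene-1 image with Pillow to extract the palette):

```python
# Sketch of prompt-level style locking: after scene 1 is generated, pin
# its palette and style into every later prompt. The palette values and
# prompt wording here are illustrative, not the tool's actual behavior.
def lock_style(scene_prompts, palette_hex, style):
    """Append a fixed style/palette suffix to every prompt after scene 1."""
    suffix = f", {style} style, palette limited to {', '.join(palette_hex)}"
    return scene_prompts[:1] + [p + suffix for p in scene_prompts[1:]]

prompts = ["product reveal", "feature close-up", "CTA card"]
locked = lock_style(prompts, ["#1B3A5C", "#F2A33C"], "flat-design")
print(locked[1])  # feature close-up, flat-design style, palette limited to #1B3A5C, #F2A33C
```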

4. The tool limit is a real constraint.
Two of three tests hit the 20-tool round limit before completing subtitles and composition scoring. The videos still exported fine, but the pipeline should be optimized to fit within fewer tool calls — or the limit needs to increase.

5. Every ad format exported a real video.
Despite all the retries and edge cases, every test produced a platform-ready video with correct specs. That's the baseline promise, and it held.


What I'd Improve Next

Based on these tests, here's what's on the roadmap:

  • Arabic voice calibration — Pre-calculate duration estimates using Arabic-specific WPM ranges to reduce retries
  • Style consistency locking — Extract color palette and visual parameters from Scene 1 and enforce them across all subsequent scenes
  • Pipeline optimization — Reduce tool calls by batching operations (generate all images in one call, grade them in one call)
  • Subtitle fallback — Prioritize subtitle generation over composition scoring when approaching the tool limit

Try It Yourself

The tool is open source. Pick one of these three prompts, run it, and see what comes out:

```shell
git clone https://github.com/UrNas/advideo-creator.git
cd advideo-creator
cp .env.example .env  # Add your API keys
uv run main.py
```

Star the repo if you find it useful. Open an issue if something breaks. PRs are welcome.

GitHub: UrNas/advideo-creator


This is part 2 of a series on building AI-powered ad generation. Part 1 covered the architecture. Part 3 will go deeper on the quality gate system and how self-grading actually works under the hood.
