Building a production-grade AI video system is nothing like the demos suggest.
I've spent the last year building ZipX Pro, a platform that takes a script and outputs a complete multi-episode short drama using AI. Here's the architecture, along with the decisions that actually mattered.
The Core Problem: Models Are Stateless, Stories Are Not
Every major AI video model (Veo3, Kling, Seedance, HappyHorse) operates on a single prompt. It has no memory of what it generated before. For a 30-second clip, this is fine. For a six-episode drama with consistent characters, this is catastrophic.
The fundamental architecture challenge: how do you make stateless generation feel stateful?
Our Answer: The Character Bible System
Before any video is generated, we extract a structured "bible" from the script:
from dataclasses import dataclass

@dataclass
class CharacterBible:
    character_id: str
    appearance: dict        # face_description, hair, build, age_range
    wardrobe: dict          # outfit per scene context
    voice_profile: str      # TTS voice ID + style params
    emotional_range: list   # ["stoic", "explosive", "melancholy"]
    reference_frames: list  # URLs of approved generated frames

@dataclass
class SceneBible:
    location_id: str
    lighting_palette: str   # "golden hour, warm 3200K, soft shadows"
    camera_style: str       # "handheld, 24mm equivalent, low angles"
    approved_frames: list   # locked reference frames for this location
Every generation call appends relevant bible entries to the prompt. Character drift drops from ~40% per shot to under 2%.
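How the bible entries get attached to a prompt isn't shown above, so here is a minimal sketch of one way it could work. The `build_prompt` helper, the trimmed-down dataclasses, the bracketed text layout, and the sample character "mara" are all assumptions made for illustration, not the production format.

```python
from dataclasses import dataclass, field


@dataclass
class CharacterBible:
    # Trimmed-down version of the dataclass above, for illustration only
    character_id: str
    appearance: dict
    wardrobe: dict = field(default_factory=dict)


@dataclass
class SceneBible:
    location_id: str
    lighting_palette: str
    camera_style: str


def build_prompt(shot_text: str, characters: list[CharacterBible],
                 scene: SceneBible, scene_context: str) -> str:
    """Append bible entries to a raw shot description (hypothetical helper)."""
    lines = [shot_text.strip()]
    for c in characters:
        # Sort keys so the same bible always produces the same prompt text
        look = ", ".join(f"{k}: {v}" for k, v in sorted(c.appearance.items()))
        outfit = c.wardrobe.get(scene_context, "unspecified")
        lines.append(f"[character {c.character_id}] {look}; outfit: {outfit}")
    lines.append(
        f"[location {scene.location_id}] lighting: {scene.lighting_palette}; "
        f"camera: {scene.camera_style}"
    )
    return "\n".join(lines)


# Made-up sample data
mara = CharacterBible("mara", {"hair": "short black", "age_range": "30-35"},
                      {"office": "grey suit"})
rooftop = SceneBible("rooftop", "golden hour, warm 3200K, soft shadows",
                     "handheld, 24mm equivalent, low angles")
prompt = build_prompt("Mara turns toward the skyline.", [mara], rooftop, "office")
print(prompt)
```

Keeping the serialization deterministic (sorted keys, fixed order) means two shots of the same character in the same location carry identical bible text, so any difference in output comes from the shot description alone.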
The Model Router
We don't use one model. We use four, routed by shot requirements:
class ShotRouter:
    def route(self, shot: Shot) -> str:
        # Emotion-heavy close-up? HappyHorse wins on Emotion Transfer
        if shot.shot_type == "CU" and shot.emotional_intensity > 0.7:
            return "happyhorse"

        # Action with physics (water, fight, crowd)?
        if shot.has_physics_elements:
            return "kling_2"

        # Establishing shot needing cinematic quality?
        if shot.shot_type in ("ELS", "LS") and shot.cinematic_priority:
            return "veo3"

        # Default: Seedance for speed + cost efficiency
        return "seedance"
This alone reduces per-minute generation cost by ~60% compared to routing everything through Veo3.
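Rule order matters in a router like this, because the first matching rule wins. The sketch below replays the same rules on a few sample shots to make the precedence visible. The `Shot` dataclass is an assumption: it includes only the fields the router reads, and the 0.0 to 1.0 scale for `emotional_intensity` is inferred from the 0.7 cutoff.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    # Hypothetical shape: only the fields the router reads
    shot_type: str                    # "CU", "LS", "ELS", ...
    emotional_intensity: float = 0.0  # assumed 0.0-1.0 scale
    has_physics_elements: bool = False
    cinematic_priority: bool = False


def route(shot: Shot) -> str:
    """Same rule order as ShotRouter.route: first match wins."""
    if shot.shot_type == "CU" and shot.emotional_intensity > 0.7:
        return "happyhorse"
    if shot.has_physics_elements:
        return "kling_2"
    if shot.shot_type in ("ELS", "LS") and shot.cinematic_priority:
        return "veo3"
    return "seedance"


shots = [
    Shot("CU", emotional_intensity=0.9, has_physics_elements=True),  # emotion beats physics
    Shot("LS", has_physics_elements=True, cinematic_priority=True),  # physics beats cinematic
    Shot("ELS", cinematic_priority=True),
    Shot("MS"),                                                      # falls through to default
]
routes = [route(s) for s in shots]
print(routes)
```

Note that a cinematic long shot with physics elements never reaches Veo3 under this ordering. If that is not the intent, the physics and cinematic checks would need to swap places.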
The Continuity Agent
This is where most pipeline attempts fail. A simple "check if it looks consistent" prompt doesn't work — LLMs hallucinate consistency.
Our approach: frame-level embedding comparison.
# load_clip, cosine_similarity, ContinuityResult and build_correction_prompt
# are not shown in this post.
class ContinuityAgent:
    def __init__(self, bibles: dict[str, CharacterBible]):
        self.clip_model = load_clip()  # visual embeddings
        self.bibles = bibles           # character_id -> CharacterBible

    def check_frame(self, frame_url: str, character_id: str) -> ContinuityResult:
        frame_embedding = self.clip_model.encode_image(frame_url)
        # Embed every approved reference frame for this character
        bible_embeddings = [
            self.clip_model.encode_image(ref)
            for ref in self.bibles[character_id].reference_frames
        ]
        # Best match wins: the frame only has to resemble one reference
        similarity = max(
            cosine_similarity(frame_embedding, ref)
            for ref in bible_embeddings
        )
        if similarity < 0.82:  # empirically tuned threshold
            return ContinuityResult(
                passed=False,
                reason="character_drift",
                regen_prompt=self.build_correction_prompt(frame_url, character_id),
            )
        return ContinuityResult(passed=True)
Frames below the threshold trigger automatic re-generation with an enhanced prompt that includes the reference frames directly.
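The scoring step can be tried in isolation without a CLIP model. This is a minimal, dependency-free sketch. The 3-dimensional vectors are toy stand-ins for real embeddings, `best_reference_score` is a hypothetical helper name, and only the 0.82 threshold comes from the post.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def best_reference_score(frame: list[float], references: list[list[float]]) -> float:
    # default=0.0 avoids a ValueError when a character has no approved frames yet
    return max((cosine_similarity(frame, ref) for ref in references), default=0.0)


THRESHOLD = 0.82  # value from the post

frame = [1.0, 0.0, 0.0]
references = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]  # toy stand-ins for embeddings
score = best_reference_score(frame, references)
passed = score >= THRESHOLD
print(round(score, 4), passed)
```

Taking the maximum rather than the mean means a frame passes if it closely matches any single approved reference, which tolerates references shot from different angles. The trade-off is that one poor reference frame in the bible can let drifted frames through.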
The Quality Gate Pipeline
Generated Frame
↓
CLIP Similarity Check ──(fail)──→ Re-generate (max 3 attempts)
↓ pass
Lighting Consistency ──(fail)──→ Color grade correction
↓ pass
Resolution + Artifacts ──(fail)──→ Upscale or discard
↓ pass
Approved Frame Pool
The three-attempt limit is critical. If a shot fails three times, it gets flagged for human review rather than silently degrading quality.
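A bounded retry loop with an explicit escalation path might look like the sketch below. It reads "max 3 attempts" as three generations in total. `GateOutcome`, `generate_with_gate`, and the callback signatures are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # limit stated in the post


@dataclass
class GateOutcome:
    frame: Optional[str]      # approved frame, or None if every attempt failed
    attempts: int
    needs_human_review: bool


def generate_with_gate(generate: Callable[[int], str],
                       passes_gate: Callable[[str], bool]) -> GateOutcome:
    """Try up to MAX_ATTEMPTS generations; escalate instead of accepting a bad frame."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        frame = generate(attempt)
        if passes_gate(frame):
            return GateOutcome(frame, attempt, needs_human_review=False)
    # Never return the last failed frame: a missing frame is visible, a bad one is not
    return GateOutcome(None, MAX_ATTEMPTS, needs_human_review=True)


# Toy stand-ins: the generator labels frames by attempt number
ok = generate_with_gate(lambda n: f"frame-v{n}", lambda f: f == "frame-v2")
stuck = generate_with_gate(lambda n: f"frame-v{n}", lambda f: False)
print(ok)
print(stuck)
```

Returning `None` on exhaustion, rather than the least-bad frame, is what keeps quality from degrading silently: downstream stages cannot proceed until a person has looked at the shot.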
Throughput Numbers
Running on a single A100 node with 8 parallel agent workers:
- Script to storyboard: ~4 minutes (35-scene episode)
- Shot generation (parallel): ~45 minutes per episode
- Continuity + QA passes: ~15 minutes per episode
- Audio sync + export: ~8 minutes per episode
Total: ~72 minutes per episode of final-cut-ready content.
A six-episode drama that took a human team 3 weeks now takes about 7 hours of compute (6 × 72 minutes).
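The totals are easy to check from the stage timings. This sketch assumes episodes are processed one after another on the single node. Only the four stage timings and the episode count come from the post.

```python
# Stage timings in minutes per episode, taken from the list above
stage_minutes = {
    "script_to_storyboard": 4,
    "shot_generation": 45,
    "continuity_and_qa": 15,
    "audio_sync_and_export": 8,
}

per_episode = sum(stage_minutes.values())
episodes = 6
total_hours = per_episode * episodes / 60  # assumes episodes run sequentially

# Share of each episode spent in shot generation, as a percentage
generation_share = round(100 * stage_minutes["shot_generation"] / per_episode, 1)

print(per_episode, total_hours, generation_share)
```

Shot generation accounts for well over half of each episode's wall-clock time, so that stage is where further parallelism would pay off most.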
What We Got Wrong (And Fixed)
Wrong assumption #1: More agents = better quality.
We started with 12 specialized agents. Coordination overhead killed throughput. We consolidated to 6 core agents and 29 tool functions. 40% faster.
Wrong assumption #2: Single re-generation is enough.
Our first QA loop re-generated once on failure. Real-world failures cluster: a bad scene setup causes 70% of subsequent frames to fail too. The fix: detect upstream failures early and regenerate from the storyboard stage.
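One simple way to tell an isolated bad frame from a bad scene setup is to look at the failure ratio per scene. The sketch below is an assumption about how such detection could work, not the production logic: `scenes_to_rebuild`, the 0.5 cutoff, and the minimum sample size are all hypothetical.

```python
from collections import defaultdict

UPSTREAM_FAILURE_RATIO = 0.5  # hypothetical cutoff, not a figure from the post
MIN_FRAMES = 4                # don't judge a scene on too few frames


def scenes_to_rebuild(results: list[tuple[str, bool]]) -> list[str]:
    """results holds one (scene_id, passed) pair per checked frame."""
    totals: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    for scene_id, passed in results:
        totals[scene_id] += 1
        if not passed:
            failures[scene_id] += 1
    # Scenes where failures cluster go back to the storyboard stage
    return [
        scene for scene, n in totals.items()
        if n >= MIN_FRAMES and failures[scene] / n >= UPSTREAM_FAILURE_RATIO
    ]


results = (
    [("s1", True)] * 5 + [("s1", False)]      # isolated failure: retry the frame
    + [("s2", False)] * 4 + [("s2", True)]    # clustered failures: rebuild the scene
)
rebuild = scenes_to_rebuild(results)
print(rebuild)
```

Scenes returned here would skip per-frame retries entirely, since re-rolling frames on top of a broken setup mostly burns attempts.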
Wrong assumption #3: Users want model choice.
We exposed the full model router to users. They were paralyzed by options. Hidden router with quality as the only dial: 3x better retention.
The full pipeline is live at ZipX Pro. Free tier available — bring your script, get a rough cut. We're also working on a developer API for teams who want to embed the pipeline in their own products.
Questions on the architecture? Drop them in the comments.