Building a Production AI Video Pipeline: Architecture Deep Dive

#ai #architecture #python #machinelearning

Building a production-grade AI video system is nothing like the demos suggest.

I've spent the last year building ZipX Pro, a platform that takes a script and outputs a complete multi-episode short drama using AI. Here's the architecture, with the decisions that actually mattered.

The Core Problem: Models Are Stateless, Stories Are Not

Every major AI video model (Veo3, Kling, Seedance, HappyHorse) operates on a single prompt. It has no memory of what it generated before. For a 30-second clip, this is fine. For a six-episode drama with consistent characters, this is catastrophic.

The fundamental architecture challenge: how do you make stateless generation feel stateful?

Our Answer: The Character Bible System

Before any video is generated, we extract a structured "bible" from the script:

from dataclasses import dataclass

@dataclass
class CharacterBible:
    character_id: str
    appearance: dict       # face_description, hair, build, age_range
    wardrobe: dict         # outfit per scene context
    voice_profile: str     # TTS voice ID + style params
    emotional_range: list  # ["stoic", "explosive", "melancholy"]
    reference_frames: list # URLs of approved generated frames

@dataclass
class SceneBible:
    location_id: str
    lighting_palette: str  # "golden hour, warm 3200K, soft shadows"
    camera_style: str      # "handheld, 24mm equivalent, low angles"
    approved_frames: list  # locked reference frames for this location

Every generation call appends the relevant bible entries to the prompt. With this in place, character drift dropped from ~40% per shot to under 2%.
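As a rough illustration of "appending bible entries to the prompt", here is a minimal sketch. It uses trimmed-down copies of the two dataclasses, and the `build_prompt` function, the bracketed tag format, and the `scene_context` key are all assumptions made for the example, not the actual ZipX Pro prompt format.

```python
from dataclasses import dataclass


@dataclass
class CharacterBible:
    character_id: str
    appearance: dict   # e.g. {"hair": "short black", "age_range": "30s"}
    wardrobe: dict     # outfit keyed by scene context


@dataclass
class SceneBible:
    location_id: str
    lighting_palette: str
    camera_style: str


def build_prompt(shot_text: str, characters: list, scene: SceneBible,
                 scene_context: str) -> str:
    """Append the relevant bible entries to a shot description (hypothetical format)."""
    lines = [shot_text.strip()]
    for c in characters:
        # Sort keys so the same bible always produces the same prompt text.
        look = ", ".join(f"{k}: {v}" for k, v in sorted(c.appearance.items()))
        outfit = c.wardrobe.get(scene_context, "unspecified")
        lines.append(f"[character {c.character_id}] {look}; wardrobe: {outfit}")
    lines.append(
        f"[location {scene.location_id}] lighting: {scene.lighting_palette}; "
        f"camera: {scene.camera_style}"
    )
    return "\n".join(lines)
```

Keeping the output deterministic (sorted keys, fixed order) matters here: if the same bible serializes differently between calls, the model sees a different prompt and the bible itself becomes a source of drift.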

The Model Router

We don't use one model. We use four, routed by shot requirements:

class ShotRouter:
    def route(self, shot: Shot) -> str:
        # Emotion-heavy close-up? HappyHorse wins on Emotion Transfer
        if shot.shot_type == "CU" and shot.emotional_intensity > 0.7:
            return "happyhorse"

        # Action with physics (water, fight, crowd)?
        if shot.has_physics_elements:
            return "kling_2"

        # Establishing shot needing cinematic quality?
        if shot.shot_type in ("ELS", "LS") and shot.cinematic_priority:
            return "veo3"

        # Default: Seedance for speed + cost efficiency
        return "seedance"

This alone reduces per-minute generation cost by ~60% compared to routing everything through Veo3.
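The router reads four fields from a `Shot`, which is not defined in this post. A minimal sketch of what it might look like follows. The field names come from the router code, while the types, defaults, and the 0.0 to 1.0 intensity range are assumptions. The routing rules are restated as a plain function so the example runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    # Field names are taken from the router; types and defaults are assumed.
    shot_type: str                      # "CU", "LS", "ELS", ...
    emotional_intensity: float = 0.0    # assumed range: 0.0 to 1.0
    has_physics_elements: bool = False
    cinematic_priority: bool = False


def route(shot: Shot) -> str:
    """Same rule order as ShotRouter.route, restated as a function."""
    if shot.shot_type == "CU" and shot.emotional_intensity > 0.7:
        return "happyhorse"
    if shot.has_physics_elements:
        return "kling_2"
    if shot.shot_type in ("ELS", "LS") and shot.cinematic_priority:
        return "veo3"
    return "seedance"


# Rule order matters: the first matching rule wins.
print(route(Shot("CU", emotional_intensity=0.9, has_physics_elements=True)))  # happyhorse
print(route(Shot("LS", has_physics_elements=True, cinematic_priority=True)))  # kling_2
print(route(Shot("MS")))                                                      # seedance
```

Note that a long shot with both physics and cinematic priority goes to Kling rather than Veo3, because the physics rule is checked first.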

The Continuity Agent

This is where most pipeline attempts fail. A simple "check if it looks consistent" prompt doesn't work — LLMs hallucinate consistency.

Our approach: frame-level embedding comparison.

class ContinuityAgent:
    def __init__(self, bible_store):
        self.clip_model = load_clip()  # visual embeddings
        # Lookup over all character bibles; must expose get_references(character_id)
        self.bible = bible_store

    def check_frame(self, frame_url: str, character_id: str) -> ContinuityResult:
        frame_embedding = self.clip_model.encode_image(frame_url)
        bible_embeddings = [
            self.clip_model.encode_image(ref)
            for ref in self.bible.get_references(character_id)
        ]

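        # default=0.0 makes a character with no reference frames fail the check
        # instead of raising ValueError on an empty sequence
        similarity = max(
            (cosine_similarity(frame_embedding, ref) for ref in bible_embeddings),
            default=0.0,
        )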

        if similarity < 0.82:  # empirically tuned threshold
            return ContinuityResult(
                passed=False,
                reason="character_drift",
                regen_prompt=self.build_correction_prompt(frame_url, character_id)
            )
        return ContinuityResult(passed=True)

Frames below the threshold trigger automatic re-generation with an enhanced prompt that includes the reference frames directly.
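The `cosine_similarity` call in the agent is not defined in this post. A dependency-free sketch of the comparison is below, treating embeddings as plain lists of floats. The `best_match` helper is a hypothetical name, and a real pipeline would more likely do this with vectorized tensor operations.

```python
import math
from typing import Iterable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as no match
    return dot / (norm_a * norm_b)


def best_match(frame: Sequence[float],
               references: Iterable[Sequence[float]]) -> float:
    """Highest similarity between a frame and any approved reference frame."""
    return max((cosine_similarity(frame, ref) for ref in references), default=0.0)
```

Taking the maximum over references means a frame only has to match one approved angle or expression of the character, which is more forgiving than comparing against an averaged embedding.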

The Quality Gate Pipeline

Generated Frame
      ↓
  CLIP Similarity Check  ──(fail)──→  Re-generate (max 3 attempts)
      ↓ pass
  Lighting Consistency   ──(fail)──→  Color grade correction
      ↓ pass
  Resolution + Artifacts ──(fail)──→  Upscale or discard
      ↓ pass
  Approved Frame Pool

The three-attempt limit is critical. If a shot fails three times, it gets flagged for human review rather than silently degrading quality.
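The retry-then-escalate behavior can be sketched as a small loop. This is an illustration under assumptions: `generate` and `passes_checks` are hypothetical callables standing in for the model call and the gate checks, and `GateOutcome` is a made-up result type.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # the three-attempt limit described above


@dataclass
class GateOutcome:
    frame: Optional[str]        # approved frame, or None if every attempt failed
    attempts: int
    needs_human_review: bool


def generate_with_gate(generate: Callable[[int], str],
                       passes_checks: Callable[[str], bool],
                       max_attempts: int = MAX_ATTEMPTS) -> GateOutcome:
    """Try up to max_attempts generations, then escalate instead of accepting a bad frame."""
    for attempt in range(1, max_attempts + 1):
        frame = generate(attempt)  # attempt number lets the caller strengthen the prompt
        if passes_checks(frame):
            return GateOutcome(frame, attempt, needs_human_review=False)
    return GateOutcome(None, max_attempts, needs_human_review=True)
```

The key property is that the failure path returns no frame at all, so a downstream stage cannot accidentally use the last failed attempt.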

Throughput Numbers

Running on a single A100 node with 8 parallel agent workers:

Script to storyboard: ~4 minutes (35-scene episode)
Shot generation (parallel): ~45 minutes per episode
Continuity + QA passes: ~15 minutes per episode
Audio sync + export: ~8 minutes per episode

Total: ~72 minutes per episode of final-cut-ready content.

A six-episode drama that took a human team 3 weeks now takes about 7 hours of compute (6 × 72 minutes).
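The totals follow directly from the stage timings listed above. This check assumes the four stages run one after another within an episode and that episodes are processed sequentially.

```python
# Per-episode stage timings from the list above, in minutes.
stage_minutes = {
    "script_to_storyboard": 4,
    "shot_generation": 45,
    "continuity_and_qa": 15,
    "audio_sync_and_export": 8,
}

per_episode = sum(stage_minutes.values())   # 72 minutes
episodes = 6
total_hours = per_episode * episodes / 60   # 432 minutes = 7.2 hours

print(per_episode, total_hours)
```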

What We Got Wrong (And Fixed)

Wrong assumption #1: More agents = better quality.
We started with 12 specialized agents. Coordination overhead killed throughput. We consolidated to 6 core agents and 29 tool functions, which made the pipeline about 40% faster.

Wrong assumption #2: Single re-generation is enough.
Our first QA loop re-generated once on failure. Real-world failures cluster: a bad scene setup causes 70% of subsequent frames to fail too. The fix: detect upstream failures early and regenerate from the storyboard stage.
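One way to detect such clusters is to aggregate QA results per scene and escalate a scene whose failure ratio is too high, rather than retrying its frames one by one. The sketch below is only an illustration: the function name, the `(scene_id, passed)` input shape, and both thresholds are hypothetical and would need tuning.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def scenes_to_rebuild(results: Iterable[Tuple[str, bool]],
                      failure_ratio: float = 0.5,
                      min_frames: int = 4) -> List[str]:
    """Return scene IDs whose frames fail often enough to suspect the scene setup.

    results: (scene_id, passed) pairs from the per-frame QA checks.
    failure_ratio and min_frames are illustrative values, not tuned ones.
    """
    totals = defaultdict(int)
    fails = defaultdict(int)
    for scene_id, passed in results:
        totals[scene_id] += 1
        if not passed:
            fails[scene_id] += 1
    # Ignore scenes with too few frames to say anything about a pattern.
    return sorted(
        scene_id for scene_id, n in totals.items()
        if n >= min_frames and fails[scene_id] / n >= failure_ratio
    )
```

Scenes returned here would go back to the storyboard stage, while isolated failures in otherwise healthy scenes stay on the normal per-frame retry path.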

Wrong assumption #3: Users want model choice.
We exposed the full model router to users. They were paralyzed by options. We hid the router and left quality as the only dial, and retention improved 3x.

The full pipeline is live at ZipX Pro. Free tier available — bring your script, get a rough cut. We're also working on a developer API for teams who want to embed the pipeline in their own products.

Questions on the architecture? Drop them in the comments.