Building Continuum: an agent that shoots a whole drama series, not one clip

JasonRobert — Wed, 24 Jun 2026 16:10:03 +0000

Continuum is my entry for the Global AI Hackathon Series with Qwen Cloud (Track 2: AI Showrunner). Code: https://github.com/calderbuild/continuum

I spent the last few weeks building Continuum, an autonomous AI showrunner that produces serialized vertical micro-drama on Qwen and Wan. One agent crew takes a premise and a cast, writes the script, storyboards it, generates the video, scores its own work, and cuts the episode together. Then it does it again for episode two, and the second episode actually looks like it belongs to the first one. That last part is the whole point, and it was also the hard part.

Why micro-drama, and why consistency is the wall everyone hits

Vertical micro-drama is not a niche. The global market was around $11B in 2025, and in China roughly 95% of the 128,000 new titles added each month are already AI-touched in some way. It is a real, validated, money-printing format, and it is going overseas fast. Qwen speaks 119 languages, so the fit there is almost unfair.

The catch shows up the moment you want more than a single clip. Every tool I looked at (Showrunner, AIDrama Studio, Topview) can make you one nice shot. None of them can keep a character looking like the same person across shots, let alone across episodes. The state of the art is a human sitting in the loop, re-rolling generations and hand-picking the takes that match. Nobody had an agent that maintains a memory of the world and corrects its own drift. That gap is the entire reason Continuum exists.

What I actually built

The core is three things working together.

First, a Series Bible. It is a JSON document the agent reads before it shoots and writes after it wraps. It holds each character's appearance prompt, a locked reference image, props, locations, plot threads, and the emotional arc. The rule that makes it work is boring and strict: once a character's look is locked, later merges can add new props and threads but they cannot rewrite the face. That one constraint is what stops the slow drift that ruins every multi-shot attempt.
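The lock rule can be sketched in a few lines of Python. This is an illustration only, not the code in the repo: the field names (appearance_prompt, reference_image, locked, props, plot_threads) and the merge_bible helper are assumptions made up for the example.

```python
from copy import deepcopy

# Hypothetical field names; the real Bible schema is not shown in this post.
LOCKED_FIELDS = ("appearance_prompt", "reference_image", "locked")


def merge_bible(bible: dict, update: dict) -> dict:
    """Return a new Bible with `update` merged in, leaving locked looks untouched."""
    merged = deepcopy(bible)
    characters = merged.setdefault("characters", {})

    for name, incoming in update.get("characters", {}).items():
        current = characters.get(name)
        if current is None:
            # A brand-new character is added as-is.
            characters[name] = deepcopy(incoming)
            continue
        is_locked = current.get("locked", False)
        for key, value in incoming.items():
            if key == "props":
                # Props are additive: append anything not already listed.
                props = current.setdefault("props", [])
                for prop in value:
                    if prop not in props:
                        props.append(prop)
            elif is_locked and key in LOCKED_FIELDS:
                # The locked look wins; the attempted rewrite is dropped.
                continue
            else:
                current[key] = deepcopy(value)

    # Plot threads are append-only as well.
    threads = merged.setdefault("plot_threads", [])
    for thread in update.get("plot_threads", []):
        if thread not in threads:
            threads.append(thread)
    return merged
```

With this shape, an episode that tries to change a locked character's appearance prompt and add a prop ends up only adding the prop. The function returns a copy, so the Bible as read at the start of the shoot is never mutated in place.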

Second, a critic-optimizer loop. After each clip comes back, Qwen-VL compares the character in it against episode one's anchor frame and returns a structured identity score with a one-line reason. If the score is under threshold, an optimizer rewrites the shot prompt and the cinematographer re-generates. Retries are capped at a few attempts, and if none clears the threshold the best-scoring take is kept. The agent disagrees with itself and fixes it, with no human watching.
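As a rough sketch, the control flow of that loop looks like the following. The three callables stand in for the Wan generation call, the Qwen-VL critic, and the Qwen prompt optimizer; their signatures, the Take tuple, the 0.9 threshold, and the cap of 3 tries are all assumptions for illustration, not values taken from the project.

```python
from typing import Callable, NamedTuple, Optional, Tuple


class Take(NamedTuple):
    prompt: str
    clip: str      # stand-in for a clip path or URL
    score: float   # identity score from the critic, 0.0 to 1.0
    reason: str    # the critic's one-line explanation


def shoot_with_critic(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str], Tuple[float, str]],
    optimize: Callable[[str, str], str],
    threshold: float = 0.9,   # assumed value
    max_tries: int = 3,       # assumed value
) -> Take:
    """Generate, score, and re-prompt until the score passes or tries run out."""
    best: Optional[Take] = None
    for _ in range(max_tries):
        clip = generate(prompt)
        score, reason = critique(clip)
        take = Take(prompt, clip, score, reason)
        if best is None or take.score > best.score:
            best = take
        if score >= threshold:
            break  # good enough, stop spending generations
        # Feed the critic's reason back into the next prompt.
        prompt = optimize(prompt, reason)
    assert best is not None  # max_tries is expected to be at least 1
    return best
```

Passing the model calls in as plain functions keeps the loop testable with fakes, which matters when every real generate call costs quota.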

Third, a real consistency number, and it comes from the same Qwen-VL critic rather than a hand-wave. It judges each episode's character against the first appearance and averages the identity match. On the two-episode demo series that lands at 0.98: the same detective, the bob, the cyan circuit tattoo, across completely different scenes. A measured number, not a claim. SSIM via ffmpeg stays in as a zero-dependency fallback, but it scores pixels, not identity, so the VL judge is the one that means something.
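The arithmetic behind that number is just a mean over per-episode judgments. Here is a minimal sketch, assuming one representative frame per episode and a judge function that returns an identity score between 0 and 1; both the function names and that one-frame-per-episode simplification are assumptions, and the real scorer may sample differently.

```python
from statistics import fmean
from typing import Callable, Sequence


def series_consistency(
    anchor_frame: str,
    episode_frames: Sequence[str],
    judge: Callable[[str, str], float],
) -> float:
    """Average identity match of each episode's frame against the anchor."""
    if not episode_frames:
        raise ValueError("need at least one episode frame to score")
    # judge(anchor, frame) stands in for the Qwen-VL identity comparison.
    scores = [judge(anchor_frame, frame) for frame in episode_frames]
    return fmean(scores)
```

Because every episode is compared against the same first-appearance anchor rather than against the previous episode, small errors cannot compound from one episode to the next.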

For the brain I used Qwen3-max for scripting and prompt optimization, and Qwen-VL as the visual critic. Video runs on Wan text-to-video over Qwen Cloud, with the locked look injected into every prompt; Wan's reference-to-video model is the designed-in next step for tightening identity further, since it was built for exactly this job. The backend is FastAPI with server-sent events for the live view, and it is structured to deploy on Alibaba Cloud Function Compute.
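Injecting the locked look into a prompt can be as simple as string assembly from the Bible. The sketch below assumes a hypothetical build_shot_prompt helper and an appearance_prompt field per character; the exact prompt template used with Wan is not shown in this post.

```python
from typing import List


def build_shot_prompt(shot: str, cast: List[str], bible: dict) -> str:
    """Append each cast member's locked appearance to the shot description."""
    looks = []
    for name in cast:
        character = bible["characters"][name]
        looks.append(f"{name}: {character['appearance_prompt']}")
    # The trailing framing hint is an assumption, matching the 9:16 output.
    return f"{shot.rstrip('.')}. Characters: {'; '.join(looks)}. Vertical 9:16 framing."
```

The point of building the prompt from the Bible at call time, instead of letting the scriptwriter describe the character per shot, is that the appearance text is identical in every generation request.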

What surprised me

Wan held up better than I expected. I went in assuming I would fight the model for every vertical frame, and instead it gave me coherent 9:16 clips with sensible motion on the first real pass. That freed up days I had budgeted for wrestling with output quality, and I spent them on the Bible and the critic loop instead.

The second surprise was a debugging detour. My async submit-and-poll calls to Wan started failing with 503s from a proxy in front of the API, intermittently, never the same shot twice. I burned a couple of hours chasing it as a bug in my own code before I accepted it was upstream flakiness. The fix was unglamorous: treat 503 as retryable, back off, poll again. Once I stopped trusting the first response and started retrying the transient failures, the pipeline went from "fails one shot in five" to running a full episode unattended. Lesson I keep relearning: when a call should work and doesn't, check the transport before you rewrite the logic.
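The retry pattern itself is generic. A minimal sketch, assuming the request is wrapped in a function that returns a (status_code, body) pair; the call_with_retry name, the attempt count, and the delays are illustrative assumptions, not the values used in Continuum.

```python
import random
import time
from typing import Any, Callable, Tuple

RETRYABLE_STATUSES = {503}  # transient upstream failures worth retrying


def call_with_retry(
    request: Callable[[], Tuple[int, Any]],
    max_attempts: int = 5,      # assumed value
    base_delay: float = 1.0,    # seconds, assumed value
    max_delay: float = 30.0,    # seconds, assumed value
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Any]:
    """Call `request`, retrying retryable statuses with exponential backoff."""
    for attempt in range(max_attempts):
        status, body = request()
        if status not in RETRYABLE_STATUSES:
            return status, body
        if attempt == max_attempts - 1:
            break  # out of attempts, do not sleep again
        # Double the wait each time, cap it, and add jitter so parallel
        # shots do not all retry at the same instant.
        delay = min(max_delay, base_delay * 2 ** attempt)
        sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError(f"still getting a retryable status after {max_attempts} attempts")
```

The same wrapper can sit around both the submit call and each poll. Only 503 is treated as transient here; other error statuses are returned to the caller immediately so real bugs still surface.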

The third was budget. Video generation was the cost I feared, so I rendered at 720p vertical and kept episodes short while building, and the whole demo series ran inside the free Wan quota. When the free quota got close to empty I set a hard spending cap rather than drift into pay-as-you-go. The money I expected to lose to video generation mostly never left the account.

What's next

The skeleton is autonomous and the moat works, so the follow-ups are about depth. Lip-synced dialogue through speech-to-video is the obvious one; right now I ship subtitles plus narration, which is honest but plain. After that comes wiring the consistency scorer into a vector index, so the agent can retrieve a character from a library instead of re-deriving it. And the multilingual angle is sitting right there, since Qwen already covers the languages, so one series could ship in a dozen markets from the same Bible.

One person, a few weeks, an agent that ships consecutive episodes whose lead actually holds together. That is the thing I wanted to prove was possible, and now it runs.

Just submitted MeetSpot to the Kiro Hackathon! 🚀

JasonRobert — Sat, 29 Nov 2025 14:55:24 +0000

My favorite thing about @kirodotdev? The Spec-Driven Development.

It completely changed my dev approach. I defined my architecture in .kiro/steering/tech.md, and Kiro generated 90% of my Async FastAPI backend strictly following my patterns. It felt like pair programming with a Principal Architect.