Why splitting storyboard from rendering beats one mega-prompt for AI comics

#ai #architecture #llm #softwareengineering

The most tempting way to build an AI comic generator is also the worst one: take a paragraph of story, send it to an image model, and ask for a finished page.

It feels like magic when it works. It almost never works.

The story-to-page mega-prompt collapses for the same reason that one-shot UI generation collapses. Too many constraints land on the model at once, and the model has no clean way to negotiate them. So I separated storyboard from rendering early in Comicory, and that single split made everything downstream more controllable.

The mega-prompt confuses two different jobs

A comic page is doing two different jobs at the same time.

One job is structural. How many panels are on the page. What each panel is about. Who is in it. What they are doing. Where the camera sits. What the dialogue is.

The other job is visual. The actual drawing. Lighting. Style. Character likeness. Background detail. Lettering.

These two jobs need different reasoning. Structure is closer to a writer's mental model. Rendering is closer to an illustrator's. Asking one prompt to do both forces the model to make structural and visual decisions in the same pass, and the result is usually a compromise that satisfies neither.

Storyboard is cheap, rendering is expensive

There is also a cost asymmetry that the mega-prompt ignores.

Generating a structural storyboard, even with a strong language model, is fast and cheap. It is essentially a long-form structured response: panel count, panel descriptions, dialogue, camera notes.
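A minimal Python sketch of what such a structured response could look like. The field names and the JSON shape are illustrative assumptions, not Comicory's actual schema, and the example reply is hand-written rather than real model output.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Panel:
    summary: str            # one-line description of what the panel is about
    characters: list[str]   # who is in the panel
    camera: str             # camera note, e.g. "wide" or "close-up"
    dialogue: list[str] = field(default_factory=list)


@dataclass
class Storyboard:
    panels: list[Panel]


def parse_storyboard(raw_json: str) -> Storyboard:
    """Parse a JSON reply (assumed shape: {"panels": [...]}) into a Storyboard."""
    data = json.loads(raw_json)
    return Storyboard(panels=[Panel(**p) for p in data["panels"]])


# Hand-written stand-in for a language model reply.
example_reply = json.dumps({
    "panels": [
        {"summary": "Mira finds the map", "characters": ["Mira"],
         "camera": "wide", "dialogue": ["What is this?"]},
        {"summary": "The map glows", "characters": ["Mira"],
         "camera": "close-up"},
    ]
})

board = parse_storyboard(example_reply)
print(len(board.panels), board.panels[1].camera)
```

Everything in this structure is plain text, which is why producing or regenerating it costs so little compared to an image.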

Generating the actual panel images is slow and expensive. Every retry costs real money and real time.

If both happen inside one big prompt, every visual retry forces the structural decisions to be re-rolled too. That is wasteful, and it makes the workflow harder to debug. The user cannot tell whether the failure was a story problem or an image problem.

Splitting them gives the system a chance to validate the cheap part before paying for the expensive part. The structural storyboard can be checked, edited, or regenerated quickly. Only when it is acceptable does the system commit to rendering.
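One way to picture that gate is a cheap validation pass that must succeed before any image call is made. This is a sketch under assumptions: the specific checks, the limits (`max_panels`, `max_dialogue_chars`), and the `render_panel` callback are all hypothetical.

```python
def validate_storyboard(panels, max_panels=8, max_dialogue_chars=120):
    """Return a list of problems; an empty list means the storyboard may proceed."""
    problems = []
    if not 1 <= len(panels) <= max_panels:
        problems.append(f"panel count {len(panels)} outside 1..{max_panels}")
    for i, panel in enumerate(panels, start=1):
        if not panel.get("summary", "").strip():
            problems.append(f"panel {i}: missing summary")
        for line in panel.get("dialogue", []):
            if len(line) > max_dialogue_chars:
                problems.append(f"panel {i}: dialogue line too long to letter")
    return problems


def render_if_valid(panels, render_panel):
    """Call the (expensive) renderer only when the (cheap) checks pass."""
    problems = validate_storyboard(panels)
    if problems:
        return [], problems
    return [render_panel(panel) for panel in panels], []


# Stub renderer that records calls, standing in for a paid image API.
render_calls = []

def fake_render(panel):
    render_calls.append(panel["summary"])
    return "image for " + panel["summary"]


bad = [{"summary": "  ", "dialogue": ["x" * 200]}]
images, problems = render_if_valid(bad, fake_render)
print(images, problems, render_calls)
```

With the invalid storyboard above, the renderer is never called, so the failure costs nothing on the image side.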

The user is the right person to gate the transition

Even with a perfect model, the storyboard stage should usually be visible to the user.

This sounds like extra friction, but it is exactly where the user wants control. People bring a specific story idea to a comic tool. They have opinions about pacing, about how a beat lands, about which panel should be a close-up. They do not have opinions about brush stroke density.

A short, readable storyboard, presented as a list of panels with a one-line summary and the dialogue for each, gives them a place to spend that attention. They can tweak panel count, rewrite a line, swap a camera angle, before any image is generated.
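As a rough illustration, the readable view can be a plain-text projection of the storyboard data, and an edit is just a change to that data. The dictionary keys and the output layout here are assumptions made for the example.

```python
def format_storyboard(panels):
    """Render panels as a short, human-readable list: summary plus dialogue."""
    lines = []
    for i, panel in enumerate(panels, start=1):
        lines.append(f"Panel {i} [{panel['camera']}]: {panel['summary']}")
        for spoken in panel.get("dialogue", []):
            lines.append(f'    "{spoken}"')
    return "\n".join(lines)


panels = [
    {"summary": "Mira finds the map", "camera": "wide",
     "dialogue": ["What is this?"]},
    {"summary": "The map glows", "camera": "close-up", "dialogue": []},
]

print(format_storyboard(panels))

# A user edit before any image exists: swap a camera angle, rewrite a line.
panels[1]["camera"] = "low angle"
panels[0]["dialogue"][0] = "Who left this here?"
print(format_storyboard(panels))
```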

That is also the part of Comicory that benefits the most from a fast iteration loop. Storyboard edits should feel like editing a doc, not like negotiating with an image model.

Rendering inherits a clear contract

Once the storyboard is settled, the rendering stage has a much smaller and clearer job.

For each panel, it receives an explicit description, a character reference, a camera intent, and the dialogue that must fit. It does not have to invent the page structure or guess what the user meant. It just has to draw the one panel.
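That contract can be written down as an immutable per-panel request. The sketch below is an assumption about shape only: `RenderRequest`, its fields, and the `character_ref` string are hypothetical names, and no real image API is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the renderer cannot quietly change the brief
class RenderRequest:
    panel_index: int
    description: str
    character_ref: str          # hypothetical ID of a shared character reference
    camera: str
    dialogue: tuple[str, ...]   # lines that must fit in the panel


def build_render_requests(panels, character_ref):
    """Turn settled storyboard panels into one self-contained request each."""
    return [
        RenderRequest(
            panel_index=i,
            description=panel["summary"],
            character_ref=character_ref,
            camera=panel["camera"],
            dialogue=tuple(panel.get("dialogue", [])),
        )
        for i, panel in enumerate(panels)
    ]


requests = build_render_requests(
    [
        {"summary": "Mira finds the map", "camera": "wide",
         "dialogue": ["What is this?"]},
        {"summary": "The map glows", "camera": "close-up"},
    ],
    character_ref="mira-ref-v1",
)
print(requests[1])
```

Each request carries everything needed to draw one panel, so a single panel can be retried without touching the others.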

That clean handoff makes character consistency easier. It makes panel composition easier. It makes targeted regeneration possible, because the storyboard remains the source of truth across retries.

It also makes the system easier to reason about. If a panel renders badly, the question is just "did the rendering stage do its job for this panel description?" There is no ambiguity about whether the story intent was right.

What this looks like in the product

In Comicory the pipeline ends up roughly like this:

1. Take the user's story or prompt and produce a structural storyboard.
2. Show the storyboard to the user so they can edit panel count, descriptions, and dialogue.
3. Lock the storyboard.
4. Render each panel against the locked storyboard, reusing the same character reference.
5. Allow per-panel regeneration without changing the rest of the page.
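The steps above can be sketched as a small state machine. This is an illustrative outline, not Comicory's code: `ComicPage`, `StoryboardLockedError`, and the `render_panel` callback are hypothetical, and the renderer is a stub that only records calls.

```python
class StoryboardLockedError(Exception):
    """Raised when an edit is attempted after the storyboard was locked."""


class ComicPage:
    def __init__(self, descriptions, character_ref, render_panel):
        self.descriptions = list(descriptions)
        self.character_ref = character_ref
        self.render_panel = render_panel
        self.locked = False
        self.images = [None] * len(self.descriptions)

    def edit_panel(self, index, description):
        # Editing is only allowed before the lock.
        if self.locked:
            raise StoryboardLockedError("storyboard is locked")
        self.descriptions[index] = description

    def lock(self):
        self.locked = True

    def regenerate(self, index):
        # Re-render one panel from the locked storyboard; others stay as-is.
        if not self.locked:
            raise RuntimeError("lock the storyboard before rendering")
        self.images[index] = self.render_panel(
            self.descriptions[index], self.character_ref
        )

    def render_all(self):
        for index in range(len(self.descriptions)):
            self.regenerate(index)


# Stub renderer standing in for an image model call.
calls = []

def fake_render(description, character_ref):
    calls.append(description)
    return f"image#{len(calls)} of {description!r} using {character_ref}"


page = ComicPage(["Mira finds the map", "The map glows"], "mira-ref-v1", fake_render)
page.edit_panel(1, "The map glows blue")   # edit while still unlocked
page.lock()
page.render_all()
first_image = page.images[0]
page.regenerate(1)                         # retry only the second panel
print(page.images[0] == first_image, len(calls))
```

The retry re-renders the second panel from the same description and character reference, and the first panel's image is left alone.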

Each step is its own product surface. Each can be improved on its own. The model does not have to be a comic genius. It just has to be reliable at each stage.

The product lesson

The biggest mistake I see in AI comic demos is collapsing storyboard and rendering into one heroic prompt and then blaming the model when it fails.

The model is not the bottleneck. The architecture is.

Once storyboard and rendering are separated, the model becomes a much more useful collaborator. The user gets a real place to apply taste. The system gets a clean contract between stages. The retries get cheaper. The final page gets better.

That is the kind of structural decision that does not show up in a flashy demo, but determines whether the product is actually usable for someone trying to make a real comic.