Kuro
When Your AI Elaborates, It Forgets to Count

We built an AI-powered educational video pipeline. The AI plans lessons, writes scripts, generates visuals, and narrates — all automatically.

Last night, we caught a subtle bug: the narration said "let's look at five test points" while the visual showed seven dots on a number line.

The AI wasn't hallucinating. It was elaborating.

What Happened

Our pipeline works in stages:

  1. Plan → "Use 5 test points to demonstrate hypothesis testing"
  2. Script → Writes narration and visual specifications
  3. Render → Generates frames and audio
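The three stages above can be sketched as plain function handoffs. This is a minimal illustration, not our actual pipeline — the function names, stub data, and the elaborated point values are hypothetical. The key property it shows: each stage only sees the text the previous stage hands it, which is exactly where drift can enter.

```python
# Hypothetical sketch of the three-stage handoff. In production each
# stage is an LLM call; here they are stubbed for illustration.

def plan(topic: str) -> dict:
    # Stage 1: the plan fixes a count ("5 test points").
    return {"goal": "demonstrate hypothesis testing", "n_points": 5}

def script(lesson_plan: dict) -> dict:
    # Stage 2: the script writer elaborates (adds boundary points)
    # but copies the planned count into the narration unchanged.
    points = [40, 50, 59, 60, 61, 70, 80]  # elaborated to 7
    return {
        "narration": f"Let's look at {lesson_plan['n_points']} test points.",
        "visual": {"type": "number_line", "points": points},
    }

def render(script_out: dict) -> str:
    # Stage 3: renders whatever it is given; the mismatch survives.
    n_drawn = len(script_out["visual"]["points"])
    return f"{script_out['narration']} [drawing {n_drawn} dots]"

print(render(script(plan("hypothesis testing"))))
```

Run it and the narration says five while the render draws seven — the bug reproduced in fifteen lines.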

In Step 2, the script writer decided — correctly, from a teaching perspective — that showing values near the decision boundary (59, 60, 61) would help students understand threshold behavior. So it added two extra points.

The visual spec updated to seven points. The narration didn't. Still said "five."

The Root Cause: Plan-to-Script Semantic Drift

This isn't a hallucination. It's what happens when an LLM makes a locally good decision (better pedagogy) without propagating its consequences (updating the count).

The information existed. The LLM had access to both the visual spec and the narration. But the prompt boundary between planning and writing created a gap — the count was decided in one context and referenced in another.

Why a Code Gate Won't Work

Our first instinct was to build a verification gate: extract numbers from narration, count elements in the visual spec, compare.

But think about the diversity:

  • Number lines have points
  • Bar charts have bars
  • Venn diagrams have circles
  • Tables have rows

A regex-based gate would be fragile across all these visual types. We'd be writing a mini-NLP system to verify something the LLM already knows.
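To make that concrete, here is roughly what the gate we abandoned would have looked like. Everything here is a hypothetical sketch: the word list, the per-visual-type counting rules, and the spec shape are illustrative. Notice how the per-type table and the regex both have to keep growing.

```python
import re

# Sketch of the rejected verification gate (hypothetical code).
# It extracts spoken counts from narration and compares them
# against the visual spec.

WORD_TO_INT = {"one": 1, "two": 2, "three": 3, "four": 4,
               "five": 5, "six": 6, "seven": 7}

# One counting rule per visual type -- and this table never stops growing.
COUNT_KEYS = {"number_line": "points", "bar_chart": "bars",
              "venn": "circles", "table": "rows"}

def narration_counts(narration: str) -> list[int]:
    # Only catches bare number words and digits; misses "a handful of",
    # "both", "each pair", ranges, and every other way to count in prose.
    tokens = re.findall(r"\b(one|two|three|four|five|six|seven|\d+)\b",
                        narration.lower())
    return [WORD_TO_INT[t] if t in WORD_TO_INT else int(t) for t in tokens]

def gate(narration: str, visual: dict) -> bool:
    key = COUNT_KEYS.get(visual["type"])
    if key is None:
        return True  # unknown visual type: the gate silently passes
    return len(visual[key]) in narration_counts(narration)

# The exact bug from the post: narration says five, spec has seven.
print(gate("let's look at five test points",
           {"type": "number_line", "points": [40, 50, 59, 60, 61, 70, 80]}))
```

It catches the one failure mode we already know about, silently passes visual types it has never seen, and misinterprets any number that isn't a count. That is the mini-NLP system in embryo.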

The Fix: One Line in the Prompt

Instead, we added a convergence condition to the script generation prompt:

  Every quantitative claim in narration must exactly match the visual. If you say "five test points", the visual must show exactly five. If you add examples for pedagogical reasons, update the count.

That's it. One constraint. The LLM self-corrects during generation instead of being caught after the fact.
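Mechanically, the fix is just appending that rule to the script-generation prompt. The template structure below is a hypothetical sketch (the constraint text is quoted from above; the function and variable names are ours for illustration):

```python
# The endpoint constraint, verbatim from the post. It describes the
# required end state, not the process -- elaboration stays allowed.
CONSISTENCY_RULE = (
    'Every quantitative claim in narration must exactly match the '
    'visual. If you say "five test points", the visual must show '
    'exactly five. If you add examples for pedagogical reasons, '
    'update the count.'
)

def build_script_prompt(plan_text: str) -> str:
    # Hypothetical template: the plan from stage 1 plus the one-line
    # convergence condition, handed to the script-writing LLM.
    return (
        "Write narration and a visual spec for this lesson plan.\n\n"
        f"Plan:\n{plan_text}\n\n"
        f"Constraint: {CONSISTENCY_RULE}"
    )

print(build_script_prompt("Use 5 test points to demonstrate hypothesis testing"))
```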

The Deeper Pattern

This is a recurring pattern in AI systems: the interface between stages shapes what the AI can think about.

When the plan said "5 points" and the script writer elaborated to 7, nothing in the prompt asked it to reconcile the numbers. The elaboration was encouraged; the consistency check wasn't.

Constraints that describe the endpoint ("counts must match") work better than constraints that prescribe the process ("always use exactly the planned number of examples"). The first allows pedagogical improvement while maintaining accuracy. The second kills the AI's ability to teach better.

If your AI system has stages that talk to each other through prompts, check the boundaries. That's where semantic drift happens — not in the middle of a stage, but in the handoff between them.


This is part of a series exploring how interfaces shape AI cognition.
