I first noticed the failure mode on a long chain when Scene 12 looked plausible—but it no longer felt like it belonged to Scene 1. Nothing was “broken” in any single step. The drift was death by a thousand cuts: each transition was slightly off, and those small errors compounded.
That’s the moment I stopped thinking about consistency as a global memory problem.
In Scenematic, I treat consistency like signal propagation in a circuit. You don’t carry the whole circuit state inside each component. You propagate constraints along edges, attenuate them with distance, periodically re-anchor so noise doesn’t accumulate, then close the loop by measuring outputs and applying targeted corrections.
The closed loop has three modules:
- Constraint propagation decides what must be preserved as you move through the scene graph.
- Hierarchical scene bridge measures what was preserved and turns it into a concrete transition plan.
- Progressive pipeline fixes what wasn’t preserved, using diagnosis instead of blind retries.
The core insight is simple and annoyingly effective:
Long-form video consistency isn’t a memory problem—it’s local propagation with periodic re-anchoring.
When that’s true, Scene 12 doesn’t need to “remember” Scene 1. It only needs to be consistent with Scene 11 under constraints that ultimately originated at Scene 1.
Key insight (early, because it’s the whole game)
A naive chaining system asks every generation step to do an impossible job:
- preserve character identity
- preserve palette
- preserve camera language
- obey narrative intent
- introduce new content
…and do it all from a single prompt + a single similarity score.
That’s how you get the classic failure: a candidate that scores well on one metric (say, embedding similarity) but breaks something humans notice immediately (palette, motion, composition).
My fix wasn’t “more memory.” It was a loop:
- Propagation sets the target (what matters, how strongly, for how long).
- The bridge turns that target into generation parameters (strength + prompt modifiers).
- The pipeline tests cheaply, measures signals, and corrects the weak ones.
This is why Scene 12 can stay visually consistent with Scene 1 without carrying a global scratchpad.
How it works under the hood
The system is easiest to understand if you follow the data the way it actually flows: constraints → plan → exploration → scoring → corrections → retry → accept.
```mermaid
flowchart TD
  subgraph propagation
    sceneGraph[Scene graph] --> propagate[propagateConstraints BFS]
    propagate --> propagated[PropagatedConstraint list]
    propagated --> materialize[materializePropagated]
    materialize --> sceneConstraints[SceneConstraint list]
  end
  subgraph planning
    sceneConstraints --> bridge[Scene bridge]
    bridge --> transitionPlan[Transition Plan]
  end
  subgraph closedLoop
    transitionPlan --> thinkFrames[Think frames exploration]
    thinkFrames --> scoring[Reward signals]
    scoring --> diagnose[diagnoseAndCorrect]
    diagnose --> refinement[Targeted corrections]
    refinement --> acceptOrRecover[Stage gates and recovery]
  end
```
I like this shape because it makes the responsibilities crisp:
- propagation answers “what must persist?”
- bridging answers “what should I do next?”
- the loop answers “did it work, and if not, what exactly failed?”
1) Constraint propagation: BFS with attenuation + periodic refresh
The propagation module is the part that makes “no global memory” viable.
If you don’t carry a full history, you still need a way for Scene 1’s constraints—identity, palette, lighting—to influence Scene 12. The trick is to treat the project like a directed graph and push constraints forward along edges.
Propagation is a BFS over the scene graph with three important knobs:
- `ATTENUATION_PER_STEP = 0.95` (exponential decay)
- `MAX_PROPAGATION_DEPTH = 5` (don’t let constraints haunt the whole project)
- `REFRESH_INTERVAL = 5` (periodically re-anchor so drift doesn’t compound)
Here’s the real BFS core from my code:
```typescript
// Excerpt: graph, visited, frontier, depth, result, propagatable, and
// sourceSceneId all come from the enclosing propagateConstraints function.
const nextFrontier: string[] = []
for (const currentId of frontier) {
  const outEdges = graph.edges.filter(e => e.sourceId === currentId)
  for (const edge of outEdges) {
    if (visited.has(edge.targetId)) continue
    visited.add(edge.targetId)
    nextFrontier.push(edge.targetId)

    const needsRefresh = depth % REFRESH_INTERVAL === 0
    const effectiveDepth = needsRefresh ? 1 : depth

    const propagated: PropagatedConstraint[] = propagatable.map(c => ({
      sourceConstraint: c,
      sourceSceneId,
      propagationDepth: effectiveDepth,
      attenuationFactor: Math.pow(ATTENUATION_PER_STEP, effectiveDepth),
    }))
    result.set(edge.targetId, propagated)
  }
}
frontier = nextFrontier
depth++
```
The non-obvious part is `effectiveDepth`.
Without refresh, attenuation compounds forever: Scene 1’s constraints become so weak by Scene 12 that they’re basically vibes. With refresh, every REFRESH_INTERVAL scenes I intentionally treat the constraint as if it’s only one hop old (effectiveDepth = 1). That’s the re-anchor.
It’s not “remember Scene 1.” It’s “periodically re-assert what Scene 1 cares about.”
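The refresh is easier to feel with numbers. Here is the attenuation schedule in isolation; the constant names come from my code, but the standalone function is illustrative:

```typescript
const ATTENUATION_PER_STEP = 0.95
const REFRESH_INTERVAL = 5

// Mirrors the BFS rule: at every REFRESH_INTERVAL-th hop, treat the
// constraint as if it were only one hop old.
function attenuationAt(depth: number): number {
  const effectiveDepth = depth % REFRESH_INTERVAL === 0 ? 1 : depth
  return Math.pow(ATTENUATION_PER_STEP, effectiveDepth)
}

// Without refresh, ten hops would leave 0.95^10 ≈ 0.60 of the signal.
// With refresh, depth 10 snaps back to 0.95.
```

That snap back to 0.95 at depths 5 and 10 is the re-anchor in numerical form.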
Materialization: thresholds attenuate, enforcement can downgrade
Propagation produces PropagatedConstraint objects. Those aren’t directly enforced—they need to become real SceneConstraints attached to the downstream scenes.
When I materialize, I do two important things:
- attenuate numeric thresholds using propagation depth
- downgrade enforcement to advisory if the signal has decayed too far
Here’s the real mapping logic:
```typescript
export function materializePropagated(
  propagated: PropagatedConstraint[]
): SceneConstraint[] {
  return propagated.map(p => ({
    ...p.sourceConstraint,
    id: crypto.randomUUID(),
    threshold: p.sourceConstraint.threshold !== undefined
      ? attenuateThreshold(p.sourceConstraint.threshold, p.propagationDepth)
      : undefined,
    enforcement: p.attenuationFactor < 0.5
      ? ("advisory" as const)
      : p.sourceConstraint.enforcement,
    metadata: {
      ...p.sourceConstraint.metadata,
      propagatedFrom: p.sourceSceneId,
      propagationDepth: p.propagationDepth,
      attenuationFactor: p.attenuationFactor,
    },
  }))
}
```
Two details matter in practice:
- The downgrade rule is explicit: `p.attenuationFactor < 0.5` flips enforcement to advisory.
- I persist the provenance in `metadata` (`propagatedFrom`, `propagationDepth`, `attenuationFactor`) so the rest of the pipeline can explain why a constraint exists.
That downgrade is a pressure release valve. If you keep everything “hard” forever, you end up fighting intentional story shifts. If you let everything go soft immediately, you drift. This rule gives the graph a spine, not a straitjacket.
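One helper in the materialization snippet, `attenuateThreshold`, isn’t shown in this post. A plausible sketch, assuming thresholds decay with the same per-step factor as constraints (the exact formula here is my guess, not the real implementation):

```typescript
const ATTENUATION_PER_STEP = 0.95

// Hypothetical implementation: relax a numeric threshold as the
// constraint travels further from its source scene, so distant scenes
// are held to a looser standard.
function attenuateThreshold(threshold: number, propagationDepth: number): number {
  return threshold * Math.pow(ATTENUATION_PER_STEP, propagationDepth)
}

// e.g. a 0.8 similarity threshold two hops out relaxes to roughly 0.72
```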
2) The scene bridge: turning “what matters” into a transition plan
Once constraints exist on a downstream scene, I still need to turn them into action.
That’s what the scene bridge does: it fuses three layers of features into a TransitionPlan, then derives two concrete outputs that the generator can actually use:
- `recommendedStrength` (an img2img strength in a bounded range)
- `promptModifiers` (short, explicit instructions)
The strength calculation is where I compress a bunch of competing forces into one knob.
Here’s the real computeRecommendedStrength:
```typescript
function computeRecommendedStrength(l1: L1Features, l2: L2Features, l3: L3Features): number {
  let strength = 0.85 - (l3.narrativeCoherence * 0.5)
  strength -= (l2.identityPreservation - 0.5) * 0.2
  if (l2.sceneShift === "high") strength += 0.1
  if (l2.sceneShift === "low") strength -= 0.1
  const motionVec = motionToVector(l1.motionContinuation)
  const hasMotion = Math.sqrt(motionVec[0] ** 2 + motionVec[1] ** 2) > 0.1
  if (hasMotion) strength -= 0.05
  return Math.max(0.30, Math.min(0.85, strength))
}
```
What I like about this function is that it’s opinionated in exactly the way a generator needs:
- high narrative coherence pushes strength down (stay closer)
- high identity preservation pushes strength down (don’t mutate faces)
- a “high” scene shift pushes strength up (allow change)
- motion continuity pushes strength down a bit (keep structure stable)
- final clamp to `[0.30, 0.85]` keeps the system out of pathological extremes
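To see those forces interact, here is a worked run of the function with minimal stand-in types and an assumed `motionToVector` helper (neither the types nor the helper’s mapping are from the post):

```typescript
// Minimal stand-in types; the real L1/L2/L3 feature shapes are richer.
type SceneShift = "low" | "medium" | "high"
interface L1Features { motionContinuation: string }
interface L2Features { identityPreservation: number; sceneShift: SceneShift }
interface L3Features { narrativeCoherence: number }

// Assumed helper: maps a few camera-motion labels to direction vectors.
function motionToVector(motion: string): [number, number] {
  switch (motion) {
    case "pan-left": return [-1, 0]
    case "pan-right": return [1, 0]
    case "tilt-up": return [0, 1]
    default: return [0, 0] // "static camera" and unknown labels
  }
}

function computeRecommendedStrength(l1: L1Features, l2: L2Features, l3: L3Features): number {
  let strength = 0.85 - (l3.narrativeCoherence * 0.5)
  strength -= (l2.identityPreservation - 0.5) * 0.2
  if (l2.sceneShift === "high") strength += 0.1
  if (l2.sceneShift === "low") strength -= 0.1
  const motionVec = motionToVector(l1.motionContinuation)
  const hasMotion = Math.sqrt(motionVec[0] ** 2 + motionVec[1] ** 2) > 0.1
  if (hasMotion) strength -= 0.05
  return Math.max(0.30, Math.min(0.85, strength))
}

// A coherent, identity-heavy, low-shift transition with camera motion:
// every force pushes strength down, and the lower clamp catches it.
const strength = computeRecommendedStrength(
  { motionContinuation: "pan-left" },
  { identityPreservation: 0.9, sceneShift: "low" },
  { narrativeCoherence: 0.8 },
)
// strength === 0.30 (0.45 − 0.08 − 0.10 − 0.05 = 0.22, clamped up to 0.30)
```

The clamp firing here is the point: even when every heuristic agrees on “stay close,” the system refuses to go below a strength where the generator can still do useful work.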
The limitation is obvious too: it collapses a multi-dimensional plan into a scalar. That’s why the prompt modifiers matter.
Prompt modifiers: small text, sharp intent
The bridge also produces modifiers that explicitly tell the generator what to preserve.
The part I rely on most is the preservation-priority switch—because it’s the difference between “keep the face” and “keep the world”:
```typescript
case "character":
  modifiers.push("Focus on maintaining character identity, facial features, and clothing.")
  break
case "environment":
  modifiers.push("Focus on maintaining environment, spatial layout, and setting consistency.")
  break
case "mood":
  modifiers.push("Focus on maintaining mood, atmosphere, and tonal consistency.")
  break
```
This is one of those places where being blunt beats being clever. The model is going to hallucinate if you leave it room.
3) Progressive pipeline: diagnose, correct, and only then retry
Even with a good plan, generation is stochastic. So I don’t pretend I’ll nail it on the first attempt.
Instead, I run a three-stage progressive loop:
- Stage 1 (Alignment): explore via think frames, pick a path, then do full generation
- Stage 2 (Refinement): if quality is below the gate, identify weak signals and apply targeted corrections
- Stage 3 (Recovery): last-resort cascade that prefers “consistent” over “creative”
The important part isn’t that it retries. It’s that it retries surgically.
Diagnosis: turn weak signals into concrete actions
The function that makes refinement feel like engineering instead of gambling is diagnoseAndCorrect.
It takes:
- `signals: RewardSignals`
- `plan: TransitionPlan`
- `currentPrompt: string`
- `threshold = 0.5`
…and returns a list of Correction objects.
Here’s the real implementation:
```typescript
export function diagnoseAndCorrect(
  signals: RewardSignals, plan: TransitionPlan, currentPrompt: string, threshold = 0.5
): Correction[] {
  const weakSignals = identifyWeakSignals(signals, threshold)
  const corrections: Correction[] = []
  for (const signal of weakSignals) {
    switch (signal) {
      case "colorHarmony":
        corrections.push({ signal: "colorHarmony", action: "inject_color",
          promptModifier: `Maintain exact color palette with dominant tone ${plan.colorTarget}. Use ${plan.brightnessZone} lighting.` })
        break
      case "compositionStability":
        corrections.push({ signal: "compositionStability", action: "reduce_strength",
          strengthAdjustment: -0.15 })
        break
      case "motionContinuity":
        corrections.push({ signal: "motionContinuity", action: "inject_motion",
          promptModifier: plan.motionContinuation !== "static camera"
            ? `Continue ${plan.motionContinuation} camera motion from previous scene.` : undefined })
        break
      case "visualDrift":
        corrections.push({ signal: "visualDrift", action: "increase_fidelity",
          strengthAdjustment: 0.10, promptModifier: "Preserve exact visual appearance from reference frame." })
        break
      case "narrativeCoherence":
        corrections.push({ signal: "narrativeCoherence", action: "rewrite_prompt",
          promptModifier: `Transitioning smoothly: ${plan.intendedChange}. Maintain visual continuity.` })
        break
    }
  }
  return corrections
}
```
Two things surprised me when I first wired this in:
- Lowering strength is my go-to fix for composition instability (`-0.15`). That sounds counterintuitive until you see how often “composition drift” is just “the model took too much freedom.”
- Color fixes want redundancy. The modifier repeats both palette and lighting because the model will obey one and ignore the other unless you box it in.
The tradeoff is that corrections can fight each other. If visualDrift says “increase fidelity” (+0.10) while compositionStability says “reduce strength” (-0.15), you’re now negotiating. That’s not a bug; it’s the reality of multi-objective control.
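One simple way to run that negotiation is to sum the competing adjustments and re-clamp to the same band the bridge uses. This policy is my sketch of the idea, not necessarily what the real pipeline does:

```typescript
interface Correction { strengthAdjustment?: number }

// Assumed resolution policy: net out competing strength adjustments,
// then re-clamp to the bridge's [0.30, 0.85] band.
function applyCorrections(baseStrength: number, corrections: Correction[]): number {
  const delta = corrections.reduce((sum, c) => sum + (c.strengthAdjustment ?? 0), 0)
  return Math.max(0.30, Math.min(0.85, baseStrength + delta))
}

// visualDrift (+0.10) and compositionStability (-0.15) net to -0.05:
// the retry moves slightly toward preservation rather than whipsawing.
```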
Stage gates: explicit thresholds
The progressive pipeline uses stage gates to decide whether to accept output or keep pushing:
- Stage 1 threshold: `0.70`
- Stage 2 threshold: `0.60`
I’m calling those out because they’re not “magic defaults”—they’re the explicit gates that make the loop closed. If you don’t gate, you don’t have a control system; you have a hope system.
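The gating logic can be written down directly. The thresholds are from the pipeline as described; the function and verdict names here are mine:

```typescript
const STAGE1_THRESHOLD = 0.70
const STAGE2_THRESHOLD = 0.60

type Verdict = "accept" | "refine" | "recover"

// Sketch of the gates: accept at stage 1 if the composite clears 0.70,
// accept after refinement if it clears 0.60, otherwise fall through to
// the stage 3 recovery cascade.
function gate(stage: 1 | 2, compositeScore: number): Verdict {
  if (stage === 1) return compositeScore >= STAGE1_THRESHOLD ? "accept" : "refine"
  return compositeScore >= STAGE2_THRESHOLD ? "accept" : "recover"
}
```

Note the gates only loosen: a score of 0.65 fails stage 1 but passes stage 2, which is what lets refinement salvage near-misses without letting stage 1 accept them outright.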
Recovery: the cascade that refuses to die
Stage 3 exists because real pipelines need a “finish the job” mode.
The recovery cascade is intentionally conservative:
- try very low strength (`0.35`) with an explicit “Preserve exact visual appearance...” instruction
- fall back to the second-best think frame if it exists
- if all else fails, use the source frame as-is
Here’s the real shape of that cascade from my code:
```typescript
async function stage3(
  sourceImageUrl: string,
  basePrompt: string,
  plan: TransitionPlan,
  thinkFrameResult: ThinkFrameResult | null,
  deps: PipelineDeps,
  config: PipelineConfig,
  sourceMotionDirection?: string
): Promise<{ imageUrl: string; signals: RewardSignals; compositeScore: number; attempts: number }> {
  // Strategy 1: very low strength to preserve source structure
  const recoveryUrl = await deps.generateFullQuality({
    sourceImageUrl,
    prompt: `${basePrompt} Preserve exact visual appearance and composition from reference frame.`,
    strength: 0.35,
    colorPalette: plan.colorTarget !== "#808080"
      ? { dominant: plan.colorTarget, palette: plan.colorPalette, description: "" }
      : undefined,
    motionDirection: sourceMotionDirection,
  })
  if (recoveryUrl) {
    const signals = await scoreKeyframe(sourceImageUrl, recoveryUrl, plan, deps.clipScorer, sourceMotionDirection)
    return {
      imageUrl: recoveryUrl,
      signals,
      compositeScore: computeCompositeScore(signals, config.weights),
      attempts: 1,
    }
  }

  // Strategy 2: second-best think frame
  if (thinkFrameResult && thinkFrameResult.scoredFrames.length > 1) {
    const secondBest = thinkFrameResult.scoredFrames[1]
    return {
      imageUrl: secondBest.candidate.imageUrl,
      signals: secondBest.signals,
      compositeScore: computeCompositeScore(secondBest.signals, config.weights),
      attempts: 0,
    }
  }

  // Strategy 3: source frame as-is (ultimate fallback)
  return {
    imageUrl: sourceImageUrl,
    signals: {
      visualDrift: 1.0,
      colorHarmony: 1.0,
      motionContinuity: 1.0,
      compositionStability: 1.0,
      narrativeCoherence: plan.narrativeCoherence,
    },
    compositeScore: 1.0,
    attempts: 0,
  }
}
```
I built this because there’s a nasty failure mode in long projects: a single scene that fails hard can poison everything downstream. Stage 3 is me explicitly choosing “boring but consistent” over “creative but broken.”
What went wrong (and why the re-anchor exists)
My first mental model was: “If each scene is consistent with the previous one, the chain will stay consistent.”
That’s almost true—and it’s exactly the kind of almost-true that ruins long-form generation.
Local consistency without re-anchoring behaves like a slow random walk. Even if each step is mostly correct, bias accumulates. The fix is encoded directly in propagation:
- attenuation prevents constraints from becoming immortal
- refresh prevents constraints from evaporating into noise
That `REFRESH_INTERVAL = 5` isn’t a performance hack. It’s a drift control mechanism.
Nuances that make the loop work in practice
A few design choices look small in code but matter a lot in behavior.
Advisory downgrade is how I avoid fighting intentional change
`p.attenuationFactor < 0.5` downgrading enforcement to advisory is the system admitting something honest:
At distance, constraints are less trustworthy. The story may have legitimately moved on.
If you keep enforcing hard constraints deep into a chain, you get the uncanny effect where the model keeps dragging old identity/style into scenes that should have diverged. Downgrade gives you a graceful fade-out.
Strength clamps are guardrails, not tuning
`Math.max(0.30, Math.min(0.85, strength))` is me refusing to let a heuristic pretend it’s smarter than the model.
Strength outside that band tends to create extreme behavior: either the model ignores the reference, or it refuses to introduce anything new. Keeping it bounded makes the rest of the loop (think frames + corrections) do the real work. For practical background on how img2img's "strength" parameter behaves and why clamping it matters when you want to balance preservation vs. change, see the implementation notes in the Hugging Face Diffusers img2img pipeline docs.
Corrections are written like a checklist on purpose
Look at the correction modifiers:
- “Maintain exact color palette…”
- “Continue camera motion…”
- “Preserve exact visual appearance…”
- “Transitioning smoothly: … Maintain visual continuity.”
They’re not poetic. They’re not trying to be.
The entire point of the closed loop is that the system can say: “this specific thing is weak; do this specific thing next.” Anything vaguer is just re-rolling dice.
Closing
The reason Scene 12 can stay faithful to Scene 1 in Scenematic isn’t that I built a better memory—it’s that I stopped asking memory to do a control system’s job. Propagation defines what must persist, the bridge turns that into a plan, and the progressive loop measures, diagnoses, and corrects until the chain behaves like a graph with physics instead of a prompt with optimism.
SOURCES
- Hugging Face Diffusers — Stable Diffusion img2img pipeline docs (notes on how the "strength" parameter trades off preservation vs. change): https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion_img2img