Sensei

Posted on Jun 30

Loop Engineering: The 14-step roadmap from prompter to loop designer

#agents #llm #ai #architecture

The production incident report landed in my inbox at 2:47 AM on a Tuesday.

A customer's refund request had bounced between three different LLM calls — classification, policy lookup, response generation — and somehow hallucinated a 90-day return window that didn't exist. The prompt chain looked clean on paper. Each step worked in isolation. But the loop connecting them had no memory, no guardrails, and no way to course-correct when the classifier drifted.

I'd written the prompts. I'd stitched the chain. I called it an "agentic workflow." It was really just a fragile pipeline wearing a trench coat.

That night forced a distinction I now teach every engineer I work with: Prompting is asking a model to think. Loop engineering is building a system that thinks reliably.

What is loop engineering?

Loop engineering is the discipline of designing, instrumenting, and operating recursive LLM execution cycles that maintain state, enforce invariants, and converge on correct outcomes across unbounded interaction horizons — unlike prompt chains, which execute once and hope for the best.

Not a prompt chain that calls an LLM three times in sequence. Not a RAG pipeline with a reranker. A loop: an LLM that observes, decides, acts, observes the result, and decides again — with persistent context, explicit termination conditions, and observable intermediate states at every iteration.

The boundary is sharp. A prompt chain is a DAG. A loop is a state machine with an LLM as the transition function. The chain asks "what's the answer?" The loop asks "what's the next action given everything that happened so far?" That difference — statefulness with feedback — is why loops handle ambiguity, tool failure, and multi-step reasoning where chains collapse.

The flywheel: observe → decide → act → observe. Each turn tightens the context. The loop converges or it hits a guardrail. That convergence guarantee is the product.
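Here's a minimal sketch of that flywheel, assuming a toy number-guessing task so it runs without a model. `decide` stands in for the LLM call and `act` for a tool call; every name in it is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class State:
    target: int                     # hidden answer the "environment" knows (toy task)
    low: int = 0
    high: int = 100
    history: List[Tuple[int, str]] = field(default_factory=list)


def decide(state: State) -> int:
    """Stand-in for the LLM: choose the next action from state alone."""
    return (state.low + state.high) // 2


def act(state: State, guess: int) -> str:
    """Stand-in for a tool call: the environment reports what happened."""
    if guess == state.target:
        return "correct"
    return "too_low" if guess < state.target else "too_high"


def observe(state: State, guess: int, feedback: str) -> State:
    """Fold the result back into state so the next decision can use it."""
    state.history.append((guess, feedback))
    if feedback == "too_low":
        state.low = guess + 1
    elif feedback == "too_high":
        state.high = guess - 1
    return state


def run_loop(state: State, max_iterations: int = 10) -> Optional[int]:
    for _ in range(max_iterations):      # guardrail: hard iteration cap
        guess = decide(state)
        feedback = act(state, guess)
        state = observe(state, guess, feedback)
        if feedback == "correct":        # explicit termination condition
            return guess
    return None                          # hit the guardrail without converging
```

A chain would call `decide` once and return whatever came back. The loop gets to be wrong twice and still land on the answer, or stop at the cap and say so.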

Why the prompter-to-loop-designer shift matters

You're already writing prompts. Maybe you've built chains. The market doesn't need more prompt engineers. It needs engineers who can close the loop — who understand that reliability comes from structure, not clever wording.

Dimension	Prompter	Loop Designer
Mental model	Single-turn optimization	Multi-turn convergence
Failure mode	"Fix the prompt"	"Fix the invariant"
State	Implicit, in context window	Explicit, persisted, versioned
Debugging	Re-run and pray	Replay with time-travel
Scaling	Longer prompts	More loops, tighter guards
Ownership	"It works on my machine"	"It converges in production"

The shift isn't semantic. It's the difference between a demo that impresses investors and a system that survives Black Friday.

The 14-step roadmap

Phase 1: Foundations (Steps 1–4)

Step 1: Map the decision surface

Before writing a single loop, enumerate every decision point in your task. Not steps — decisions. A decision has alternatives, consequences, and a correctness criterion.

For a code-review loop:

Decision: Is this PR ready to merge? (Alternatives: approve, request changes, block)
Decision: Which file should I examine next? (Alternatives: 47 changed files)
Decision: Is this pattern a bug or a style nit? (Alternatives: block, warn, ignore)

Tool: Draw a decision tree. If it has more than 7 leaf nodes, you need a loop. If it has cycles, you definitely need a loop.
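If you'd rather check that rule mechanically, here is a rough sketch. It assumes you've written the decision surface down as a dict from each decision to the decisions it can lead to; the graph below is a hypothetical encoding of the code-review example, not something extracted from a real system.

```python
from typing import Dict, List

# Hypothetical decision graph: node -> decisions it can lead to.
# Every node must appear as a key, including leaves (empty list).
GRAPH: Dict[str, List[str]] = {
    "pick_next_file": ["classify_finding"],
    "classify_finding": ["block", "warn", "ignore"],
    "warn": ["pick_next_file"],      # go back for another file: a cycle
    "ignore": ["pick_next_file"],
    "block": [],
}


def count_leaves(graph: Dict[str, List[str]]) -> int:
    """A leaf is a decision with no follow-up decision."""
    return sum(1 for children in graph.values() if not children)


def has_cycle(graph: Dict[str, List[str]]) -> bool:
    """Depth-first search with three colors; a grey-to-grey edge is a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node: str) -> bool:
        color[node] = GREY
        for child in graph.get(node, []):
            if color.get(child, WHITE) == GREY:
                return True
            if color.get(child, WHITE) == WHITE and visit(child):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


def needs_loop(graph: Dict[str, List[str]], max_leaves: int = 7) -> bool:
    """The rule of thumb from this step: cycles, or more than 7 leaves."""
    return has_cycle(graph) or count_leaves(graph) > max_leaves
```

The code-review graph has only one leaf, but the "go look at another file" edges make it cyclic, so `needs_loop(GRAPH)` is true.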

Step 2: Define the loop state schema

State is not "the conversation history." State is a typed, versioned data structure that survives context truncation and enables replay.

from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class LoopState:
    iteration: int
    task_spec: TaskSpec                    # Immutable goal
    working_memory: WorkingMemory          # Scratchpad, pruned each turn
    long_term_memory: LongTermMemory       # Facts, decisions, user prefs
    tool_results: List[ToolResult]         # Append-only log
    invariants: InvariantSet               # Must hold at every iteration
    termination_signal: Optional[TerminationReason]
    version: int                           # Schema version for migration

Rule: If you can't serialize it to JSON and replay the loop from step 3, it's not state — it's hope.
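A quick way to hold yourself to that rule is a round-trip test. This sketch uses a cut-down `MiniState` (my own stand-in, much smaller than the schema above) and assumes every field is already JSON-friendly.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class MiniState:
    iteration: int
    subgoal: str
    facts: List[str] = field(default_factory=list)
    termination_signal: Optional[str] = None
    version: int = 1


def to_json(state: MiniState) -> str:
    # sort_keys keeps snapshots byte-stable, which makes diffs readable
    return json.dumps(asdict(state), sort_keys=True)


def from_json(payload: str) -> MiniState:
    return MiniState(**json.loads(payload))


snapshot = MiniState(iteration=3, subgoal="check return policy", facts=["order placed"])
restored = from_json(to_json(snapshot))
```

If `restored == snapshot` ever fails, something in your state is not really state: a live client handle, a lambda, a half-open stream.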

Step 3: Write the convergence contract

Every loop needs a termination predicate that is provably reachable; the hard iteration cap is what makes that guarantee hold even when nothing else fires. Not "when the model says done." A predicate over state:

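def should_terminate(state: LoopState) -> TerminationReason | None:
    # Hard cap first: this is the branch that guarantees termination.
    if state.iteration >= MAX_ITERATIONS:
        return TerminationReason.MAX_ITERATIONS_EXCEEDED
    if state.invariants.all_satisfied() and state.working_memory.is_complete():
        return TerminationReason.SUCCESS
    # tool_results is a plain list: no .last(), and it may still be empty.
    if state.tool_results and state.tool_results[-1].is_fatal_error():
        return TerminationReason.FATAL_ERROR
    # Only look for stalls once three iterations of history exist.
    if state.iteration >= 3 and not state.invariants.progress_made_since(state.iteration - 3):
        return TerminationReason.STALLED
    return None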

No predicate → infinite loop → production incident. I've seen both.

Step 4: Choose your loop topology

Three patterns cover 90% of cases. Pick one. Don't hybridize until you have a reason.

Topology	Structure	Use when
Single-agent reflexive	One LLM, internal monologue, tool calls	Deterministic tasks, clear specs, low ambiguity
Multi-agent debate	Proposer + Critic + Judge (or n-persona ensemble)	Subjective quality, adversarial verification needed
Hierarchical decomposition	Planner → Executor(s) → Aggregator	Long horizons, parallelizable subtasks, distinct skills

My default: Single-agent reflexive with a critic pass at the end. Simpler to debug. Harder to fool yourself.

Phase 2: The loop body (Steps 5–9)

Step 5: Design the observation function

The observation function converts raw tool outputs, environment events, and prior decisions into the working memory for the next turn. It's the loop's eyes.

def observe(state: LoopState, tool_result: ToolResult) -> WorkingMemory:
    # 1. Extract structured facts (not summaries)
    facts = extract_facts(tool_result)

    # 2. Score relevance to current subgoal
    scored = score_relevance(facts, state.working_memory.current_subgoal)

    # 3. Prune aggressively — context window is budget
    pruned = prune_to_budget(scored, budget=TOKEN_BUDGET * 0.3)

    # 4. Flag contradictions with long-term memory
    contradictions = detect_contradictions(pruned, state.long_term_memory)

    return WorkingMemory(
        facts=pruned,
        contradictions=contradictions,
        current_subgoal=state.working_memory.current_subgoal,
        token_count=count_tokens(pruned)
    )

Lesson learned: Summarization loses signal. Extraction preserves it. Store facts, not narratives.
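For the pruning step, one plausible shape for `prune_to_budget` is a greedy pass over facts sorted by relevance. This is a sketch under my own assumptions: `ScoredFact` and its fields are hypothetical, and token counts are taken as given rather than computed by a real tokenizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScoredFact:
    text: str
    relevance: float   # 0.0-1.0, higher means more useful for the current subgoal
    tokens: int        # assumed to be precomputed by whatever tokenizer you use


def prune_to_budget(facts: List[ScoredFact], budget: int) -> List[ScoredFact]:
    """Keep the most relevant facts that fit; skip any fact that would overflow."""
    kept: List[ScoredFact] = []
    used = 0
    for fact in sorted(facts, key=lambda f: f.relevance, reverse=True):
        if used + fact.tokens <= budget:
            kept.append(fact)
            used += fact.tokens
    return kept
```

Note that each fact is kept or dropped whole. Nothing gets paraphrased to fit, which is the point of extraction over summarization.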

Step 6: Engineer the decision prompt

The decision prompt is the loop's brain. It receives: task spec + working memory + long-term memory + invariants. It outputs: next action + updated subgoal + confidence.

Template structure that works:

<task_spec>{{task_spec}}</task_spec>
<invariants>{{invariants.render()}}</invariants>
<working_memory>{{working_memory.render()}}</working_memory>
<long_term_memory>{{long_term_memory.render(relevance_threshold=0.7)}}</long_term_memory>

<instruction>
You are an iteration in a convergent loop. Your job: propose the single highest-leverage next action.
Output JSON only:
{
  "action": {"tool": "...", "args": {...}},
  "subgoal": "specific, measurable, time-bounded",
  "confidence": 0.0-1.0,
  "reasoning": "one paragraph, cites specific facts from memory"
}
</instruction>

Critical: The prompt must reference invariants by name. The model needs to know what it's not allowed to violate.
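"Output JSON only" is a request, not a guarantee, so the loop should validate what comes back before acting on it. A minimal parser for the shape in the template might look like this; the `Decision` class and `DecisionParseError` are illustrative names, and the field rules are my reading of the template.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


class DecisionParseError(ValueError):
    """Raised when the model's output does not match the expected shape."""


@dataclass(frozen=True)
class Decision:
    tool: str
    args: Dict[str, Any]
    subgoal: str
    confidence: float
    reasoning: str


def parse_decision(raw: str) -> Decision:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise DecisionParseError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise DecisionParseError("top-level value must be a JSON object")
    try:
        action = data["action"]
        decision = Decision(
            tool=str(action["tool"]),
            args=dict(action.get("args", {})),
            subgoal=str(data["subgoal"]),
            confidence=float(data["confidence"]),
            reasoning=str(data.get("reasoning", "")),
        )
    except (KeyError, TypeError, ValueError, AttributeError) as exc:
        raise DecisionParseError(f"missing or malformed field: {exc!r}") from exc
    if not 0.0 <= decision.confidence <= 1.0:
        raise DecisionParseError("confidence must be between 0.0 and 1.0")
    return decision
```

A parse failure is itself an observation: feed the error back into working memory and let the next iteration try again, rather than crashing the loop.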

Step 7: Build the action executor with idempotency

Every tool call in a loop must be idempotent or compensatable. No exceptions.

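import hashlib
import json

class IdempotentToolExecutor:
    def execute(self, action: Action, state: LoopState) -> ToolResult:
        # Deterministic key from the action alone. Built-in hash() is salted per
        # process and fails on dict args, and mixing in state.iteration would
        # hand every retry a fresh key. Add a state field to the payload only
        # when the same action must be allowed to run again.
        payload = json.dumps({"tool": action.tool, "args": action.args}, sort_keys=True)
        idempotency_key = hashlib.sha256(payload.encode("utf-8")).hexdigest()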

        # Check if we've already executed this exact action
        if cached := self.cache.get(idempotency_key):
            return ToolResult.from_cache(cached)

        # Execute with timeout and retry policy
        result = self._execute_with_policy(action)

        # Store for replay
        self.cache.set(idempotency_key, result)
        return result

War story: A loop retried a non-idempotent "send email" tool 47 times during a model hallucination spiral. Customer got 47 emails. We now wrap every external call.
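To make the failure concrete, here's a small reproduction of that spiral with a fake email tool. It's a sketch: the in-memory dict stands in for whatever durable store you'd use in production, and `OnceOnlyExecutor` is a made-up name.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List, Tuple


def idempotency_key(tool: str, args: Dict[str, Any]) -> str:
    """Stable across processes and independent of dict ordering."""
    payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class OnceOnlyExecutor:
    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}   # assumption: swap for a durable store

    def execute(self, tool: str, args: Dict[str, Any], fn: Callable[..., Any]) -> Any:
        key = idempotency_key(tool, args)
        if key not in self._results:
            self._results[key] = fn(**args)  # the side effect happens at most once
        return self._results[key]


sent: List[Tuple[str, str]] = []


def send_email(to: str, subject: str) -> str:
    """Fake tool: records the send instead of talking to a mail server."""
    sent.append((to, subject))
    return f"sent:{len(sent)}"


executor = OnceOnlyExecutor()
for _ in range(47):   # the model asks for the same action 47 times
    executor.execute(
        "send_email",
        {"to": "customer@example.com", "subject": "Refund update"},
        send_email,
    )
```

After the 47 requests, `sent` holds exactly one entry.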

Step 8: Implement invariant enforcement

Invariants are the loop's guardrails. They run after observation, before the next decision. Hard stops.

class InvariantSet:
    def __init__(self):
        self.invariants = [
            Invariant(
                name="budget_not_exceeded",
                check=lambda s: s.working_memory.token_count < TOKEN_BUDGET,
                remediation="summarize_and_prune"
            ),
            Invariant(
                name="no_contradictory_facts",
                check=lambda s: not s.working_memory.contradictions,
                remediation="escalate_to_human"
            ),
            Invariant(
                name="progress_per_three_iterations",
                check=lambda s: s.invariants.progress_made_since(s.iteration - 3),
                remediation="change_strategy"
            ),
        ]

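    def enforce(self, state: LoopState) -> LoopState:
        for inv in self.invariants:
            if not inv.check(state):
                # remediation is stored as a name, so look up the function to run.
                # REMEDIATIONS is an assumed dict: name -> (LoopState -> LoopState).
                state = REMEDIATIONS[inv.remediation](state)
                log.warning(f"Invariant violated: {inv.name}, applied {inv.remediation}")
        return state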

Design principle: Every invariant has a remediation, not just a failure. The loop heals itself.
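A stripped-down, runnable version of the check-then-remediate pattern, assuming remediations are plain functions from state to state. The `State` here is a tiny stand-in, and `prune_oldest_half` is a deliberately crude remediation I made up for the demo.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple

TOKEN_BUDGET = 100   # hypothetical budget for the example


@dataclass(frozen=True)
class State:
    token_count: int
    facts: Tuple[str, ...]


@dataclass(frozen=True)
class Invariant:
    name: str
    check: Callable[[State], bool]          # True means the invariant holds
    remediation: Callable[[State], State]   # pure: returns a repaired copy


def prune_oldest_half(state: State) -> State:
    """Crude remediation: drop the older half of the facts, halve the count."""
    kept = state.facts[len(state.facts) // 2:]
    return replace(state, facts=kept, token_count=state.token_count // 2)


INVARIANTS = [
    Invariant("budget_not_exceeded", lambda s: s.token_count < TOKEN_BUDGET, prune_oldest_half),
]


def enforce(state: State, invariants: List[Invariant]) -> Tuple[State, List[str]]:
    """Return the repaired state plus the names of the invariants that fired."""
    violated: List[str] = []
    for inv in invariants:
        if not inv.check(state):
            violated.append(inv.name)
            state = inv.remediation(state)
    return state, violated
```

Returning the list of violated names alongside the state keeps the tracer honest: a remediation that fires every iteration is a design smell you want to see in the logs.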

Step 9: Close the long-term memory loop

Working memory is per-iteration. Long-term memory accumulates across loops — and across sessions.

class LongTermMemory:
    def __init__(self, vector_store: VectorStore, kv_store: KVStore):
        self.vector = vector_store      # Semantic facts, embeddings
        self.kv = kv_store              # Structured: user prefs, decisions, schemas

    def consolidate(self, state: LoopState) -> None:
        # Extract durable facts from this loop's trajectory
        facts = extract_durable_facts(state.tool_results)
        for fact in facts:
            self.vector.upsert(embedding=fact.embedding, metadata=fact.meta)

        # Record decisions for future loops (assumes the state also carries a
        # decisions log, which the Step 2 schema does not list)
        for decision in state.decisions:
            self.kv.set(f"decision:{decision.id}", decision.json())

    def retrieve(self, query: str, k: int = 5) -> List[Fact]:
        return self.vector.search(query, k=k)

Why it matters: A loop that learns from its own history converges faster. A loop that forgets repeats mistakes.

Phase 3: Production hardening (Steps 10–14)

Step 10: Instrument for time-travel debugging

You will need to answer "why did the loop do that at iteration 7?" — three months from now.

class LoopTracer:
    def __init__(self, event_store: EventStore):
        self.store = event_store

    def trace_iteration(self, state: LoopState, decision: Decision, result: ToolResult):
        self.store.append(LoopEvent(
            loop_id=state.loop_id,
            iteration=state.iteration,
            timestamp=utc_now(),
            state_snapshot=state.json(),           # Full state
            decision=decision.json(),              # What the model chose
            prompt=decision.prompt_sent,           # Exact prompt
            response=decision.raw_response,        # Exact response
            tool_result=result.json(),             # What happened
            invariants=state.invariants.status(),  # Pass/fail per invariant
        ))

Query pattern: SELECT * FROM events WHERE loop_id = ? ORDER BY iteration. Replay any iteration in a notebook. Compare branches. This is how you actually improve loops.
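That query pattern works against anything SQL-shaped. Here's a sketch using the standard-library `sqlite3` module with an in-memory database; the table layout is a trimmed-down assumption, not the full `LoopEvent` above.

```python
import json
import sqlite3
from typing import Any, Dict, List, Tuple

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        loop_id        TEXT    NOT NULL,
        iteration      INTEGER NOT NULL,
        state_snapshot TEXT    NOT NULL,   -- JSON
        decision       TEXT    NOT NULL,   -- JSON
        PRIMARY KEY (loop_id, iteration)
    )
    """
)


def append_event(loop_id: str, iteration: int, state: Dict[str, Any], decision: Dict[str, Any]) -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO events VALUES (?, ?, ?, ?)",
            (loop_id, iteration, json.dumps(state), json.dumps(decision)),
        )


def replay(loop_id: str) -> List[Tuple[int, Dict[str, Any], Dict[str, Any]]]:
    """Return (iteration, state, decision) for one loop, in iteration order."""
    rows = conn.execute(
        "SELECT iteration, state_snapshot, decision FROM events "
        "WHERE loop_id = ? ORDER BY iteration",
        (loop_id,),
    )
    return [(i, json.loads(s), json.loads(d)) for i, s, d in rows]


# Events may arrive out of order; the query restores the timeline.
append_event("loop-a", 2, {"subgoal": "draft reply"}, {"tool": "write"})
append_event("loop-a", 1, {"subgoal": "find policy"}, {"tool": "search"})
append_event("loop-b", 1, {"subgoal": "other run"}, {"tool": "noop"})
```

The composite primary key doubles as a sanity check: a second write for the same loop and iteration fails loudly instead of silently overwriting history.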

Step 11: Build the evaluation harness

Unit tests for loops don't test prompts. They test convergence properties.

class LoopEvaluator:
    def evaluate_convergence(self, loop: Loop, test_cases: List[TestCase]) -> EvaluationReport:
        results = []
        for tc in test_cases:
            trace = loop.run(tc.input, max_iterations=20)
            results.append(ConvergenceResult(
                test_case=tc.id,
                terminated=trace.terminated,
                termination_reason=trace.termination_reason,
                iterations=trace.iterations,
                final_state_valid=tc.validator(trace.final_state),
                invariant_violations=trace.invariant_violations,
                token_cost=trace.total_tokens,
                latency_p95=trace.latency_p95,
            ))
        return EvaluationReport(results)

Metrics that matter: Convergence rate (not accuracy — convergence), iterations to convergence, invariant violation rate, cost per convergence. Accuracy is a consequence of convergence.
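Those four metrics fall out of a list of run results with a few lines of arithmetic. This sketch uses a flattened `RunResult` of my own and one reasonable set of definitions (violation rate per iteration, cost per converged run); yours may differ.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class RunResult:
    converged: bool            # terminated with SUCCESS and passed the validator
    iterations: int
    invariant_violations: int
    cost_usd: float


def summarize(results: List[RunResult]) -> Dict[str, float]:
    if not results:
        raise ValueError("need at least one run to summarize")
    converged = [r for r in results if r.converged]
    total_iterations = sum(r.iterations for r in results)
    total_cost = sum(r.cost_usd for r in results)
    return {
        "convergence_rate": len(converged) / len(results),
        "mean_iterations_to_convergence": (
            mean(r.iterations for r in converged) if converged else float("nan")
        ),
        # violations per iteration, across all runs
        "invariant_violation_rate": (
            sum(r.invariant_violations for r in results) / total_iterations
            if total_iterations else 0.0
        ),
        # failed runs still cost money, so they count against each success
        "cost_per_convergence": total_cost / len(converged) if converged else float("inf"),
    }
```

Charging failed runs to the successes is deliberate: a loop that converges 60% of the time is paying for the other 40% too.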

Step 12: Design the human-in-the-loop escalation

Loops stall. Invariants conflict. Models hallucinate. You need a clean handoff protocol, not a Slack alert.

@dataclass
class EscalationPacket:
    loop_id: str
    stuck_at_iteration: int
    state_snapshot: LoopState
    failing_invariants: List[InvariantViolation]
    model_confidence_trajectory: List[float]
    suggested_interventions: List[Intervention]  # Pre-computed by a "meta-loop"
    context_summary: str  # Human-readable, < 500 tokens

UI requirement: The human sees the loop's reasoning trail, not just the current state. They can patch state (add a fact, override a subgoal, relax an invariant) and resume. The loop continues from the patched state.
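The patch-and-resume step is easy to get right if state is immutable: a patch is just a new state derived from the stuck one. A sketch with a tiny stand-in `State` and a hypothetical `apply_patch` helper:

```python
from dataclasses import dataclass, replace
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class State:
    iteration: int
    subgoal: str
    facts: Tuple[str, ...] = ()
    relaxed_invariants: Tuple[str, ...] = ()


def apply_patch(
    state: State,
    *,
    add_facts: Iterable[str] = (),
    subgoal: Optional[str] = None,
    relax: Iterable[str] = (),
) -> State:
    """Return a patched copy; the stuck state stays intact for the audit trail."""
    return replace(
        state,
        subgoal=subgoal if subgoal is not None else state.subgoal,
        facts=state.facts + tuple(add_facts),
        relaxed_invariants=state.relaxed_invariants + tuple(relax),
    )


stuck = State(iteration=7, subgoal="find return window", facts=("order located",))
patched = apply_patch(
    stuck,
    add_facts=["fact supplied by reviewer"],
    subgoal="draft reply using reviewer fact",
)
```

Because `stuck` is never mutated, the trace shows both what the loop believed and what the human changed, and the loop resumes from `patched` at the same iteration number.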

Step 13: Implement cost and latency budgets

Loops are expensive. A 15-iteration loop with 8k tokens of context each turn reads about 120k input tokens, which works out to ~$2-5 per run at 2024 prices for top-tier models. You need budget-aware scheduling.

class BudgetScheduler:
    def __init__(self, daily_budget_usd: float, max_latency_s: float):
        self.daily_budget = daily_budget_usd
        self.max_latency = max_latency_s
        self.spent_today = 0.0

    def can_launch(self, estimated_cost: float, estimated_latency: float) -> bool:
        return (self.spent_today + estimated_cost <= self.daily_budget and
                estimated_latency <= self.max_latency)

    def record_run(self, actual_cost: float):
        self.spent_today += actual_cost

Operational rule: If a loop type exceeds budget 3 days in a row, you don't get more budget — you get a redesign ticket.
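Two pieces of arithmetic from this step are worth writing down: the per-run estimate and the three-days-in-a-row rule. Both functions below are sketches, and the prices and output-token count in the example call are placeholder assumptions, not any provider's rate card.

```python
from typing import Sequence


def estimate_run_cost(
    iterations: int,
    input_tokens_per_turn: int,
    output_tokens_per_turn: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Every turn re-sends its context, so input cost scales with iterations."""
    input_cost = iterations * input_tokens_per_turn * usd_per_million_input / 1_000_000
    output_cost = iterations * output_tokens_per_turn * usd_per_million_output / 1_000_000
    return input_cost + output_cost


def needs_redesign(daily_spend: Sequence[float], daily_budget: float, streak: int = 3) -> bool:
    """True if spend exceeded budget on `streak` consecutive days."""
    run = 0
    for spent in daily_spend:
        run = run + 1 if spent > daily_budget else 0
        if run >= streak:
            return True
    return False


# Placeholder inputs: 15 turns, 8k tokens in, 500 tokens out,
# $30 and $60 per million tokens (assumed, adjust to your model).
example_cost = estimate_run_cost(15, 8_000, 500, 30.0, 60.0)
```

With those placeholder numbers the estimate lands near $4 per run, and almost all of it is input tokens. That's why pruning working memory is a cost control as much as a quality one.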

Step 14: Ship the loop lifecycle

Loop engineering doesn't end at deploy. You need:

Phase	Artifact	Cadence
Development	Local replay harness, synthetic test cases	Per PR
Staging	Shadow mode (loop runs, human decides, compare)	2 weeks
Canary	1% traffic, full tracing, auto-rollback on invariant spike	1 week
Production	Daily convergence report, weekly invariant audit, monthly schema migration	Ongoing
Retirement	Migration plan for dependent loops, data export	When replaced

The meta-loop: Treat your loop fleet as a system. A "loop registry" tracks versions, dependencies, convergence SLAs, and ownership. Loops call loops. The registry prevents circular dependencies and cascade failures.
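The circular-dependency check is the easiest part of a registry to build. This sketch leans on `graphlib.TopologicalSorter` from the standard library (Python 3.9+); the `LoopRegistry` class and the loop names are hypothetical, and versions, SLAs, and ownership are left out.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, Iterable, List, Set


class LoopRegistry:
    def __init__(self) -> None:
        self._calls: Dict[str, Set[str]] = {}   # loop name -> loops it calls

    def register(self, name: str, calls: Iterable[str] = ()) -> None:
        """Add a loop, rejecting it if the call graph would become cyclic."""
        candidate = {**self._calls, name: set(calls)}
        try:
            # static_order() raises CycleError if the graph has a cycle
            tuple(TopologicalSorter(candidate).static_order())
        except CycleError as exc:
            raise ValueError(
                f"registering {name!r} would create a cycle: {exc.args[1]}"
            ) from exc
        self._calls = candidate

    def start_order(self) -> List[str]:
        """Callees first, so dependencies are deployed before their callers."""
        return list(TopologicalSorter(self._calls).static_order())


registry = LoopRegistry()
registry.register("refund", calls=["policy_lookup"])
registry.register("policy_lookup", calls=["search"])
```

Registering `search` with a call back to `refund` would now raise, which is the cheapest possible place to catch a cascade.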

Practical example: Automated code review loop

Let's walk a real loop through the 14 steps.

The task

Review PRs for: security issues, performance regressions, API compatibility breaks, and test coverage gaps. Post inline comments. Approve or request changes.

Loop state (Step 2)

@dataclass(frozen=True)  # LoopState is frozen, and a frozen dataclass cannot have a non-frozen subclass
class CodeReviewState(LoopState):
    pr_metadata: PRMetadata
    file_analyses: Dict[str, FileAnalysis]      # Per-file results
    cross_file_findings: List[CrossFileFinding] # Patterns spanning files
    review_comments: List[ReviewComment]        # Accumulated output
    files_remaining: List[str]                  # Work queue

Invariants (Step 3)

files_remaining strictly decreases each iteration
No file analyzed twice (idempotency)
Total comments < 50 (prevent spam)
Critical findings block approval
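Those four invariants translate almost line for line into checks over two consecutive snapshots. A sketch, with a flattened `ReviewSnapshot` that I made up in place of the full `CodeReviewState`:

```python
from dataclasses import dataclass
from typing import List

MAX_COMMENTS = 50


@dataclass(frozen=True)
class ReviewSnapshot:
    files_remaining: List[str]
    files_analyzed: List[str]
    comment_count: int
    critical_findings: int
    approved: bool = False


def violated_invariants(prev: ReviewSnapshot, curr: ReviewSnapshot) -> List[str]:
    """Compare the previous and current iteration; return names of broken invariants."""
    violations: List[str] = []
    if len(curr.files_remaining) >= len(prev.files_remaining):
        violations.append("files_remaining_strictly_decreases")
    if len(set(curr.files_analyzed)) != len(curr.files_analyzed):
        violations.append("no_file_analyzed_twice")
    if curr.comment_count >= MAX_COMMENTS:
        violations.append("comment_budget")
    if curr.critical_findings > 0 and curr.approved:
        violations.append("critical_findings_block_approval")
    return violations
```

The first invariant is also the convergence argument: a work queue that shrinks every iteration has to empty out eventually.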

Topology (Step 4)

Hierarchical: Planner (prioritizes files) → Analyzers (parallel, per-file) → Aggregator (synthesizes, decides approval).

Observation function (Step 5)

Extracts: AST patterns, dependency changes, test diffs, prior review comments on same files. Prunes to 3k tokens per analyzer.

Decision prompt (Step 6)

Analyzer prompt receives: file diff + relevant context + security rules + performance patterns. Outputs: {"findings": [...], "confidence": 0.87, "needs_human": false}.

Action executor (Step 7)

Tools: get_file_diff, search_codebase, post_review_comment. All idempotent via (pr_number, file_path, comment_line) keys.

Invariant enforcement (Step 8)

After each analyzer: check comment budget, check duplicate findings, verify confidence > 0.6 for auto-posted comments.

Long-term memory (Step 9)

Stores: repo-specific patterns (e.g., "this team uses custom auth middleware"), false positive history, style guide embeddings.

Results after 3 months in production

Metric	Before (prompt chain)	After (loop)
False positive rate	34%	8%
Missed critical issues	12%	2%
Avg iterations	N/A (single pass)	4.2
Human escalation rate	28%	6%
Cost per PR	$0.12	$0.31

The cost increased. The quality converged. That's the trade. Loops aren't cheaper — they're reliable.

Use cases where loops win

Domain	Why loops > chains
Incident response	Unknown steps, tool failures, need to pivot mid-stream
Complex code generation	Compile → test → fix cycles, cross-file consistency
Research synthesis	Iterative deepening, contradiction resolution, citation verification
Customer support resolution	Multi-turn dialogue, policy lookup, action execution, confirmation
Data pipeline debugging	Hypothesis → query → observe → refine, schema drift handling
Compliance auditing	Evidence gathering, rule interpretation, gap tracking, remediation

Common thread: The path to the answer isn't known upfront. The system must discover it through interaction.

Pros and cons

Pros

Convergence guarantees — Invariants + termination predicates = bounded behavior
Observability — Every iteration is a checkpoint. Time-travel debugging is real.
Self-healing — Remediation actions let loops recover from common failures
Learning — Long-term memory compounds value across runs
Human-AI collaboration — Clean escalation with state patching, not context dumping

Cons

Latency — 5-20 iterations × 2-8s each = 10-160s. Not for user-facing sync paths.
Cost — 3-10x prompt chains. Budget governance is mandatory.
Complexity — State schema, invariants, tracing, evaluation harness. Real engineering investment.
Debugging difficulty — Emergent behavior in multi-agent loops can surprise even the designer.
Model dependency — Convergence quality tracks model reasoning capability. GPT-4o loops converge where 3.5-turbo loops stall.

The mindset shift

Prompt engineering asks: "How do I get the model to give me the right answer?"

Loop engineering asks: "How do I build a system that converges on the right answer even when the model gives me the wrong one?"

The second question changes everything. You stop optimizing prompts. You start designing state machines with LLMs as transition functions. You instrument. You invariant. You evaluate convergence, not accuracy.

That night at 2:47 AM, the refund loop had no invariants. No termination predicate. No long-term memory of the actual return policy. It was a prompt chain in a trench coat.

The replacement loop? 14 steps. 4 invariants. A termination predicate that catches "model says done but policy not checked." A long-term memory that stores the actual policy document. An escalation packet that shows the human exactly where the model drifted.

It hasn't hallucinated a return window in 14 months.

FAQ

What's the minimum viable loop?
State schema + observation function + decision prompt + action executor + termination predicate + one invariant. Skip any component and you have a chain, not a loop.

Can I use loops with smaller/cheaper models?
Yes, but convergence iterations increase. 3.5-turbo typically needs 2-3x iterations vs 4o for the same task. Cost often evens out. Latency compounds.

How do I prevent infinite loops?
Hard iteration cap (invariant) + progress invariant (must advance every 3 iterations) + stall detection (same subgoal 3x) + budget invariant. Four layers.
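The stall-detection layer is the one people skip, and it's a few lines. A sketch, assuming you keep the list of subgoals the model has proposed so far and that trivial differences in case or whitespace shouldn't count as progress:

```python
from typing import List


def is_stalled(subgoals: List[str], window: int = 3) -> bool:
    """True if the last `window` subgoals are the same after normalization."""
    if len(subgoals) < window:
        return False
    recent = {s.strip().lower() for s in subgoals[-window:]}
    return len(recent) == 1
```

Exact matching is deliberately naive. A model that rephrases the same subgoal each turn will slip past it, which is why the progress invariant sits alongside it.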

What's the difference between a loop and an agent?
Marketing. "Agent" implies autonomy. "Loop" implies engineering discipline. I use "loop" because it forces you to think in control theory terms.

Do I need a vector database for long-term memory?
For semantic retrieval: yes. For structured facts (decisions, prefs, schemas): a KV store is faster and cheaper. Use both.

How do I test loops locally?
Replay harness: feed recorded tool results into the loop, verify it makes the same decisions. Synthetic test cases: generate 100 variations of your task, measure convergence rate. CI gate: convergence rate > 95% on test set.

What about multi-agent loops — are they worth it?
For subjective quality (writing, design review, architecture critique): yes. Proposer-critic-judge reduces hallucination 40-60% in our benchmarks. For deterministic tasks: single-agent reflexive is simpler and faster.

How do I handle schema migrations for loop state?
Version every state schema. Migration functions are pure: v1_state → v2_state. Run migrations at loop start. Keep last 3 versions runnable. Test migrations in CI.
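In practice that's a dict of one-step migration functions and a loop that applies them in order. A sketch over plain dicts; the field renames are invented for the example and don't correspond to a real schema history.

```python
from typing import Any, Callable, Dict

StateDict = Dict[str, Any]


def v1_to_v2(state: StateDict) -> StateDict:
    """Hypothetical change: 'memory' was renamed to 'long_term_memory'."""
    migrated = dict(state)                       # copy: migrations stay pure
    migrated["long_term_memory"] = migrated.pop("memory", {})
    migrated["version"] = 2
    return migrated


def v2_to_v3(state: StateDict) -> StateDict:
    """Hypothetical change: 'termination_signal' became a required field."""
    migrated = dict(state)
    migrated.setdefault("termination_signal", None)
    migrated["version"] = 3
    return migrated


MIGRATIONS: Dict[int, Callable[[StateDict], StateDict]] = {1: v1_to_v2, 2: v2_to_v3}
CURRENT_VERSION = 3


def migrate(state: StateDict) -> StateDict:
    """Apply one-step migrations until the state reaches CURRENT_VERSION."""
    while state["version"] < CURRENT_VERSION:
        state = MIGRATIONS[state["version"]](state)
    return state
```

Each function only knows about one version hop, so testing in CI means one fixture per hop plus one end-to-end case from the oldest supported version.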

Can loops call other loops?
Yes. Hierarchical loops. But register them in a loop registry with explicit contracts. Prevent circular calls. Monitor cascade latency.

What's the biggest mistake teams make?
Treating invariants as optional. "We'll add them later." Later never comes. The first production incident adds them for you — at 3 AM.

Top comments (2)

Mike Czerwinski • Jun 30

This is the most honest loop write-up I've read, especially "convergence rate, not accuracy" and the idempotency war story. One thing I'd pull to the front. In your topology table the multi-agent debate (Proposer/Critic/Judge) gets the spotlight, but the Judge is another LLM, same model class, same blind spots as the proposer. Decorrelating the role doesn't decorrelate the weights, so two instances can converge happily on a confident-wrong fixpoint. The genuinely exogenous verifier in your whole design is the invariant set. should_terminate and the InvariantSet can't be sweet-talked because they're code, not a model. That's the part that actually bounds behavior. I'd argue the invariants are the product and the critic agent is a nice-to-have, not the other way round. Have you seen debate loops pass a check that a hard invariant would have caught, or does the Judge mostly earn its cost?

Sensei • Jul 1

That's correct, actually. "Decorrelating the role doesn't decorrelate the weights" is such a good way to put it. I have absolutely seen a Proposer, Critic, and Judge all happily high-five each other over a completely broken piece of code just because the LLM made the explanation sound super fancy and smart.

To answer your question: yeah, the Judge definitely fails where a hard invariant would’ve caught it instantly. LLMs check for vibes and plausibility, but code invariants check for actual reality. You can't gaslight a unit test or a compiler LoL 😅.

I really like your point that invariants are the actual product and the critic is just a nice-to-have. Definitely gonna think about moving that concept to the front.