DEV Community

Sensei

Loop Engineering: The 14-step roadmap from prompter to loop designer

The production incident report landed in my inbox at 2:47 AM on a Tuesday.

A customer's refund request had bounced between three different LLM calls — classification, policy lookup, response generation — and somehow hallucinated a 90-day return window that didn't exist. The prompt chain looked clean on paper. Each step worked in isolation. But the loop connecting them had no memory, no guardrails, and no way to course-correct when the classifier drifted.

I'd written the prompts. I'd stitched the chain. I called it an "agentic workflow." It was really just a fragile pipeline wearing a trench coat.

That night forced a distinction I now teach every engineer I work with: Prompting is asking a model to think. Loop engineering is building a system that thinks reliably.


What is loop engineering?

Loop engineering is the discipline of designing, instrumenting, and operating recursive LLM execution cycles that maintain state, enforce invariants, and converge on correct outcomes across unbounded interaction horizons — unlike prompt chains, which execute once and hope for the best.

Not a prompt chain that calls an LLM three times in sequence. Not a RAG pipeline with a reranker. A loop: an LLM that observes, decides, acts, observes the result, and decides again — with persistent context, explicit termination conditions, and observable intermediate states at every iteration.

The boundary is sharp. A prompt chain is a DAG. A loop is a state machine with an LLM as the transition function. The chain asks "what's the answer?" The loop asks "what's the next action given everything that happened so far?" That difference — statefulness with feedback — is why loops handle ambiguity, tool failure, and multi-step reasoning where chains collapse.

The flywheel: observe → decide → act → observe. Each turn tightens the context. The loop converges or it hits a guardrail. That convergence guarantee is the product.
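To make the state-machine framing concrete, here is a minimal sketch of the observe → decide → act cycle. The `State` fields, `MAX_ITERATIONS`, and the stub `decide`/`act` functions are illustrative assumptions; in a real loop, `decide` would wrap the LLM call.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional

MAX_ITERATIONS = 10  # hypothetical guardrail


@dataclass(frozen=True)
class State:
    iteration: int
    value: int                          # stand-in for "everything that happened so far"
    done_reason: Optional[str] = None   # set once the loop terminates


def run_loop(
    state: State,
    decide: Callable[[State], str],
    act: Callable[[str, State], int],
) -> State:
    """Run decide -> act turns until the loop converges or hits the guardrail."""
    while state.done_reason is None:
        if state.iteration >= MAX_ITERATIONS:
            return replace(state, done_reason="max_iterations")
        action = decide(state)            # the LLM call would go here
        if action == "stop":
            return replace(state, done_reason="success")
        new_value = act(action, state)    # the tool call would go here
        state = replace(state, iteration=state.iteration + 1, value=new_value)
    return state


# Stub "model": keep incrementing until the value reaches 3.
def stub_decide(state: State) -> str:
    return "stop" if state.value >= 3 else "increment"


def stub_act(action: str, state: State) -> int:
    return state.value + 1


final = run_loop(State(iteration=0, value=0), stub_decide, stub_act)
```

The transition function is swappable: replace `stub_decide` with a model call and the control flow, the cap, and the termination reason stay the same.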


Why the prompter-to-loop-designer shift matters

You're already writing prompts. Maybe you've built chains. The market doesn't need more prompt engineers. It needs engineers who can close the loop — who understand that reliability comes from structure, not clever wording.

| Dimension | Prompter | Loop Designer |
|---|---|---|
| Mental model | Single-turn optimization | Multi-turn convergence |
| Failure mode | "Fix the prompt" | "Fix the invariant" |
| State | Implicit, in context window | Explicit, persisted, versioned |
| Debugging | Re-run and pray | Replay with time-travel |
| Scaling | Longer prompts | More loops, tighter guards |
| Ownership | "It works on my machine" | "It converges in production" |

The shift isn't semantic. It's the difference between a demo that impresses investors and a system that survives Black Friday.


The 14-step roadmap

Phase 1: Foundations (Steps 1–4)

Step 1: Map the decision surface

Before writing a single loop, enumerate every decision point in your task. Not steps — decisions. A decision has alternatives, consequences, and a correctness criterion.

For a code-review loop:

  • Decision: Is this PR ready to merge? (Alternatives: approve, request changes, block)
  • Decision: Which file should I examine next? (Alternatives: 47 changed files)
  • Decision: Is this pattern a bug or a style nit? (Alternatives: block, warn, ignore)

Tool: Draw a decision tree. If it has more than 7 leaf nodes, you need a loop. If it has cycles, you definitely need a loop.
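The leaf-count and cycle checks can be automated once the decision map is written down as a graph. This is a rough sketch: the adjacency-dict representation and the node names are my own assumptions, and every node is expected to appear as a key.

```python
from typing import Dict, List

DecisionGraph = Dict[str, List[str]]  # node -> nodes it can lead to

LEAF_THRESHOLD = 7  # the rule of thumb from this step


def count_leaves(graph: DecisionGraph) -> int:
    """Count nodes with no outgoing edges."""
    return sum(1 for children in graph.values() if not children)


def has_cycle(graph: DecisionGraph) -> bool:
    """Depth-first search with three colors; a back edge to a GRAY node is a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node: str) -> bool:
        color[node] = GRAY
        for child in graph.get(node, []):
            state = color.get(child, WHITE)
            if state == GRAY:
                return True
            if state == WHITE and visit(child):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in list(graph))


def needs_loop(graph: DecisionGraph) -> bool:
    return has_cycle(graph) or count_leaves(graph) > LEAF_THRESHOLD


# Hypothetical map of the code-review decisions above.
review = {
    "examine_file": ["bug_or_nit"],
    "bug_or_nit": ["block", "warn", "ignore"],
    "block": [],
    "warn": ["examine_file"],     # after a warning, move on to the next file
    "ignore": ["examine_file"],
}
```

Here `needs_loop(review)` is true because of the cycle back to `examine_file`, even though the graph has a single leaf.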

Step 2: Define the loop state schema

State is not "the conversation history." State is a typed, versioned data structure that survives context truncation and enables replay.

from dataclasses import dataclass
from typing import List, Optional

# TaskSpec, WorkingMemory, and the other field types are defined elsewhere.
@dataclass(frozen=True)
class LoopState:
    loop_id: str                           # Stable ID, used for tracing and replay
    iteration: int
    task_spec: TaskSpec                    # Immutable goal
    working_memory: WorkingMemory          # Scratchpad, pruned each turn
    long_term_memory: LongTermMemory       # Facts, decisions, user prefs
    tool_results: List[ToolResult]         # Append-only log
    decisions: List[Decision]              # Append-only log of model decisions
    invariants: InvariantSet               # Must hold at every iteration
    termination_signal: Optional[TerminationReason]
    version: int                           # Schema version for migration

Rule: If you can't serialize it to JSON and replay the loop from step 3, it's not state — it's hope.
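A quick way to check the rule is a JSON round trip. The sketch below uses a cut-down `MiniState` (a hypothetical stand-in for the full schema) and asserts that a checkpoint restores to an equal value.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class MiniState:
    iteration: int
    goal: str
    facts: List[str] = field(default_factory=list)
    termination_signal: Optional[str] = None
    version: int = 1


def save(state: MiniState) -> str:
    """Serialize to JSON; sort_keys keeps checkpoints diff-friendly."""
    return json.dumps(asdict(state), sort_keys=True)


def load(payload: str) -> MiniState:
    """Rebuild the state from a checkpoint."""
    return MiniState(**json.loads(payload))


checkpoint = MiniState(iteration=3, goal="review PR", facts=["diff touches auth.py"])
restored = load(save(checkpoint))
assert restored == checkpoint  # replay from step 3 starts from identical state
```

Nested dataclasses need a matching reconstruction step in `load`; the flat schema here avoids that on purpose.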

Step 3: Write the convergence contract

Every loop needs a termination predicate that is provably reachable. Not "when the model says done." A predicate over state:

def should_terminate(state: LoopState) -> TerminationReason | None:
    # Hard cap first: this is what makes termination reachable.
    if state.iteration >= MAX_ITERATIONS:
        return TerminationReason.MAX_ITERATIONS_EXCEEDED
    if state.invariants.all_satisfied() and state.working_memory.is_complete():
        return TerminationReason.SUCCESS
    # tool_results is a plain list, so guard the empty case on iteration 0.
    if state.tool_results and state.tool_results[-1].is_fatal_error():
        return TerminationReason.FATAL_ERROR
    # Only test for stalls once a full three-iteration window exists.
    if state.iteration >= 3 and not state.invariants.progress_made_since(state.iteration - 3):
        return TerminationReason.STALLED
    return None

No predicate → infinite loop → production incident. I've seen both.
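For a version you can run without the surrounding types, here is a stripped-down predicate over a progress log. The `progress_log` metric (items resolved after each iteration) and the constants are assumptions made for the sketch.

```python
from typing import List, Optional

MAX_ITERATIONS = 20
STALL_WINDOW = 3


def should_stop(progress_log: List[int], is_complete: bool) -> Optional[str]:
    """progress_log[i] is the number of items resolved after iteration i."""
    iteration = len(progress_log)
    if is_complete:
        return "success"
    if iteration >= MAX_ITERATIONS:
        return "max_iterations_exceeded"
    # Stalled: no improvement compared with STALL_WINDOW iterations ago.
    if iteration > STALL_WINDOW and progress_log[-1] <= progress_log[-1 - STALL_WINDOW]:
        return "stalled"
    return None
```

Because `iteration` grows by one per turn and the cap is finite, the predicate fires after at most `MAX_ITERATIONS` turns regardless of what the model does.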

Step 4: Choose your loop topology

Three patterns cover 90% of cases. Pick one. Don't hybridize until you have a reason.

| Topology | Structure | Use when |
|---|---|---|
| Single-agent reflexive | One LLM, internal monologue, tool calls | Deterministic tasks, clear specs, low ambiguity |
| Multi-agent debate | Proposer + Critic + Judge (or n-persona ensemble) | Subjective quality, adversarial verification needed |
| Hierarchical decomposition | Planner → Executor(s) → Aggregator | Long horizons, parallelizable subtasks, distinct skills |

My default: Single-agent reflexive with a critic pass at the end. Simpler to debug. Harder to fool yourself.


Phase 2: The loop body (Steps 5–9)

Step 5: Design the observation function

The observation function converts raw tool outputs, environment events, and prior decisions into the working memory for the next turn. It's the loop's eyes.

def observe(state: LoopState, tool_result: ToolResult) -> WorkingMemory:
    # 1. Extract structured facts (not summaries)
    facts = extract_facts(tool_result)

    # 2. Score relevance to current subgoal
    scored = score_relevance(facts, state.working_memory.current_subgoal)

    # 3. Prune aggressively — context window is budget
    pruned = prune_to_budget(scored, budget=TOKEN_BUDGET * 0.3)

    # 4. Flag contradictions with long-term memory
    contradictions = detect_contradictions(pruned, state.long_term_memory)

    return WorkingMemory(
        facts=pruned,
        contradictions=contradictions,
        current_subgoal=state.working_memory.current_subgoal,
        token_count=count_tokens(pruned)
    )

Lesson learned: Summarization loses signal. Extraction preserves it. Store facts, not narratives.
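One plausible shape for the pruning step is a greedy pass that keeps the most relevant facts that still fit. This is a sketch, not the production function: `ScoredFact` is hypothetical and `count_tokens` is a crude whitespace approximation rather than a real tokenizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScoredFact:
    text: str
    relevance: float  # higher means more relevant to the current subgoal


def count_tokens(text: str) -> int:
    # Whitespace split as a stand-in; a real tokenizer gives different counts.
    return len(text.split())


def prune_to_budget(facts: List[ScoredFact], budget: int) -> List[ScoredFact]:
    """Keep facts in descending relevance order while they fit in the budget."""
    kept: List[ScoredFact] = []
    used = 0
    for fact in sorted(facts, key=lambda f: f.relevance, reverse=True):
        cost = count_tokens(fact.text)
        if used + cost <= budget:
            kept.append(fact)
            used += cost
    return kept


facts = [
    ScoredFact("order 1182 was delivered on Monday", 0.9),
    ScoredFact("customer emailed twice about shipping delays last month", 0.4),
    ScoredFact("order total was 42 USD", 0.7),
]
pruned = prune_to_budget(facts, budget=11)
```

Facts are dropped whole rather than truncated, which is the point of extraction over summarization: what survives is still exact.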

Step 6: Engineer the decision prompt

The decision prompt is the loop's brain. It receives: task spec + working memory + long-term memory + invariants. It outputs: next action + updated subgoal + confidence.

Template structure that works:

<task_spec>{{task_spec}}</task_spec>
<invariants>{{invariants.render()}}</invariants>
<working_memory>{{working_memory.render()}}</working_memory>
<long_term_memory>{{long_term_memory.render(relevance_threshold=0.7)}}</long_term_memory>

<instruction>
You are an iteration in a convergent loop. Your job: propose the single highest-leverage next action.
Output JSON only:
{
  "action": {"tool": "...", "args": {...}},
  "subgoal": "specific, measurable, time-bounded",
  "confidence": 0.0-1.0,
  "reasoning": "one paragraph, cites specific facts from memory"
}
</instruction>

Critical: The prompt must reference invariants by name. The model needs to know what it's not allowed to violate.
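"Output JSON only" is a request, not a guarantee, so the response is worth validating before the executor sees it. A minimal sketch, assuming the four-key schema from the template; the error messages and the choice to raise `ValueError` are mine.

```python
import json
from typing import Any, Dict

REQUIRED_KEYS = {"action", "subgoal", "confidence", "reasoning"}


def parse_decision(raw: str) -> Dict[str, Any]:
    """Parse and validate a model response; raise ValueError on any violation."""
    decision = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(decision, dict):
        raise ValueError("decision must be a JSON object")
    missing = REQUIRED_KEYS - decision.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    action = decision["action"]
    if not isinstance(action, dict) or "tool" not in action or "args" not in action:
        raise ValueError("action must contain 'tool' and 'args'")
    confidence = decision["confidence"]
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be a number in [0.0, 1.0]")
    return decision


good = parse_decision(
    '{"action": {"tool": "search_codebase", "args": {"query": "auth"}},'
    ' "subgoal": "locate auth middleware", "confidence": 0.8,'
    ' "reasoning": "diff touches auth.py"}'
)
```

A failed parse is itself an observation: feed the error back into working memory rather than silently retrying the same prompt.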

Step 7: Build the action executor with idempotency

Every tool call in a loop must be idempotent or compensatable. No exceptions.

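import hashlib
import json

class IdempotentToolExecutor:
    def execute(self, action: Action, state: LoopState) -> ToolResult:
        # Key on the action itself, scoped to this loop run. The iteration is left
        # out so that a retry on a later turn hits the cache instead of re-executing.
        # hashlib is used because built-in hash() is salted per process and cannot
        # hash dict arguments. Assumes action.args is JSON-serializable.
        payload = json.dumps([state.loop_id, action.tool, action.args], sort_keys=True)
        idempotency_key = hashlib.sha256(payload.encode("utf-8")).hexdigest()

        # Check if we've already executed this exact action
        if (cached := self.cache.get(idempotency_key)) is not None:
            return ToolResult.from_cache(cached)

        # Execute with timeout and retry policy
        result = self._execute_with_policy(action)

        # Store for replay
        self.cache.set(idempotency_key, result)
        return result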

War story: A loop retried a non-idempotent "send email" tool 47 times during a model hallucination spiral. Customer got 47 emails. We now wrap every external call.
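The email incident is easy to reproduce in miniature. In this sketch the executor, the in-memory cache, and the `send_email` stub are all hypothetical; the key is a SHA-256 digest of the loop ID, tool name, and arguments.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List, Tuple


def idempotency_key(loop_id: str, tool: str, args: Dict[str, Any]) -> str:
    """Stable across processes and independent of argument order."""
    payload = json.dumps({"loop": loop_id, "tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class InMemoryExecutor:
    def __init__(self, tools: Dict[str, Callable[..., Any]]):
        self.tools = tools
        self.cache: Dict[str, Any] = {}

    def execute(self, loop_id: str, tool: str, args: Dict[str, Any]) -> Any:
        key = idempotency_key(loop_id, tool, args)
        if key in self.cache:
            return self.cache[key]          # replayed, not re-executed
        result = self.tools[tool](**args)
        self.cache[key] = result
        return result


sent: List[Tuple[str, str]] = []


def send_email(to: str, subject: str) -> str:
    sent.append((to, subject))              # the side effect we must not repeat
    return f"sent to {to}"


executor = InMemoryExecutor({"send_email": send_email})
for _ in range(47):                         # the model asks 47 times
    executor.execute(
        "loop-1", "send_email",
        {"to": "customer@example.com", "subject": "Refund update"},
    )
assert len(sent) == 1                       # the customer gets one email
```

A cache like this only covers exact repeats within one loop run; actions that differ slightly still need a compensating action or a downstream dedupe.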

Step 8: Implement invariant enforcement

Invariants are the loop's guardrails. They run after observation, before the next decision. Hard stops.

class InvariantSet:
    def __init__(self):
        self.invariants = [
            Invariant(
                name="budget_not_exceeded",
                check=lambda s: s.working_memory.token_count < TOKEN_BUDGET,
                remediation="summarize_and_prune"
            ),
            Invariant(
                name="no_contradictory_facts",
                check=lambda s: not s.working_memory.contradictions,
                remediation="escalate_to_human"
            ),
            Invariant(
                name="progress_per_three_iterations",
                check=lambda s: s.invariants.progress_made_since(s.iteration - 3),
                remediation="change_strategy"
            ),
        ]

    def enforce(self, state: LoopState) -> LoopState:
        for inv in self.invariants:
            if not inv.check(state):
                # Remediations are stored by name, so resolve the name to a callable.
                # REMEDIATIONS is a name -> function registry, assumed defined elsewhere.
                state = REMEDIATIONS[inv.remediation](state)
                log.warning(f"Invariant violated: {inv.name}, applied {inv.remediation}")
        return state

Design principle: Every invariant has a remediation, not just a failure. The loop heals itself.
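Here is a small runnable variant where each remediation is a function rather than a name, so the check-then-heal cycle can be exercised directly. The `State` shape and the `keep_newest` remediation are illustrative assumptions.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class State:
    facts: Tuple[str, ...]
    max_facts: int = 3


@dataclass(frozen=True)
class Invariant:
    name: str
    check: Callable[[State], bool]
    remediation: Callable[[State], State]


def keep_newest(state: State) -> State:
    """Drop the oldest facts until the budget holds again."""
    return replace(state, facts=state.facts[-state.max_facts:])


INVARIANTS = [
    Invariant(
        name="fact_budget_not_exceeded",
        check=lambda s: len(s.facts) <= s.max_facts,
        remediation=keep_newest,
    ),
]


def enforce(state: State, invariants: List[Invariant]) -> Tuple[State, List[str]]:
    """Apply remediations; fail loudly if a remediation does not restore the invariant."""
    violated: List[str] = []
    for inv in invariants:
        if not inv.check(state):
            violated.append(inv.name)
            state = inv.remediation(state)
            if not inv.check(state):
                raise RuntimeError(f"remediation failed for {inv.name}")
    return state, violated


healed, violations = enforce(State(facts=("a", "b", "c", "d", "e")), INVARIANTS)
```

Re-checking after the remediation is the part that keeps "self-healing" honest: a remediation that does not restore the invariant should stop the loop, not be logged and ignored.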

Step 9: Close the long-term memory loop

Working memory is per-iteration. Long-term memory accumulates across loops — and across sessions.

class LongTermMemory:
    def __init__(self, vector_store: VectorStore, kv_store: KVStore):
        self.vector = vector_store      # Semantic facts, embeddings
        self.kv = kv_store              # Structured: user prefs, decisions, schemas

    def consolidate(self, state: LoopState) -> None:
        # Extract durable facts from this loop's trajectory
        facts = extract_durable_facts(state.tool_results)
        for fact in facts:
            self.vector.upsert(embedding=fact.embedding, metadata=fact.meta)

        # Record decisions for future loops
        for decision in state.decisions:
            self.kv.set(f"decision:{decision.id}", decision.json())

    def retrieve(self, query: str, k: int = 5) -> List[Fact]:
        return self.vector.search(query, k=k)

Why it matters: A loop that learns from its own history converges faster. A loop that forgets repeats mistakes.


Phase 3: Production hardening (Steps 10–14)

Step 10: Instrument for time-travel debugging

You will need to answer "why did the loop do that at iteration 7?" — three months from now.

class LoopTracer:
    def __init__(self, event_store: EventStore):
        self.store = event_store

    def trace_iteration(self, state: LoopState, decision: Decision, result: ToolResult):
        self.store.append(LoopEvent(
            loop_id=state.loop_id,
            iteration=state.iteration,
            timestamp=utc_now(),
            state_snapshot=state.json(),           # Full state
            decision=decision.json(),              # What the model chose
            prompt=decision.prompt_sent,           # Exact prompt
            response=decision.raw_response,        # Exact response
            tool_result=result.json(),             # What happened
            invariants=state.invariants.status(),  # Pass/fail per invariant
        ))

Query pattern: SELECT * FROM events WHERE loop_id = ? ORDER BY iteration. Replay any iteration in a notebook. Compare branches. This is how you actually improve loops.
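The query pattern works as-is against SQLite, which is enough for a local replay notebook. The table layout below is a trimmed-down assumption (only a few of the event fields), stored as JSON text.

```python
import json
import sqlite3
from typing import Any, Dict, List, Tuple

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        loop_id     TEXT    NOT NULL,
        iteration   INTEGER NOT NULL,
        decision    TEXT    NOT NULL,   -- JSON
        tool_result TEXT    NOT NULL,   -- JSON
        PRIMARY KEY (loop_id, iteration)
    )
    """
)


def append_event(loop_id: str, iteration: int,
                 decision: Dict[str, Any], tool_result: Dict[str, Any]) -> None:
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?)",
        (loop_id, iteration, json.dumps(decision), json.dumps(tool_result)),
    )


def load_trace(loop_id: str) -> List[Tuple[int, Dict[str, Any], Dict[str, Any]]]:
    """Return one loop's events in iteration order, ready for replay."""
    rows = conn.execute(
        "SELECT iteration, decision, tool_result FROM events "
        "WHERE loop_id = ? ORDER BY iteration",
        (loop_id,),
    )
    return [(i, json.loads(d), json.loads(r)) for i, d, r in rows]


# Events can arrive out of order; the query restores the sequence.
append_event("loop-a", 1, {"tool": "post_review_comment"}, {"ok": True})
append_event("loop-a", 0, {"tool": "get_file_diff"}, {"lines": 42})
append_event("loop-b", 0, {"tool": "get_file_diff"}, {"lines": 7})
trace = load_trace("loop-a")
```

The composite primary key also rejects a second write for the same iteration, which catches double-tracing bugs early.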

Step 11: Build the evaluation harness

Unit tests for loops don't test prompts. They test convergence properties.

class LoopEvaluator:
    def evaluate_convergence(self, loop: Loop, test_cases: List[TestCase]) -> EvaluationReport:
        results = []
        for tc in test_cases:
            trace = loop.run(tc.input, max_iterations=20)
            results.append(ConvergenceResult(
                test_case=tc.id,
                terminated=trace.terminated,
                termination_reason=trace.termination_reason,
                iterations=trace.iterations,
                final_state_valid=tc.validator(trace.final_state),
                invariant_violations=trace.invariant_violations,
                token_cost=trace.total_tokens,
                latency_p95=trace.latency_p95,
            ))
        return EvaluationReport(results)

Metrics that matter: Convergence rate (not accuracy — convergence), iterations to convergence, invariant violation rate, cost per convergence. Accuracy is a consequence of convergence.
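The four metrics fall out of the per-run results with a few lines of arithmetic. In this sketch `RunResult` and the exact definitions (for example, violations counted per iteration) are my assumptions about how to operationalize them.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RunResult:
    converged: bool
    iterations: int
    invariant_violations: int
    cost_usd: float


def summarize(results: List[RunResult]) -> Dict[str, float]:
    converged = [r for r in results if r.converged]
    total_iterations = sum(r.iterations for r in results)
    total_cost = sum(r.cost_usd for r in results)
    return {
        "convergence_rate": len(converged) / len(results) if results else 0.0,
        "mean_iterations_to_convergence": (
            sum(r.iterations for r in converged) / len(converged) if converged else 0.0
        ),
        "violations_per_iteration": (
            sum(r.invariant_violations for r in results) / total_iterations
            if total_iterations else 0.0
        ),
        # Failed runs still cost money, so they are charged to the successes.
        "cost_per_convergence": total_cost / len(converged) if converged else float("inf"),
    }


report = summarize([
    RunResult(True, 4, 0, 0.30),
    RunResult(True, 6, 1, 0.50),
    RunResult(False, 20, 5, 1.20),
    RunResult(True, 2, 0, 0.20),
])
```

Dividing total spend by converged runs, rather than by all runs, is what makes a low convergence rate show up as a cost problem.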

Step 12: Design the human-in-the-loop escalation

Loops stall. Invariants conflict. Models hallucinate. You need a clean handoff protocol, not a Slack alert.

@dataclass
class EscalationPacket:
    loop_id: str
    stuck_at_iteration: int
    state_snapshot: LoopState
    failing_invariants: List[InvariantViolation]
    model_confidence_trajectory: List[float]
    suggested_interventions: List[Intervention]  # Pre-computed by a "meta-loop"
    context_summary: str  # Human-readable, < 500 tokens

UI requirement: The human sees the loop's reasoning trail, not just the current state. They can patch state (add a fact, override a subgoal, relax an invariant) and resume. The loop continues from the patched state.
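Patch-and-resume is simplest when state is immutable: the patch produces a new state and the stuck one stays available for comparison. A sketch with a hypothetical `State` and `apply_patch`; the `patched_by` audit field is my addition.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class State:
    iteration: int
    subgoal: str
    facts: Tuple[str, ...]
    patched_by: Tuple[str, ...] = ()   # audit trail of human interventions


def apply_patch(
    state: State,
    reviewer: str,
    *,
    add_fact: Optional[str] = None,
    override_subgoal: Optional[str] = None,
) -> State:
    """Return a patched copy; the original state is left untouched for replay."""
    facts = state.facts + ((add_fact,) if add_fact else ())
    return replace(
        state,
        facts=facts,
        subgoal=override_subgoal or state.subgoal,
        patched_by=state.patched_by + (reviewer,),
    )


stuck = State(iteration=7, subgoal="find return policy", facts=("customer asked for refund",))
patched = apply_patch(stuck, "alice", add_fact="policy document v2 is authoritative")
```

The loop then resumes from `patched` at iteration 7, and the trace shows exactly which reviewer changed what.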

Step 13: Implement cost and latency budgets

Loops are expensive. A 15-iteration loop with 8k tokens of context per turn consumes roughly 120k input tokens, or about $2-5 per run at 2024 prices. You need budget-aware scheduling.

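from datetime import date

class BudgetScheduler:
    def __init__(self, daily_budget_usd: float, max_latency_s: float):
        self.daily_budget = daily_budget_usd
        self.max_latency = max_latency_s
        self.spent_today = 0.0
        self.current_day = date.today()

    def _roll_over_if_new_day(self) -> None:
        # Reset the running total when the calendar day changes;
        # without this, the "daily" budget is really a lifetime budget.
        today = date.today()
        if today != self.current_day:
            self.current_day = today
            self.spent_today = 0.0

    def can_launch(self, estimated_cost: float, estimated_latency: float) -> bool:
        self._roll_over_if_new_day()
        return (self.spent_today + estimated_cost <= self.daily_budget and
                estimated_latency <= self.max_latency)

    def record_run(self, actual_cost: float) -> None:
        self._roll_over_if_new_day()
        self.spent_today += actual_cost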

Operational rule: If a loop type exceeds budget 3 days in a row, you don't get more budget — you get a redesign ticket.
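The `estimated_cost` input to a scheduler like this can come from simple token arithmetic. The prices and the 500-token output size below are placeholders chosen for illustration, not quotes from any provider; plug in your own.

```python
def estimate_run_cost(
    iterations: int,
    input_tokens_per_turn: int,
    output_tokens_per_turn: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Cost of one loop run, assuming every turn sends a full context window."""
    input_cost = iterations * input_tokens_per_turn * usd_per_million_input / 1_000_000
    output_cost = iterations * output_tokens_per_turn * usd_per_million_output / 1_000_000
    return input_cost + output_cost


# 15 turns x 8,000 input tokens = 120,000 input tokens.
# Placeholder prices: $30 per million input tokens, $60 per million output tokens.
estimate = estimate_run_cost(
    iterations=15,
    input_tokens_per_turn=8_000,
    output_tokens_per_turn=500,
    usd_per_million_input=30.0,
    usd_per_million_output=60.0,
)
# Input: 120,000 * 30 / 1e6 = 3.60; output: 7,500 * 60 / 1e6 = 0.45; total about 4.05 USD.
```

Input tokens dominate because the context is resent every turn, which is why aggressive pruning in the observation function pays for itself.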

Step 14: Ship the loop lifecycle

Loop engineering doesn't end at deploy. You need:

| Phase | Artifact | Cadence |
|---|---|---|
| Development | Local replay harness, synthetic test cases | Per PR |
| Staging | Shadow mode (loop runs, human decides, compare) | 2 weeks |
| Canary | 1% traffic, full tracing, auto-rollback on invariant spike | 1 week |
| Production | Daily convergence report, weekly invariant audit, monthly schema migration | Ongoing |
| Retirement | Migration plan for dependent loops, data export | When replaced |

The meta-loop: Treat your loop fleet as a system. A "loop registry" tracks versions, dependencies, convergence SLAs, and ownership. Loops call loops. The registry prevents circular dependencies and cascade failures.
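The circular-dependency check is the easiest part of a registry to build, since Python's standard library already ships a topological sorter that raises on cycles. A minimal sketch: `LoopRegistry` and its methods are hypothetical, and versions, SLAs, and ownership are left out.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, Iterable, List, Set


class LoopRegistry:
    """Tracks which loops call which, and refuses registrations that create a cycle."""

    def __init__(self) -> None:
        self._calls: Dict[str, Set[str]] = {}

    def register(self, name: str, calls: Iterable[str] = ()) -> None:
        # Validate against a copy so a rejected registration changes nothing.
        candidate = {loop: set(deps) for loop, deps in self._calls.items()}
        candidate[name] = set(calls)
        try:
            tuple(TopologicalSorter(candidate).static_order())
        except CycleError as exc:
            raise ValueError(
                f"registering {name!r} would create a circular call chain: {exc.args[1]}"
            ) from exc
        self._calls = candidate

    def start_order(self) -> List[str]:
        """Callees before callers, useful for deploys and health checks."""
        return list(TopologicalSorter(self._calls).static_order())


registry = LoopRegistry()
registry.register("policy_lookup")
registry.register("triage", calls=["policy_lookup"])
```

Calling `registry.register("policy_lookup", calls=["triage"])` at this point raises `ValueError` and leaves the registry as it was.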


Practical example: Automated code review loop

Let's walk a real loop through the roadmap, focusing on Steps 2–9.

The task

Review PRs for: security issues, performance regressions, API compatibility breaks, and test coverage gaps. Post inline comments. Approve or request changes.

Loop state (Step 2)

@dataclass(frozen=True)  # Must match the frozen base class, or Python raises TypeError
class CodeReviewState(LoopState):
    pr_metadata: PRMetadata
    file_analyses: Dict[str, FileAnalysis]      # Per-file results
    cross_file_findings: List[CrossFileFinding] # Patterns spanning files
    review_comments: List[ReviewComment]        # Accumulated output
    files_remaining: List[str]                  # Work queue

Invariants for the convergence contract (Step 3)

  1. files_remaining strictly decreases each iteration
  2. No file analyzed twice (idempotency)
  3. Total comments < 50 (prevent spam)
  4. Critical findings block approval
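The four invariants above translate directly into predicates over two consecutive snapshots. This is a sketch: `ReviewSnapshot` is a flattened, hypothetical view of the review state, and the invariant names are mine.

```python
from dataclasses import dataclass
from typing import List

MAX_COMMENTS = 50


@dataclass(frozen=True)
class ReviewSnapshot:
    files_remaining: List[str]
    files_analyzed: List[str]
    comment_count: int
    critical_findings: int
    approved: bool


def violated_invariants(prev: ReviewSnapshot, curr: ReviewSnapshot) -> List[str]:
    """Return the names of invariants broken by the step from prev to curr."""
    violations: List[str] = []
    if len(curr.files_remaining) >= len(prev.files_remaining):
        violations.append("files_remaining_strictly_decreases")
    if len(set(curr.files_analyzed)) != len(curr.files_analyzed):
        violations.append("no_file_analyzed_twice")
    if curr.comment_count >= MAX_COMMENTS:
        violations.append("comment_budget")
    if curr.critical_findings > 0 and curr.approved:
        violations.append("critical_findings_block_approval")
    return violations


before = ReviewSnapshot(["a.py", "b.py"], [], 0, 0, False)
after_ok = ReviewSnapshot(["b.py"], ["a.py"], 3, 0, False)
after_bad = ReviewSnapshot(["b.py", "c.py"], ["a.py", "a.py"], 50, 1, True)
```

The first invariant doubles as the termination argument: a queue that strictly shrinks every iteration must reach empty.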

Topology (Step 4)

Hierarchical: Planner (prioritizes files) → Analyzers (parallel, per-file) → Aggregator (synthesizes, decides approval).

Observation function (Step 5)

Extracts: AST patterns, dependency changes, test diffs, prior review comments on same files. Prunes to 3k tokens per analyzer.

Decision prompt (Step 6)

Analyzer prompt receives: file diff + relevant context + security rules + performance patterns. Outputs: {"findings": [...], "confidence": 0.87, "needs_human": false}.

Action executor (Step 7)

Tools: get_file_diff, search_codebase, post_review_comment. All idempotent via (pr_number, file_path, comment_line) keys.

Invariant enforcement (Step 8)

After each analyzer: check comment budget, check duplicate findings, verify confidence > 0.6 for auto-posted comments.

Long-term memory (Step 9)

Stores: repo-specific patterns (e.g., "this team uses custom auth middleware"), false positive history, style guide embeddings.

Results after 3 months in production

| Metric | Before (prompt chain) | After (loop) |
|---|---|---|
| False positive rate | 34% | 8% |
| Missed critical issues | 12% | 2% |
| Avg iterations | N/A (single pass) | 4.2 |
| Human escalation rate | 28% | 6% |
| Cost per PR | $0.12 | $0.31 |

The cost increased. The quality converged. That's the trade. Loops aren't cheaper — they're reliable.


Use cases where loops win

| Domain | Why loops > chains |
|---|---|
| Incident response | Unknown steps, tool failures, need to pivot mid-stream |
| Complex code generation | Compile → test → fix cycles, cross-file consistency |
| Research synthesis | Iterative deepening, contradiction resolution, citation verification |
| Customer support resolution | Multi-turn dialogue, policy lookup, action execution, confirmation |
| Data pipeline debugging | Hypothesis → query → observe → refine, schema drift handling |
| Compliance auditing | Evidence gathering, rule interpretation, gap tracking, remediation |

Common thread: The path to the answer isn't known upfront. The system must discover it through interaction.


Pros and cons

Pros

  • Convergence guarantees — Invariants + termination predicates = bounded behavior
  • Observability — Every iteration is a checkpoint. Time-travel debugging is real.
  • Self-healing — Remediation actions let loops recover from common failures
  • Learning — Long-term memory compounds value across runs
  • Human-AI collaboration — Clean escalation with state patching, not context dumping

Cons

  • Latency — 5-20 iterations × 2-8s each = 10-160s. Not for user-facing sync paths.
  • Cost — 3-10x prompt chains. Budget governance is mandatory.
  • Complexity — State schema, invariants, tracing, evaluation harness. Real engineering investment.
  • Debugging difficulty — Emergent behavior in multi-agent loops can surprise even the designer.
  • Model dependency — Convergence quality tracks model reasoning capability. GPT-4o loops converge where 3.5-turbo loops stall.

The mindset shift

Prompt engineering asks: "How do I get the model to give me the right answer?"

Loop engineering asks: "How do I build a system that converges on the right answer even when the model gives me the wrong one?"

The second question changes everything. You stop optimizing prompts. You start designing state machines with LLMs as transition functions. You instrument. You invariant. You evaluate convergence, not accuracy.

That night at 2:47 AM, the refund loop had no invariants. No termination predicate. No long-term memory of the actual return policy. It was a prompt chain in a trench coat.

The replacement loop? 14 steps. 4 invariants. A termination predicate that catches "model says done but policy not checked." A long-term memory that stores the actual policy document. An escalation packet that shows the human exactly where the model drifted.

It hasn't hallucinated a return window in 14 months.


FAQ

What's the minimum viable loop?
State schema + observation function + decision prompt + action executor + termination predicate + one invariant. Skip any component and you have a chain, not a loop.

Can I use loops with smaller/cheaper models?
Yes, but convergence takes more iterations. 3.5-turbo typically needs 2-3x the iterations of 4o for the same task. Cost often evens out. Latency compounds.

How do I prevent infinite loops?
Hard iteration cap (invariant) + progress invariant (must advance every 3 iterations) + stall detection (same subgoal 3x) + budget invariant. Four layers.

What's the difference between a loop and an agent?
Marketing. "Agent" implies autonomy. "Loop" implies engineering discipline. I use "loop" because it forces you to think in control theory terms.

Do I need a vector database for long-term memory?
For semantic retrieval: yes. For structured facts (decisions, prefs, schemas): a KV store is faster and cheaper. Use both.

How do I test loops locally?
Replay harness: feed recorded tool results into the loop, verify it makes same decisions. Synthetic test cases: generate 100 variations of your task, measure convergence rate. CI gate: convergence rate > 95% on test set.
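A replay harness can be very small. In this sketch, `decide` is any function from the tool-result history to a decision (the model call in production, a deterministic stub here), and the recorded trace is invented for the example.

```python
from typing import Callable, Dict, List

Decision = Dict[str, object]
ToolResult = Dict[str, object]


def replay(
    decide: Callable[[List[ToolResult]], Decision],
    recorded_results: List[ToolResult],
    recorded_decisions: List[Decision],
) -> List[int]:
    """Return the iteration indexes where the decision diverges from the recording."""
    divergences: List[int] = []
    history: List[ToolResult] = []
    for i, expected in enumerate(recorded_decisions):
        if decide(list(history)) != expected:
            divergences.append(i)
        # Feed the recorded result, not a live tool call, into the next turn.
        if i < len(recorded_results):
            history.append(recorded_results[i])
    return divergences


def stub_decide(history: List[ToolResult]) -> Decision:
    if not history:
        return {"tool": "get_file_diff"}
    if "diff" in history[-1]:
        return {"tool": "post_review_comment"}
    return {"tool": "stop"}


recorded_decisions = [
    {"tool": "get_file_diff"},
    {"tool": "post_review_comment"},
    {"tool": "stop"},
]
recorded_results = [{"diff": "+ added line"}, {"posted": True}]

assert replay(stub_decide, recorded_results, recorded_decisions) == []
```

With a real model in `decide`, exact equality is often too strict; comparing only the chosen tool is a common relaxation.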

What about multi-agent loops — are they worth it?
For subjective quality (writing, design review, architecture critique): yes. Proposer-critic-judge reduces hallucination 40-60% in our benchmarks. For deterministic tasks: single-agent reflexive is simpler and faster.

How do I handle schema migrations for loop state?
Version every state schema. Migration functions are pure: v1_state → v2_state. Run migrations at loop start. Keep last 3 versions runnable. Test migrations in CI.
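A sketch of the migration chain, operating on plain dicts as loaded from JSON. The specific schema changes (renaming `memory`, adding `long_term_memory`) are invented for illustration.

```python
from typing import Any, Callable, Dict

StateDict = Dict[str, Any]

CURRENT_VERSION = 3


def v1_to_v2(state: StateDict) -> StateDict:
    # Hypothetical change: v2 renames "memory" to "working_memory".
    migrated = {k: v for k, v in state.items() if k != "memory"}
    migrated["working_memory"] = state["memory"]
    migrated["version"] = 2
    return migrated


def v2_to_v3(state: StateDict) -> StateDict:
    # Hypothetical change: v3 adds an empty long-term memory.
    return {**state, "long_term_memory": [], "version": 3}


MIGRATIONS: Dict[int, Callable[[StateDict], StateDict]] = {1: v1_to_v2, 2: v2_to_v3}


def migrate(state: StateDict) -> StateDict:
    """Apply migrations one version at a time until the state is current."""
    while state["version"] < CURRENT_VERSION:
        state = MIGRATIONS[state["version"]](state)
    return state


old = {"version": 1, "iteration": 4, "memory": ["fact"]}
new = migrate(old)
```

Each function returns a new dict instead of mutating its input, so the same stored checkpoint can be migrated repeatedly in tests.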

Can loops call other loops?
Yes. Hierarchical loops. But register them in a loop registry with explicit contracts. Prevent circular calls. Monitor cascade latency.

What's the biggest mistake teams make?
Treating invariants as optional. "We'll add them later." Later never comes. The first production incident adds them for you — at 3 AM.

