The production incident report landed in my inbox at 2:47 AM on a Tuesday.
A customer's refund request had bounced between three different LLM calls — classification, policy lookup, response generation — and somehow hallucinated a 90-day return window that didn't exist. The prompt chain looked clean on paper. Each step worked in isolation. But the loop connecting them had no memory, no guardrails, and no way to course-correct when the classifier drifted.
I'd written the prompts. I'd stitched the chain. I called it an "agentic workflow." It was really just a fragile pipeline wearing a trench coat.
That night forced a distinction I now teach every engineer I work with: Prompting is asking a model to think. Loop engineering is building a system that thinks reliably.
What is loop engineering?
Loop engineering is the discipline of designing, instrumenting, and operating recursive LLM execution cycles that maintain state, enforce invariants, and converge on correct outcomes across unbounded interaction horizons — unlike prompt chains, which execute once and hope for the best.
Not a prompt chain that calls an LLM three times in sequence. Not a RAG pipeline with a reranker. A loop: an LLM that observes, decides, acts, observes the result, and decides again — with persistent context, explicit termination conditions, and observable intermediate states at every iteration.
The boundary is sharp. A prompt chain is a DAG. A loop is a state machine with an LLM as the transition function. The chain asks "what's the answer?" The loop asks "what's the next action given everything that happened so far?" That difference — statefulness with feedback — is why loops handle ambiguity, tool failure, and multi-step reasoning where chains collapse.
The flywheel: observe → decide → act → observe. Each turn tightens the context. The loop converges or it hits a guardrail. That convergence guarantee is the product.
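The cycle above can be sketched in a few lines. This is a minimal illustration, not a production runner: `State`, `run_loop`, and the stubbed `decide` and `act` callables are hypothetical names, and the lambda standing in for the model would be a real LLM call in practice.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class State:
    iteration: int = 0
    observations: Tuple[str, ...] = ()
    done_reason: Optional[str] = None


def run_loop(
    decide: Callable[[State], str],
    act: Callable[[str], str],
    is_goal: Callable[[State], bool],
    max_iterations: int = 10,
) -> State:
    """Observe -> decide -> act until the goal predicate holds or the cap is hit."""
    state = State()
    while state.done_reason is None:
        if is_goal(state):
            state = replace(state, done_reason="success")
        elif state.iteration >= max_iterations:
            state = replace(state, done_reason="max_iterations")
        else:
            action = decide(state)   # the LLM call would go here
            result = act(action)     # tool execution
            state = replace(         # observe: fold the result back into state
                state,
                iteration=state.iteration + 1,
                observations=state.observations + (result,),
            )
    return state


# Stubbed "model" and "tool" so the sketch runs without any API.
final = run_loop(
    decide=lambda s: f"step-{s.iteration}",
    act=lambda a: a.upper(),
    is_goal=lambda s: len(s.observations) >= 3,
)
print(final.done_reason, final.iteration)  # success 3
```

The loop can only exit through one of two named reasons, which is the property a prompt chain lacks.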
Why the prompter-to-loop-designer shift matters
You're already writing prompts. Maybe you've built chains. The market doesn't need more prompt engineers. It needs engineers who can close the loop — who understand that reliability comes from structure, not clever wording.
| Dimension | Prompter | Loop Designer |
|---|---|---|
| Mental model | Single-turn optimization | Multi-turn convergence |
| Failure mode | "Fix the prompt" | "Fix the invariant" |
| State | Implicit, in context window | Explicit, persisted, versioned |
| Debugging | Re-run and pray | Replay with time-travel |
| Scaling | Longer prompts | More loops, tighter guards |
| Ownership | "It works on my machine" | "It converges in production" |
The shift isn't semantic. It's the difference between a demo that impresses investors and a system that survives Black Friday.
The 14-step roadmap
Phase 1: Foundations (Steps 1–4)
Step 1: Map the decision surface
Before writing a single loop, enumerate every decision point in your task. Not steps — decisions. A decision has alternatives, consequences, and a correctness criterion.
For a code-review loop:
- Decision: Is this PR ready to merge? (Alternatives: approve, request changes, block)
- Decision: Which file should I examine next? (Alternatives: 47 changed files)
- Decision: Is this pattern a bug or a style nit? (Alternatives: block, warn, ignore)
Tool: Draw a decision tree. If it has more than 7 leaf nodes, you need a loop. If it has cycles, you definitely need a loop.
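The leaf-count and cycle checks can be run mechanically once the decisions are written down as a graph. A rough sketch, assuming each decision is a key mapping to the decisions it can lead to (every node must appear as a key); `count_leaves`, `has_cycle`, and `needs_loop` are illustrative names:

```python
from typing import Dict, List

# Hypothetical decision graph: decision -> decisions it can lead to.
DecisionGraph = Dict[str, List[str]]


def count_leaves(graph: DecisionGraph) -> int:
    """Leaves are decisions with no outgoing edges."""
    return sum(1 for targets in graph.values() if not targets)


def has_cycle(graph: DecisionGraph) -> bool:
    """Depth-first search with three colors; reaching a gray node means a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node: str) -> bool:
        color[node] = GRAY
        for nxt in graph.get(node, []):
            if color.get(nxt, WHITE) == GRAY:
                return True
            if color.get(nxt, WHITE) == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)


def needs_loop(graph: DecisionGraph, max_leaves: int = 7) -> bool:
    return has_cycle(graph) or count_leaves(graph) > max_leaves


# The code-review decisions: classifying a finding can send you back to file selection.
review = {
    "pick_file": ["classify_finding"],
    "classify_finding": ["pick_file", "merge_verdict"],
    "merge_verdict": [],
}
print(needs_loop(review))  # True
```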
Step 2: Define the loop state schema
State is not "the conversation history." State is a typed, versioned data structure that survives context truncation and enables replay.
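```python
from dataclasses import dataclass
from typing import List, Optional

# TaskSpec, WorkingMemory, LongTermMemory, ToolResult, Decision, InvariantSet
# and TerminationReason are domain types defined elsewhere.
@dataclass(frozen=True)
class LoopState:
    loop_id: str                       # Stable ID, read by the tracer in Step 10
    iteration: int
    task_spec: TaskSpec                # Immutable goal
    working_memory: WorkingMemory      # Scratchpad, pruned each turn
    long_term_memory: LongTermMemory   # Facts, decisions, user prefs
    tool_results: List[ToolResult]     # Append-only log
    decisions: List[Decision]          # Read by consolidate() in Step 9
    invariants: InvariantSet           # Must hold at every iteration
    termination_signal: Optional[TerminationReason]
    version: int                       # Schema version for migration
```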
Rule: If you can't serialize it to JSON and replay the loop from iteration 3, it's not state. It's hope.
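One way to hold yourself to that rule is a round-trip test on every checkpoint. The sketch below uses a deliberately tiny stand-in (`MiniState`, `dump_state`, and `load_state` are hypothetical); a real state with nested types would need explicit encoders and decoders.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass(frozen=True)
class MiniState:
    iteration: int
    goal: str
    tool_results: List[str] = field(default_factory=list)
    version: int = 1


def dump_state(state: MiniState) -> str:
    # sort_keys keeps the serialized form stable across runs
    return json.dumps(asdict(state), sort_keys=True)


def load_state(raw: str) -> MiniState:
    return MiniState(**json.loads(raw))


checkpoint = MiniState(iteration=3, goal="review PR", tool_results=["diff ok"])
restored = load_state(dump_state(checkpoint))
assert restored == checkpoint  # the loop can resume from this checkpoint
```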
Step 3: Write the convergence contract
Every loop needs a termination predicate that is provably reachable. Not "when the model says done." A predicate over state:
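```python
def should_terminate(state: LoopState) -> Optional[TerminationReason]:
    # Check success first so a run that finishes on its last allowed
    # iteration is not misreported as exceeding the cap.
    if state.invariants.all_satisfied() and state.working_memory.is_complete():
        return TerminationReason.SUCCESS
    # tool_results is a plain list, so guard against the empty case.
    if state.tool_results and state.tool_results[-1].is_fatal_error():
        return TerminationReason.FATAL_ERROR
    if state.iteration >= MAX_ITERATIONS:
        return TerminationReason.MAX_ITERATIONS_EXCEEDED
    # Stall detection needs at least three iterations of history.
    if state.iteration >= 3 and not state.invariants.progress_made_since(state.iteration - 3):
        return TerminationReason.STALLED
    return None
```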
No predicate → infinite loop → production incident. I've seen both.
Step 4: Choose your loop topology
Three patterns cover 90% of cases. Pick one. Don't hybridize until you have a reason.
| Topology | Structure | Use when |
|---|---|---|
| Single-agent reflexive | One LLM, internal monologue, tool calls | Deterministic tasks, clear specs, low ambiguity |
| Multi-agent debate | Proposer + Critic + Judge (or n-persona ensemble) | Subjective quality, adversarial verification needed |
| Hierarchical decomposition | Planner → Executor(s) → Aggregator | Long horizons, parallelizable subtasks, distinct skills |
My default: Single-agent reflexive with a critic pass at the end. Simpler to debug. Harder to fool yourself.
Phase 2: The loop body (Steps 5–9)
Step 5: Design the observation function
The observation function converts raw tool outputs, environment events, and prior decisions into the working memory for the next turn. It's the loop's eyes.
```python
def observe(state: LoopState, tool_result: ToolResult) -> WorkingMemory:
    # 1. Extract structured facts (not summaries)
    facts = extract_facts(tool_result)
    # 2. Score relevance to current subgoal
    scored = score_relevance(facts, state.working_memory.current_subgoal)
    # 3. Prune aggressively: the context window is a budget
    pruned = prune_to_budget(scored, budget=TOKEN_BUDGET * 0.3)
    # 4. Flag contradictions with long-term memory
    contradictions = detect_contradictions(pruned, state.long_term_memory)
    return WorkingMemory(
        facts=pruned,
        contradictions=contradictions,
        current_subgoal=state.working_memory.current_subgoal,
        token_count=count_tokens(pruned),
    )
```
Lesson learned: Summarization loses signal. Extraction preserves it. Store facts, not narratives.
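The pruning step is where most of the budget discipline lives. Here is one possible shape for it, assuming facts arrive already scored and token-counted; the `ScoredFact` type and this greedy `prune_to_budget` signature are illustrative, not the only reasonable design.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScoredFact:
    text: str
    relevance: float  # 0.0-1.0, higher is more relevant
    tokens: int       # precomputed token count


def prune_to_budget(facts: List[ScoredFact], budget: int) -> List[ScoredFact]:
    """Keep the most relevant facts that fit; skip any fact that would overflow."""
    kept: List[ScoredFact] = []
    used = 0
    for fact in sorted(facts, key=lambda f: f.relevance, reverse=True):
        if used + fact.tokens <= budget:
            kept.append(fact)
            used += fact.tokens
    return kept


facts = [
    ScoredFact("order placed 45 days ago", 0.9, 40),
    ScoredFact("full shipping history", 0.8, 70),
    ScoredFact("item category: electronics", 0.5, 20),
]
kept = prune_to_budget(facts, budget=60)
print([f.text for f in kept])  # ['order placed 45 days ago', 'item category: electronics']
```

Because whole facts are kept or dropped, nothing is paraphrased on the way into working memory.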
Step 6: Engineer the decision prompt
The decision prompt is the loop's brain. It receives: task spec + working memory + long-term memory + invariants. It outputs: next action + updated subgoal + confidence.
Template structure that works:
```
<task_spec>{{task_spec}}</task_spec>
<invariants>{{invariants.render()}}</invariants>
<working_memory>{{working_memory.render()}}</working_memory>
<long_term_memory>{{long_term_memory.render(relevance_threshold=0.7)}}</long_term_memory>
<instruction>
You are an iteration in a convergent loop. Your job: propose the single highest-leverage next action.
Output JSON only:
{
  "action": {"tool": "...", "args": {...}},
  "subgoal": "specific, measurable, time-bounded",
  "confidence": 0.0-1.0,
  "reasoning": "one paragraph, cites specific facts from memory"
}
</instruction>
```
Critical: The prompt must reference invariants by name. The model needs to know what it's not allowed to violate.
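"Output JSON only" is a request, not a guarantee, so the loop should validate the reply before acting on it. A minimal sketch of that check, assuming the four-key schema from the template; `parse_decision`, `InvalidDecision`, and the `allowed_tools` parameter are hypothetical names.

```python
import json
from typing import Any, Dict, Set

REQUIRED_KEYS = {"action", "subgoal", "confidence", "reasoning"}


class InvalidDecision(ValueError):
    """Raised when the model's reply does not match the decision schema."""


def parse_decision(raw: str, allowed_tools: Set[str]) -> Dict[str, Any]:
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidDecision(f"not valid JSON: {exc}") from exc
    if not isinstance(decision, dict):
        raise InvalidDecision("top-level value must be an object")
    missing = REQUIRED_KEYS - set(decision)
    if missing:
        raise InvalidDecision(f"missing keys: {sorted(missing)}")
    action = decision["action"]
    tool = action.get("tool") if isinstance(action, dict) else None
    if not isinstance(tool, str) or tool not in allowed_tools:
        raise InvalidDecision("action.tool is not an allowed tool")
    confidence = decision["confidence"]
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise InvalidDecision("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise InvalidDecision("confidence must be between 0.0 and 1.0")
    return decision
```

A rejected decision can be fed back to the model as an observation, which keeps the failure inside the loop instead of crashing it.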
Step 7: Build the action executor with idempotency
Every tool call in a loop must be idempotent or compensatable. No exceptions.
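```python
import hashlib
import json

class IdempotentToolExecutor:
    def __init__(self, cache):
        self.cache = cache  # Any key-value store with get/set

    def execute(self, action: Action, state: LoopState) -> ToolResult:
        # Derive a stable key from the loop and the action itself. Built-in hash()
        # fails on dict args and is salted per process, and keying on
        # state.iteration would let the same action re-run on every retry.
        payload = json.dumps(
            {"loop": state.loop_id, "tool": action.tool, "args": action.args},
            sort_keys=True,
        )
        idempotency_key = hashlib.sha256(payload.encode()).hexdigest()
        # Check if we've already executed this exact action
        if (cached := self.cache.get(idempotency_key)) is not None:
            return ToolResult.from_cache(cached)
        # Execute with timeout and retry policy
        result = self._execute_with_policy(action)
        # Store for replay
        self.cache.set(idempotency_key, result)
        return result
```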
War story: A loop retried a non-idempotent "send email" tool 47 times during a model hallucination spiral. Customer got 47 emails. We now wrap every external call.
Step 8: Implement invariant enforcement
Invariants are the loop's guardrails. They run after observation, before the next decision. Hard stops.
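```python
import logging

log = logging.getLogger(__name__)

# Remediation name -> function that takes and returns a LoopState.
# summarize_and_prune, escalate_to_human and change_strategy are defined elsewhere.
REMEDIATIONS = {
    "summarize_and_prune": summarize_and_prune,
    "escalate_to_human": escalate_to_human,
    "change_strategy": change_strategy,
}

class InvariantSet:
    def __init__(self):
        self.invariants = [
            Invariant(
                name="budget_not_exceeded",
                check=lambda s: s.working_memory.token_count < TOKEN_BUDGET,
                remediation="summarize_and_prune",
            ),
            Invariant(
                name="no_contradictory_facts",
                check=lambda s: not s.working_memory.contradictions,
                remediation="escalate_to_human",
            ),
            Invariant(
                name="progress_per_three_iterations",
                check=lambda s: s.invariants.progress_made_since(s.iteration - 3),
                remediation="change_strategy",
            ),
        ]

    def enforce(self, state: LoopState) -> LoopState:
        for inv in self.invariants:
            if not inv.check(state):
                # Remediations are stored by name, so look up the callable first.
                state = REMEDIATIONS[inv.remediation](state)
                log.warning("Invariant violated: %s, applied %s", inv.name, inv.remediation)
        return state
```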
Design principle: Every invariant has a remediation, not just a failure. The loop heals itself.
Step 9: Close the long-term memory loop
Working memory is per-iteration. Long-term memory accumulates across loops — and across sessions.
```python
class LongTermMemory:
    def __init__(self, vector_store: VectorStore, kv_store: KVStore):
        self.vector = vector_store  # Semantic facts, embeddings
        self.kv = kv_store          # Structured: user prefs, decisions, schemas

    def consolidate(self, state: LoopState) -> None:
        # Extract durable facts from this loop's trajectory
        facts = extract_durable_facts(state.tool_results)
        for fact in facts:
            self.vector.upsert(embedding=fact.embedding, metadata=fact.meta)
        # Record decisions for future loops
        for decision in state.decisions:
            self.kv.set(f"decision:{decision.id}", decision.json())

    def retrieve(self, query: str, k: int = 5) -> List[Fact]:
        return self.vector.search(query, k=k)
```
Why it matters: A loop that learns from its own history converges faster. A loop that forgets repeats mistakes.
Phase 3: Production hardening (Steps 10–14)
Step 10: Instrument for time-travel debugging
You will need to answer "why did the loop do that at iteration 7?" — three months from now.
```python
class LoopTracer:
    def __init__(self, event_store: EventStore):
        self.store = event_store

    def trace_iteration(self, state: LoopState, decision: Decision, result: ToolResult):
        self.store.append(LoopEvent(
            loop_id=state.loop_id,
            iteration=state.iteration,
            timestamp=utc_now(),
            state_snapshot=state.json(),           # Full state
            decision=decision.json(),              # What the model chose
            prompt=decision.prompt_sent,           # Exact prompt
            response=decision.raw_response,        # Exact response
            tool_result=result.json(),             # What happened
            invariants=state.invariants.status(),  # Pass/fail per invariant
        ))
```
Query pattern: SELECT * FROM events WHERE loop_id = ? ORDER BY iteration. Replay any iteration in a notebook. Compare branches. This is how you actually improve loops.
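To make the query pattern concrete, here is a small sketch backed by an in-memory SQLite database from the standard library. The three-column `events` table and the helper names are assumptions for illustration; a real event store would carry the full `LoopEvent` payload.

```python
import json
import sqlite3
from typing import Any, Dict, List

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (loop_id TEXT, iteration INTEGER, state_snapshot TEXT)"
)


def append_event(loop_id: str, iteration: int, state: Dict[str, Any]) -> None:
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?)",
        (loop_id, iteration, json.dumps(state)),
    )


def load_trajectory(loop_id: str) -> List[Dict[str, Any]]:
    """All snapshots for one loop, in iteration order."""
    rows = conn.execute(
        "SELECT state_snapshot FROM events WHERE loop_id = ? ORDER BY iteration",
        (loop_id,),
    )
    return [json.loads(snapshot) for (snapshot,) in rows]


def state_at(loop_id: str, iteration: int) -> Dict[str, Any]:
    """The snapshot to resume from when replaying a single iteration."""
    row = conn.execute(
        "SELECT state_snapshot FROM events WHERE loop_id = ? AND iteration = ?",
        (loop_id, iteration),
    ).fetchone()
    if row is None:
        raise KeyError(f"no event for {loop_id} at iteration {iteration}")
    return json.loads(row[0])


# Events may arrive out of order; the query restores the sequence.
append_event("loop-a", 2, {"subgoal": "check policy"})
append_event("loop-a", 1, {"subgoal": "classify request"})
print([s["subgoal"] for s in load_trajectory("loop-a")])
# ['classify request', 'check policy']
```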
Step 11: Build the evaluation harness
Unit tests for loops don't test prompts. They test convergence properties.
```python
class LoopEvaluator:
    def evaluate_convergence(self, loop: Loop, test_cases: List[TestCase]) -> EvaluationReport:
        results = []
        for tc in test_cases:
            trace = loop.run(tc.input, max_iterations=20)
            results.append(ConvergenceResult(
                test_case=tc.id,
                terminated=trace.terminated,
                termination_reason=trace.termination_reason,
                iterations=trace.iterations,
                final_state_valid=tc.validator(trace.final_state),
                invariant_violations=trace.invariant_violations,
                token_cost=trace.total_tokens,
                latency_p95=trace.latency_p95,
            ))
        return EvaluationReport(results)
```
Metrics that matter: Convergence rate (not accuracy — convergence), iterations to convergence, invariant violation rate, cost per convergence. Accuracy is a consequence of convergence.
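The four metrics are simple aggregates over run results. The sketch below is one way to compute them, with a simplified `RunResult` record and a `summarize` helper that are both hypothetical. It charges the cost of failed runs to the successful ones, which is an assumption about how "cost per convergence" should be read.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RunResult:
    converged: bool
    iterations: int
    invariant_violations: int
    cost_usd: float


def summarize(results: List[RunResult]) -> Dict[str, float]:
    if not results:
        raise ValueError("no results to summarize")
    converged = [r for r in results if r.converged]
    total_cost = sum(r.cost_usd for r in results)
    total_iterations = sum(r.iterations for r in results)
    return {
        "convergence_rate": len(converged) / len(results),
        "mean_iterations_to_convergence": (
            sum(r.iterations for r in converged) / len(converged)
            if converged else float("nan")
        ),
        "violations_per_iteration": (
            sum(r.invariant_violations for r in results) / total_iterations
            if total_iterations else 0.0
        ),
        # Failed runs still cost money, so they are included in the numerator.
        "cost_per_convergence": (
            total_cost / len(converged) if converged else float("inf")
        ),
    }


report = summarize([
    RunResult(True, 4, 0, 0.30),
    RunResult(True, 6, 1, 0.50),
    RunResult(False, 20, 3, 1.20),
    RunResult(True, 2, 0, 0.20),
])
print(report["convergence_rate"])  # 0.75
```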
Step 12: Design the human-in-the-loop escalation
Loops stall. Invariants conflict. Models hallucinate. You need a clean handoff protocol, not a Slack alert.
```python
@dataclass
class EscalationPacket:
    loop_id: str
    stuck_at_iteration: int
    state_snapshot: LoopState
    failing_invariants: List[InvariantViolation]
    model_confidence_trajectory: List[float]
    suggested_interventions: List[Intervention]  # Pre-computed by a "meta-loop"
    context_summary: str                         # Human-readable, < 500 tokens
```
UI requirement: The human sees the loop's reasoning trail, not just the current state. They can patch state (add a fact, override a subgoal, relax an invariant) and resume. The loop continues from the patched state.
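Patch-and-resume is easier to reason about when state is immutable and only a whitelist of fields can be touched. A minimal sketch under those assumptions; `PausedState`, `PATCHABLE`, and `apply_patch` are illustrative names, not part of any framework.

```python
from dataclasses import dataclass, replace
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class PausedState:
    iteration: int
    subgoal: str
    facts: Tuple[str, ...]
    relaxed_invariants: Tuple[str, ...] = ()


# Fields a human reviewer may change; everything else is read-only.
PATCHABLE = {"subgoal", "facts", "relaxed_invariants"}


def apply_patch(state: PausedState, patch: Dict[str, Any]) -> PausedState:
    illegal = set(patch) - PATCHABLE
    if illegal:
        raise ValueError(f"fields not patchable: {sorted(illegal)}")
    # replace() returns a new object, so the escalated snapshot stays intact
    return replace(state, **patch)


stuck = PausedState(7, "find return policy", ("order is 45 days old",))
patched = apply_patch(stuck, {"facts": stuck.facts + ("return window is 30 days",)})
print(patched.iteration, len(patched.facts))  # 7 2
```

Keeping the original snapshot untouched means the trace still shows what the loop believed before the human stepped in.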
Step 13: Implement cost and latency budgets
Loops are expensive. A 15-iteration loop with 8k tokens of context per turn is about 120k input tokens, or ~$2-5 per run at 2024 prices. You need budget-aware scheduling.
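```python
from datetime import date

class BudgetScheduler:
    def __init__(self, daily_budget_usd: float, max_latency_s: float):
        self.daily_budget = daily_budget_usd
        self.max_latency = max_latency_s
        self.spent_today = 0.0
        self._day = date.today()

    def _roll_day(self) -> None:
        # Reset the counter when the calendar day changes
        if date.today() != self._day:
            self._day = date.today()
            self.spent_today = 0.0

    def can_launch(self, estimated_cost: float, estimated_latency: float) -> bool:
        self._roll_day()
        return (self.spent_today + estimated_cost <= self.daily_budget and
                estimated_latency <= self.max_latency)

    def record_run(self, actual_cost: float) -> None:
        self._roll_day()
        self.spent_today += actual_cost
```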
Operational rule: If a loop type exceeds budget 3 days in a row, you don't get more budget — you get a redesign ticket.
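Both the per-run estimate and the three-day rule reduce to a few lines of arithmetic. In this sketch the function names are hypothetical and the per-1k-token prices are placeholder parameters, not quotes from any provider.

```python
from typing import Sequence


def estimate_run_cost(
    iterations: int,
    input_tokens_per_turn: int,
    output_tokens_per_turn: int,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
) -> float:
    """Rough cost of one loop run, assuming every turn uses the same token counts."""
    input_cost = iterations * input_tokens_per_turn / 1000 * usd_per_1k_input
    output_cost = iterations * output_tokens_per_turn / 1000 * usd_per_1k_output
    return input_cost + output_cost


def needs_redesign(
    daily_spend_usd: Sequence[float], daily_budget_usd: float, streak: int = 3
) -> bool:
    """True if spend exceeded the budget on `streak` consecutive days."""
    run = 0
    for spend in daily_spend_usd:
        run = run + 1 if spend > daily_budget_usd else 0
        if run >= streak:
            return True
    return False


# Placeholder prices: 0.03 USD per 1k input tokens, 0.06 USD per 1k output tokens.
cost = estimate_run_cost(15, 8000, 500, usd_per_1k_input=0.03, usd_per_1k_output=0.06)
print(round(cost, 2))  # 4.05
print(needs_redesign([90, 120, 130, 110], daily_budget_usd=100))  # True
```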
Step 14: Ship the loop lifecycle
Loop engineering doesn't end at deploy. You need:
| Phase | Artifact | Cadence |
|---|---|---|
| Development | Local replay harness, synthetic test cases | Per PR |
| Staging | Shadow mode (loop runs, human decides, compare) | 2 weeks |
| Canary | 1% traffic, full tracing, auto-rollback on invariant spike | 1 week |
| Production | Daily convergence report, weekly invariant audit, monthly schema migration | Ongoing |
| Retirement | Migration plan for dependent loops, data export | When replaced |
The meta-loop: Treat your loop fleet as a system. A "loop registry" tracks versions, dependencies, convergence SLAs, and ownership. Loops call loops. The registry prevents circular dependencies and cascade failures.
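Circular-dependency prevention in a registry can lean on the standard library's `graphlib`. A minimal sketch, assuming loops are identified by name and each one declares the loops it calls; the `LoopRegistry` class is hypothetical and covers only the dependency check, not versions or SLAs.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


class LoopRegistry:
    def __init__(self) -> None:
        self._calls: Dict[str, Set[str]] = {}

    def register(self, loop: str, calls: Set[str]) -> None:
        """Register `loop`; reject it if its calls would create a cycle."""
        candidate = {**self._calls, loop: set(calls)}
        try:
            # prepare() raises CycleError when the graph contains a cycle
            TopologicalSorter(candidate).prepare()
        except CycleError as exc:
            raise ValueError(f"circular dependency: {exc.args[1]}") from exc
        self._calls = candidate

    def startup_order(self) -> List[str]:
        """Callees before callers."""
        return list(TopologicalSorter(self._calls).static_order())


registry = LoopRegistry()
registry.register("refund_triage", {"policy_lookup"})
registry.register("policy_lookup", set())
print(registry.startup_order())  # ['policy_lookup', 'refund_triage']
```

The check runs at registration time, so a circular call is rejected before it ever reaches production traffic.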
Practical example: Automated code review loop
Let's walk a real loop through the 14 steps.
The task
Review PRs for: security issues, performance regressions, API compatibility breaks, and test coverage gaps. Post inline comments. Approve or request changes.
Loop state (Step 2)
```python
@dataclass(frozen=True)  # A frozen base class requires frozen subclasses
class CodeReviewState(LoopState):
    pr_metadata: PRMetadata
    file_analyses: Dict[str, FileAnalysis]       # Per-file results
    cross_file_findings: List[CrossFileFinding]  # Patterns spanning files
    review_comments: List[ReviewComment]         # Accumulated output
    files_remaining: List[str]                   # Work queue
```
Invariants (Step 3)
- files_remaining strictly decreases each iteration
- No file analyzed twice (idempotency)
- Total comments < 50 (prevent spam)
- Critical findings block approval
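The four invariants above translate directly into predicates over two consecutive snapshots. This is a simplified sketch: `ReviewSnapshot` and `violated` are hypothetical, and the snapshot flattens the real state into counts.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReviewSnapshot:
    files_remaining: int
    files_analyzed: List[str]
    comment_count: int
    critical_findings: int
    approved: bool


def violated(prev: ReviewSnapshot, cur: ReviewSnapshot, max_comments: int = 50) -> List[str]:
    """Names of the invariants broken by the step from `prev` to `cur`."""
    problems = []
    if cur.files_remaining >= prev.files_remaining:
        problems.append("files_remaining_strictly_decreases")
    if len(set(cur.files_analyzed)) != len(cur.files_analyzed):
        problems.append("no_file_analyzed_twice")
    if cur.comment_count >= max_comments:
        problems.append("comment_budget")
    if cur.critical_findings > 0 and cur.approved:
        problems.append("critical_findings_block_approval")
    return problems


before = ReviewSnapshot(3, ["a.py"], 4, 0, False)
after = ReviewSnapshot(3, ["a.py", "a.py"], 5, 1, True)
print(violated(before, after))
# ['files_remaining_strictly_decreases', 'no_file_analyzed_twice', 'critical_findings_block_approval']
```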
Topology (Step 4)
Hierarchical: Planner (prioritizes files) → Analyzers (parallel, per-file) → Aggregator (synthesizes, decides approval).
Observation function (Step 5)
Extracts: AST patterns, dependency changes, test diffs, prior review comments on same files. Prunes to 3k tokens per analyzer.
Decision prompt (Step 6)
Analyzer prompt receives: file diff + relevant context + security rules + performance patterns. Outputs: {"findings": [...], "confidence": 0.87, "needs_human": false}.
Action executor (Step 7)
Tools: get_file_diff, search_codebase, post_review_comment. All idempotent via (pr_number, file_path, comment_line) keys.
Invariant enforcement (Step 8)
After each analyzer: check comment budget, check duplicate findings, verify confidence > 0.6 for auto-posted comments.
Long-term memory (Step 9)
Stores: repo-specific patterns (e.g., "this team uses custom auth middleware"), false positive history, style guide embeddings.
Results after 3 months in production
| Metric | Before (prompt chain) | After (loop) |
|---|---|---|
| False positive rate | 34% | 8% |
| Missed critical issues | 12% | 2% |
| Avg iterations | N/A (single pass) | 4.2 |
| Human escalation rate | 28% | 6% |
| Cost per PR | $0.12 | $0.31 |
The cost increased. The quality converged. That's the trade. Loops aren't cheaper — they're reliable.
Use cases where loops win
| Domain | Why loops > chains |
|---|---|
| Incident response | Unknown steps, tool failures, need to pivot mid-stream |
| Complex code generation | Compile → test → fix cycles, cross-file consistency |
| Research synthesis | Iterative deepening, contradiction resolution, citation verification |
| Customer support resolution | Multi-turn dialogue, policy lookup, action execution, confirmation |
| Data pipeline debugging | Hypothesis → query → observe → refine, schema drift handling |
| Compliance auditing | Evidence gathering, rule interpretation, gap tracking, remediation |
Common thread: The path to the answer isn't known upfront. The system must discover it through interaction.
Pros and cons
Pros
- Convergence guarantees — Invariants + termination predicates = bounded behavior
- Observability — Every iteration is a checkpoint. Time-travel debugging is real.
- Self-healing — Remediation actions let loops recover from common failures
- Learning — Long-term memory compounds value across runs
- Human-AI collaboration — Clean escalation with state patching, not context dumping
Cons
- Latency — 5-20 iterations × 2-8s each = 10-160s. Not for user-facing sync paths.
- Cost — 3-10x prompt chains. Budget governance is mandatory.
- Complexity — State schema, invariants, tracing, evaluation harness. Real engineering investment.
- Debugging difficulty — Emergent behavior in multi-agent loops can surprise even the designer.
- Model dependency — Convergence quality tracks model reasoning capability. GPT-4o loops converge where 3.5-turbo loops stall.
The mindset shift
Prompt engineering asks: "How do I get the model to give me the right answer?"
Loop engineering asks: "How do I build a system that converges on the right answer even when the model gives me the wrong one?"
The second question changes everything. You stop optimizing prompts. You start designing state machines with LLMs as transition functions. You instrument. You invariant. You evaluate convergence, not accuracy.
That night at 2:47 AM, the refund loop had no invariants. No termination predicate. No long-term memory of the actual return policy. It was a prompt chain in a trench coat.
The replacement loop? 14 steps. 4 invariants. A termination predicate that catches "model says done but policy not checked." A long-term memory that stores the actual policy document. An escalation packet that shows the human exactly where the model drifted.
It hasn't hallucinated a return window in 14 months.
FAQ
What's the minimum viable loop?
State schema + observation function + decision prompt + action executor + termination predicate + one invariant. Skip any component and you have a chain, not a loop.
Can I use loops with smaller/cheaper models?
Yes, but the iteration count goes up. GPT-3.5 Turbo typically needs 2-3x the iterations of GPT-4o for the same task. Cost often evens out. Latency compounds.
How do I prevent infinite loops?
Hard iteration cap (invariant) + progress invariant (must advance every 3 iterations) + stall detection (same subgoal 3x) + budget invariant. Four layers.
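The four layers fit in a single guard function that runs before every iteration. A rough sketch; `stop_reason`, its parameters, and the default thresholds are illustrative assumptions.

```python
from typing import List, Optional


def stop_reason(
    iteration: int,
    subgoals: List[str],           # subgoal chosen at each iteration so far
    last_progress_iteration: int,  # last iteration that advanced the task
    spent_usd: float,
    *,
    max_iterations: int = 20,
    progress_window: int = 3,
    repeat_limit: int = 3,
    budget_usd: float = 1.0,
) -> Optional[str]:
    """Return the first guard that trips, or None if the loop may continue."""
    if iteration >= max_iterations:
        return "iteration_cap"
    if iteration - last_progress_iteration >= progress_window:
        return "no_progress"
    recent = subgoals[-repeat_limit:]
    if len(recent) == repeat_limit and len(set(recent)) == 1:
        return "stalled_on_subgoal"
    if spent_usd >= budget_usd:
        return "budget_exhausted"
    return None


print(stop_reason(5, ["read diff", "run tests", "run tests", "run tests"], 4, 0.20))
# stalled_on_subgoal
```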
What's the difference between a loop and an agent?
Marketing. "Agent" implies autonomy. "Loop" implies engineering discipline. I use "loop" because it forces you to think in control theory terms.
Do I need a vector database for long-term memory?
For semantic retrieval: yes. For structured facts (decisions, prefs, schemas): a KV store is faster and cheaper. Use both.
How do I test loops locally?
Replay harness: feed recorded tool results into the loop and verify it makes the same decisions. Synthetic test cases: generate 100 variations of your task and measure the convergence rate. CI gate: convergence rate > 95% on the test set.
What about multi-agent loops — are they worth it?
For subjective quality (writing, design review, architecture critique): yes. Proposer-critic-judge reduces hallucination 40-60% in our benchmarks. For deterministic tasks: single-agent reflexive is simpler and faster.
How do I handle schema migrations for loop state?
Version every state schema. Migration functions are pure: v1_state → v2_state. Run migrations at loop start. Keep last 3 versions runnable. Test migrations in CI.
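A chain of pure migration functions can be as small as a dictionary keyed by source version. In this sketch the field changes in `v1_to_v2` and `v2_to_v3` are invented for illustration; only the pattern is the point.

```python
from typing import Any, Callable, Dict

StateDict = Dict[str, Any]


def v1_to_v2(state: StateDict) -> StateDict:
    # Hypothetical change: v2 renames "memory" to "working_memory".
    migrated = {k: v for k, v in state.items() if k != "memory"}
    migrated["working_memory"] = state.get("memory", {})
    migrated["version"] = 2
    return migrated


def v2_to_v3(state: StateDict) -> StateDict:
    # Hypothetical change: v3 adds an explicit termination_signal field.
    return {**state, "termination_signal": None, "version": 3}


MIGRATIONS: Dict[int, Callable[[StateDict], StateDict]] = {1: v1_to_v2, 2: v2_to_v3}
CURRENT_VERSION = 3


def migrate(state: StateDict) -> StateDict:
    """Apply migrations one version at a time; the input dict is never mutated."""
    while state["version"] < CURRENT_VERSION:
        step = MIGRATIONS.get(state["version"])
        if step is None:
            raise ValueError(f"no migration from version {state['version']}")
        state = step(state)
    return state


old = {"version": 1, "iteration": 4, "memory": {"facts": ["x"]}}
print(migrate(old)["version"])  # 3
```

Because each step returns a new dict, the same v1 snapshot can be migrated repeatedly in CI without fixtures drifting.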
Can loops call other loops?
Yes. Hierarchical loops. But register them in a loop registry with explicit contracts. Prevent circular calls. Monitor cascade latency.
What's the biggest mistake teams make?
Treating invariants as optional. "We'll add them later." Later never comes. The first production incident adds them for you — at 3 AM.