Anindya Obi

Stopping Conditions That Actually Stop Multi-Agent Loops

Planner did its job. Worker implemented. Validator said "almost" and asked for one more tweak.

Then it happened again. And again.

No crashes. No red logs. Just… a loop.

The worst part? Each round got more confident and less grounded: the context kept growing, the supposedly fixed scope kept drifting, and everyone kept pretending the next attempt would be the clean one.

That’s when I realized: the system didn’t need a smarter model.

It needed stopping conditions.


The uncomfortable truth (why this fails in production)

Most multi-agent failures aren’t “model quality” problems.

They’re boundary failures:

  • Agents don’t know when to stop.
  • “Retry” becomes a feature instead of an exception.
  • The system keeps “making progress” by adding words, not truth.

And loops have real costs:

  • Randomness increases with every turn (more surface area to hallucinate).
  • Retries hide missing inputs (we keep iterating instead of asking).
  • Context bloat makes outputs worse, not better.
  • Trust dies when the system can’t finish decisively.

Broken systems deserve blame. People don’t.


Definitions (the core concept in five crisp parts)

1) Stop condition

A concrete rule that ends an agent’s work right now (success, escalate, or refuse).

2) Loop budget

A hard cap on attempts (per agent + per workflow). Past that: stop and escalate.

3) Missing-info gate

If required inputs are missing, you don’t “try harder.” You ask for the missing fields.

4) Evidence threshold

No approvals without proof. “Looks good” is not evidence.

5) Progress test

If the current attempt isn’t meaningfully different from the last one, you stop.
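To make those five rules concrete, here's a minimal sketch in Python of how they can collapse into a single decision before any retry is allowed. The function and field names are illustrative, not from any specific framework:

def stop_decision(attempt: int,
                  budget_remaining: int,
                  missing_inputs: list[str],
                  evidence: list[dict],
                  delta_summary: str) -> str:
    """Return the status an agent should emit instead of silently looping."""
    if missing_inputs:                      # missing-info gate: ask, don't try harder
        return "NEEDS_INPUT"
    if budget_remaining <= 0:               # loop budget: hard cap on attempts
        return "ESCALATE"
    if attempt >= 2 and delta_summary.strip().upper() in {"", "N/A"}:
        return "ESCALATE"                   # progress test: no meaningful delta, stop
    if not evidence:                        # evidence threshold: no proof, no approval
        return "NEEDS_INPUT"
    return "DONE"                           # stop condition met: finish decisively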


Drop-in standard (copy/paste prompts + JSON schema)

1) Minimal “handoff envelope” JSON (every agent must emit this)

{
  "agent": "planner|worker|validator",
  "status": "DONE|NEEDS_INPUT|RETRY|ESCALATE|REFUSE",
  "stop_reason": "string",
  "attempt": 1,
  "loop_budget_remaining": 2,
  "delta_summary": "what changed vs last attempt (or 'N/A')",
  "missing_inputs": ["string"],
  "evidence": [{"type": "string", "ref": "string"}],
  "output": {}
}


Hard rule: If status != DONE, you MUST set stop_reason and either missing_inputs or a concrete next_action.
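One way an orchestrator might enforce that rule before accepting any handoff, assuming envelopes arrive as parsed dicts. Since next_action isn't a top-level field in the schema above, this sketch assumes it lives inside output:

VALID_STATUSES = {"DONE", "NEEDS_INPUT", "RETRY", "ESCALATE", "REFUSE"}

def check_envelope(env: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the envelope is accepted."""
    errors = []
    if env.get("status") not in VALID_STATUSES:
        errors.append(f"invalid status: {env.get('status')!r}")
    if env.get("status") != "DONE":
        if not env.get("stop_reason"):
            errors.append("non-DONE status requires stop_reason")
        # next_action is assumed to live inside output (it is not in the top-level schema)
        if not env.get("missing_inputs") and not env.get("output", {}).get("next_action"):
            errors.append("non-DONE status requires missing_inputs or a concrete next_action")
    if env.get("attempt", 0) >= 2 and not env.get("delta_summary"):
        errors.append("attempt >= 2 requires delta_summary")
    return errors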

2) Planner system instructions (stop conditions included)

You are the PLANNER. Convert the user request into an executable plan.

NON-NEGOTIABLE OUTPUT:
Return ONLY valid JSON using the Handoff Envelope.

STOP CONDITIONS:
1) If any required input is missing (scope, target files, API contract, constraints) -> status=NEEDS_INPUT.
2) If attempt > 1 and the plan is materially the same -> status=ESCALATE (stop looping).
3) If loop_budget_remaining == 0 -> status=ESCALATE.
4) If request is out of scope or unsafe -> status=REFUSE.

PROGRESS TEST:
On attempt >= 2, you MUST include delta_summary describing what changed vs last plan.
If you cannot name a meaningful change, STOP with ESCALATE.

EVIDENCE:
Cite evidence items for critical decisions (e.g., "from user spec", "from schema", "from file tree").
If you have no evidence for a decision, mark it as an assumption and request confirmation.

OUTPUT.output must include:
- tasks (array)
- dependencies (array)
- acceptance_criteria (array)


3) Worker system instructions (no “fix forever” loops)

You are the WORKER. Implement exactly what the plan requests.

NON-NEGOTIABLE OUTPUT:
Return ONLY valid JSON using the Handoff Envelope.

STOP CONDITIONS:
1) If required inputs/files are missing -> status=NEEDS_INPUT (list missing_inputs).
2) If you cannot implement without guessing -> status=NEEDS_INPUT.
3) If attempt >= 2 and changes are minor/unclear -> status=ESCALATE (stop).
4) If loop_budget_remaining == 0 -> status=ESCALATE.

PROGRESS TEST:
You MUST include delta_summary. If you cannot state a clear change vs last attempt, STOP.

OUTPUT.output must include:
- files_changed (array of paths)
- patch_summary (string)
- implementation_notes (array)


4) Validator system instructions (approve, ask, or stop)

You are the VALIDATOR. Verify the worker output against acceptance criteria.

NON-NEGOTIABLE OUTPUT:
Return ONLY valid JSON using the Handoff Envelope.

STOP CONDITIONS:
1) If acceptance criteria are missing or vague -> status=NEEDS_INPUT.
2) If you cannot verify due to missing evidence -> status=NEEDS_INPUT (request exact evidence).
3) If attempt >= 2 and failures are repeating -> status=ESCALATE with a single decisive explanation.
4) If loop_budget_remaining == 0 -> status=ESCALATE.

EVIDENCE THRESHOLD:
Never approve without evidence references (tests run, diff refs, criteria mapping).
If evidence is absent, DO NOT request "try again"—request specific missing artifacts.

OUTPUT.output must include:
- is_valid (boolean)
- issues (array)
- criteria_coverage (array of {criterion, status, evidence_ref})


We are using these templates in HuTouch

If this resonated, you’re probably already feeling the pain: you can design smart agents all day, but production needs agents that can finish.

This is exactly the kind of reliability standard we’re baking into HuTouch, an automation layer that turns these guardrails into repeatable building blocks (so your clean architecture doesn’t depend on heroic prompt babysitting).

Early access (quick form):


The problem in the wild (four realistic examples)

Example 1 — “Validator asks for one more tweak” loop (bad → fixed)

Bad (what happens)

  • Validator: “Almost there. Please improve error handling.”
  • Worker: adds more try/catch + logging.
  • Validator: “Nice. Now handle edge cases.”
  • Worker: adds more branches.
  • Validator: “Now make it cleaner.”

Why it hurts

  • No finish line.
  • “Better” is subjective.
  • Each pass bloats context and drifts scope.

Fixed (with stop conditions)

  • Validator must map issues to acceptance criteria.
  • If “improve error handling” isn’t tied to a criterion, it becomes NEEDS_INPUT:
    • “Which errors? What behavior? What files?”

Result: fewer retries, faster clarity.
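
Concretely, the validator’s handoff for that exchange might look like this (shown as a Python dict; the values are illustrative):

validator_handoff = {
    "agent": "validator",
    "status": "NEEDS_INPUT",
    "stop_reason": "'improve error handling' is not tied to any acceptance criterion",
    "attempt": 2,
    "loop_budget_remaining": 1,
    "delta_summary": "N/A",
    "missing_inputs": [
        "which error cases must be handled",
        "expected behavior per error case",
        "which files are in scope",
    ],
    "evidence": [],
    "output": {"is_valid": False, "issues": ["error-handling criterion undefined"], "criteria_coverage": []},
}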


Example 2 — Planner keeps re-planning because the worker output is fuzzy

Scenario

  • Planner writes a plan with “implement feature X.”
  • Worker produces something plausible but doesn’t touch the right files.
  • Planner re-writes the plan with more details, repeatedly.

The stop condition you need

On attempt ≥ 2, the planner must run a progress test (see the sketch after this list):

  • If tasks are the same, stop: ESCALATE.
  • Ask for missing grounding inputs instead: file paths, module boundaries, examples.
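
One way to implement that progress test is a plain similarity check between the previous and current task lists; the 0.9 threshold is an arbitrary starting point, not a recommendation:

from difflib import SequenceMatcher

def plan_is_materially_same(prev_tasks: list[str], new_tasks: list[str],
                            threshold: float = 0.9) -> bool:
    """True when the new plan is so close to the old one that re-planning is not progress."""
    prev = "\n".join(t.strip().lower() for t in prev_tasks)
    new = "\n".join(t.strip().lower() for t in new_tasks)
    return SequenceMatcher(None, prev, new).ratio() >= threshold

# If the plans are materially the same, escalate and ask for grounding inputs
# (file paths, module boundaries, examples) instead of re-planning.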

Example 3 — “Retry until it compiles” becomes the system behavior

Bad

  • Worker: “There might be a type error, retrying with fixes…”
  • Validator: “Still failing, try again.”

Why it hurts

  • Retries replace diagnosis.
  • You get random fixes instead of correct fixes.

Fixed

  • Hard rule: if compilation/test evidence is missing → NEEDS_INPUT (request logs).
  • Validator requires evidence refs:
    • “paste the error output” or “attach test run summary.”
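
A sketch of that gate: before any retry is allowed, the orchestrator checks that the artifacts the validator needs actually exist (the evidence type names here are made up for illustration):

REQUIRED_EVIDENCE = {"compiler_log", "test_run"}  # illustrative type names

def retry_allowed(evidence: list[dict]) -> bool:
    """Retries are only allowed when there is something concrete to diagnose."""
    present = {item.get("type") for item in evidence}
    return bool(REQUIRED_EVIDENCE & present)

# No logs, no retry: return NEEDS_INPUT and ask for the error output
# or the test run summary instead of guessing at fixes.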

Example 4 — Silent drift from context bloat

Scenario

  • Each loop adds more commentary, alternative approaches, and old diffs.
  • Eventually the worker implements an older plan or merges conflicting instructions.

Stop condition

If context exceeds a threshold, stop and summarize to a minimal state:

  • “current plan, current diff, remaining criteria”

If you can’t summarize cleanly → ESCALATE (human decision).
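
A sketch of that check, assuming you track context as a single string and can rebuild a minimal state from the latest envelopes; the 20,000-character threshold is arbitrary:

MAX_CONTEXT_CHARS = 20_000  # arbitrary; tune per model and task

def slim_context(context: str, plan: str, diff: str, remaining_criteria: list[str]) -> str:
    """Drop commentary and stale diffs; keep only the minimal state."""
    if len(context) <= MAX_CONTEXT_CHARS:
        return context
    minimal = "\n\n".join([
        "CURRENT PLAN:\n" + plan,
        "CURRENT DIFF:\n" + diff,
        "REMAINING CRITERIA:\n" + "\n".join(f"- {c}" for c in remaining_criteria),
    ])
    if len(minimal) > MAX_CONTEXT_CHARS:
        raise RuntimeError("cannot summarize cleanly -> ESCALATE to a human")
    return minimal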


Now the part nobody wants to admit: this is repetitive (boring guardrails list)

You will re-implement these forever unless you standardize them (one way is a single policy object, sketched after this list):

  • Loop budgets per agent + per workflow
  • “Progress test” (meaningful delta or stop)
  • Missing-input gates (ask, don’t guess)
  • Evidence thresholds (approve only with proof)
  • Scope drift checks (criteria mapping)
  • Context slimming (carry only the minimal state)
  • Retry policies (when allowed, when forbidden)
  • Escalation formatting (one decisive explanation, not more retries)
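
A minimal sketch of that policy object, with made-up defaults:

from dataclasses import dataclass

@dataclass(frozen=True)
class GuardrailPolicy:
    """One place for the boring knobs, reused across every workflow."""
    loop_budget_per_agent: int = 2
    loop_budget_per_workflow: int = 5
    require_delta_from_attempt: int = 2          # progress test kicks in here
    require_evidence_for_approval: bool = True
    max_context_chars: int = 20_000
    retry_allowed_statuses: tuple = ("RETRY",)   # everything else stops or asks
    escalation_format: str = "single decisive explanation"

DEFAULT_POLICY = GuardrailPolicy()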

This work is important. It’s also boring.


The value of automating the boring parts (3 crisp outcomes)

  • Finish lines appear. Agents stop decisively instead of looping politely.
  • Clean architecture in minutes. Less time “fixing prompts,” more time shipping the right structure.
  • Scaling becomes real. The system behaves consistently across users, projects, and codebases—because the guardrails are the product, not tribal knowledge.

Where HuTouch steps in

HuTouch is the automation layer for these reliability standards: it generates and enforces these contracts, so you get clean, guardrailed prompts in minutes.

Early product sneak peek demo


Quick checklist (printable)

Planner

  • [ ] Do I have all required inputs? If no → NEEDS_INPUT
  • [ ] Is attempt ≥ 2? If yes, did the plan meaningfully change?
  • [ ] Loop budget remaining > 0? If no → ESCALATE
  • [ ] Did I mark assumptions explicitly?

Worker

  • [ ] Am I guessing file paths / APIs / constraints? If yes → NEEDS_INPUT
  • [ ] Did I change only what the plan asked?
  • [ ] Can I name the delta vs last attempt? If not → ESCALATE
  • [ ] Did I report files changed + patch summary?

Validator

  • [ ] Are acceptance criteria clear? If no → NEEDS_INPUT
  • [ ] Do I have evidence for approval? If no → request it (don’t loop)
  • [ ] Are issues repeating on attempt ≥ 2? If yes → ESCALATE
  • [ ] Did I map issues to criteria coverage?
