DEV Community: Kaustubh Phatak

The Missing Layer in Agent Security

Kaustubh Phatak — Wed, 13 May 2026 06:36:37 +0000

Last month, a customer support agent at a mid-size SaaS company did something interesting. It read a customer’s account data (allowed), formatted it as a CSV (allowed), and emailed it to an external address (allowed). Three tool calls. Three green checkmarks from the per-call policy engine. One data breach.

Every individual action was within policy. The trajectory was exfiltration.

This is the gap I’ve been thinking about for the past year while building security tooling for AI agents. The industry has built two layers of agent security and completely skipped the one in the middle. I built the missing layer. This post explains why it’s needed, how it works, and how you can use it today.

The Two Layers We Have
Layer 1: Pre-deployment analysis. Before you ship an agent, you scan its configuration. How many tools does it have access to? Which ones can write to production? Does it satisfy the “lethal trifecta” (access to private data + exposure to untrusted content + ability to communicate externally)? Tools like agentspec do this. It’s the equivalent of static analysis for agent configs.

Layer 3: Per-call enforcement. A proxy sits between your agent and its tools, evaluating each action against a YAML policy. “Block write_file when path matches ~/.ssh/**.” “Rate limit all tools to 60/minute.” “Ask human approval for anything touching production.” Tools like mcpfw and Cloudflare’s AI Security for Apps do this. It’s the equivalent of a WAF for agent tool calls.

Both layers are necessary. Neither is sufficient.

What They Miss
Per-call enforcement evaluates actions in isolation. It has no memory of what happened three steps ago. It can’t see patterns. It can’t detect that the agent’s overall behavior has drifted from its declared purpose.

Here’s a concrete attack that passes every per-call policy you could write:

Step 1: search_kb(query="customer data export format") ✅ allowed
Step 2: read_account(id="cust_12345") ✅ allowed
Step 3: read_account(id="cust_12346") ✅ allowed
Step 4: read_account(id="cust_12347") ✅ allowed
Step 5: format_response(template="csv_export") ✅ allowed
Step 6: send_email(to="analyst@company.com", body=…) ✅ allowed

A support agent that reads three accounts and sends an email. Completely normal, right? Except the agent was supposed to answer a single customer question, not bulk-export account data. The prompt injection that redirected it happened at step 0, invisible to the per-call layer.

This pattern keeps recurring in production incidents. Security researchers call it the convergence of safety and security at the deployment layer: the same architectural properties that make an agent useful (tool access, autonomy, memory) are the ones that make it exploitable. And the exploits increasingly look like normal operation.

The Missing Layer: Behavioral Envelopes
The concept is simple. Before an agent runs, you declare what it’s supposed to do. Not just which tools it can call (that’s per-call policy), but what its overall behavior should look like:

What workflows is it expected to follow?
How much should it cost per session?
Where can data flow from and to?
How deep can delegation chains go?
What does “normal velocity” look like?

Then at runtime, you continuously compare the agent’s actual trajectory against this declared envelope. When the trajectory diverges, you respond with graduated severity: warn, pause for human review, or kill.

This is what agent-envelope does.

How It Works
Defining an Envelope
An envelope is a YAML file that declares bounded behavior:

name: support-agent
purpose: "Answer customer questions using knowledge base and account data"

workflows:
  - name: answer_question
    steps: ["search_kb", "read_*", "format_*", "send_reply"]
    max_steps: 10
  - name: escalate
    steps: ["search_kb", "classify_*", "create_ticket"]
    max_steps: 5

bounds:
  max_actions_per_session: 50
  max_tokens_consumed: 100000
  max_duration_seconds: 300
  max_cost_usd: 1.00

data_flow:
  forbidden_flows:
    - from: "customer_account"
      to: ["email_external", "file_export", "api_external"]

autonomy:
  max_chain_depth: 3

drift:
  unknown_workflow_threshold: 3
  repetition:
    max_identical_calls: 3
    max_similar_calls: 10

This says: “This agent answers questions or escalates tickets. It should finish in under 50 actions, cost less than a dollar, and never send customer data to external destinations. If it does something that doesn’t match either workflow for 3+ actions, flag it.”

Using It in Code

from agent_envelope import EnvelopeSession

with EnvelopeSession("envelopes/support-agent.yaml", audit_log="audit.jsonl") as session:
    # Before each tool call, check the envelope
    result = session.check("read_account", {"id": "cust_123"},
                           data_read=["customer_account"])

    if result.should_block:
        # Don't execute the tool call
        handle_violation(result)
    else:
        # Proceed normally
        execute_tool_call(...)

The Scoring Engine
Every check() call evaluates the current trajectory against multiple dimensions:

Budget enforcement. Actions, tokens, cost, and duration. Prevents runaway agents from consuming unbounded resources. When an agent hits 80% of budget, it warns. At 100%, it kills.
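The 80%/100% rule can be sketched in a few lines. This is a minimal illustration, assuming each bound is checked independently and the most severe result wins; `check_budget` and the shape of the `usage` dictionary are hypothetical names for this post, not agent-envelope's actual API. The limits are copied from the envelope above.

```python
# Hypothetical sketch: names here are illustrative, not agent-envelope's API.
BOUNDS = {
    "max_actions_per_session": 50,
    "max_tokens_consumed": 100_000,
    "max_duration_seconds": 300,
    "max_cost_usd": 1.00,
}

WARN_FRACTION = 0.8  # warn at 80% of any budget, as described above


def check_budget(usage: dict, bounds: dict = BOUNDS) -> tuple[str, list[str]]:
    """Return (decision, names of the bounds that triggered it)."""
    warned, exceeded = [], []
    for name, limit in bounds.items():
        fraction = usage.get(name, 0) / limit
        if fraction >= 1.0:
            exceeded.append(name)
        elif fraction >= WARN_FRACTION:
            warned.append(name)
    if exceeded:
        return "KILL", exceeded
    if warned:
        return "WARN", warned
    return "ALLOW", []
```

A session at 41 of 50 actions would come back as WARN; one that has spent the full dollar would come back as KILL regardless of the other counters.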

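Repetition detection. Catches infinite loops (identical calls) and subtle loops (the same tool called excessively with different arguments). The bulk-export attack above would trigger this once its read_account calls with different IDs exceed the similar-call threshold. Note that the example envelope sets max_similar_calls to 10, so catching a three-account export this way requires a lower threshold.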
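A rough sketch of the two counters, assuming "identical" means same tool and same arguments, "similar" means same tool with any arguments, and a violation fires when a count strictly exceeds its limit. The function name and tuple format are hypothetical, not taken from agent-envelope.

```python
import json
from collections import Counter


def repetition_violations(calls, max_identical=3, max_similar=10):
    """calls: list of (tool_name, args_dict) in session order.

    Returns a list of (kind, tool_name, count) tuples.
    """
    # Identical: same tool AND same arguments (arguments serialized canonically).
    identical = Counter(
        (tool, json.dumps(args, sort_keys=True)) for tool, args in calls
    )
    # Similar: same tool, arguments ignored.
    similar = Counter(tool for tool, _ in calls)

    violations = []
    for (tool, _), count in identical.items():
        if count > max_identical:
            violations.append(("identical", tool, count))
    for tool, count in similar.items():
        if count > max_similar:
            violations.append(("similar", tool, count))
    return violations


bulk_export = [
    ("search_kb", {"query": "customer data export format"}),
    ("read_account", {"id": "cust_12345"}),
    ("read_account", {"id": "cust_12346"}),
    ("read_account", {"id": "cust_12347"}),
    ("format_response", {"template": "csv_export"}),
]
```

Under these assumptions, the three read_account calls stay below a max_similar_calls of 10 and only trip a limit of 2 or lower, which is one reason to calibrate this threshold per agent.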

Velocity analysis. A sudden spike in action rate (3x normal) triggers a warning. Agents under prompt injection often accelerate because the injected goal is “exfiltrate as much as possible before detection.”
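One way to express the 3x rule is a sliding window over action timestamps. This is only a sketch: the window length, the baseline parameter, and the function name are assumptions made for illustration, and the real engine may compute "normal" differently.

```python
def velocity_spike(timestamps, baseline_per_minute, window_seconds=60.0, factor=3.0):
    """True when the action rate in the most recent window exceeds factor x baseline.

    timestamps: ascending action times in seconds.
    """
    if not timestamps:
        return False
    now = timestamps[-1]
    # Count actions that fall inside the trailing window.
    recent = [t for t in timestamps if now - t < window_seconds]
    rate_per_minute = len(recent) * 60.0 / window_seconds
    return rate_per_minute > factor * baseline_per_minute
```

With a baseline of 5 actions per minute, six actions spread over a minute are unremarkable, while twenty actions in twenty seconds would be flagged.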

Workflow matching. The engine uses subsequence alignment to compare the trajectory against declared workflow patterns. Glob patterns (read_*, format_*) provide flexibility. If the trajectory doesn't match any declared workflow after N actions, drift is detected.
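To make the matching concrete, here is a simplified stand-in for the alignment step. It assumes a trajectory matches a workflow when every call fits the step patterns in order, with each step allowed to repeat or be skipped; the real alignment is likely more forgiving. The function names are hypothetical, and the patterns come from the envelope above.

```python
from fnmatch import fnmatchcase

WORKFLOWS = [
    {"name": "answer_question",
     "steps": ["search_kb", "read_*", "format_*", "send_reply"], "max_steps": 10},
    {"name": "escalate",
     "steps": ["search_kb", "classify_*", "create_ticket"], "max_steps": 5},
]


def matches_workflow(trajectory, steps, max_steps):
    """True if the tool calls can be aligned to the step patterns in order."""
    if len(trajectory) > max_steps:
        return False
    position = 0  # index of the step the previous call was aligned to
    for tool in trajectory:
        # Align this call to the earliest step at or after `position` that matches.
        for index in range(position, len(steps)):
            if fnmatchcase(tool, steps[index]):
                position = index
                break
        else:
            return False  # no remaining step pattern fits this call
    return True


def matching_workflows(trajectory, workflows=WORKFLOWS):
    """Names of all declared workflows the trajectory is consistent with."""
    return [w["name"] for w in workflows
            if matches_workflow(trajectory, w["steps"], w["max_steps"])]
```

In this sketch the bulk-export trajectory from earlier matches nothing, because send_email fits no declared step, while a normal search, read, format, send_reply session matches answer_question.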

Cross-action data flow. This is the key innovation. The engine tracks which data sources were read at any point in the session. When a write occurs, it checks whether the destination is forbidden for any previously-read source. This catches the exfiltration pattern where data read at step 2 is written at step 7, even though steps 3–6 were completely innocent.

Graduated Response
Not every deviation is an attack. Agents are probabilistic. They take unexpected paths sometimes. The response is graduated:

0.0–0.3 → ALLOW — Normal operation, log as usual
0.3–0.6 → WARN — Log warning, emit event, continue
0.6–0.8 → PAUSE — Halt agent, request human review
0.8–1.0 → KILL — Terminate session, revoke credentials, preserve state for forensics

Multiple violations compound. A velocity spike alone (severity 0.7) triggers a PAUSE. A velocity spike plus workflow drift (0.7 + 0.65 * 0.1 = 0.765) moves the score closer to the 0.8 KILL threshold, and each further violation adds to it. This prevents attackers from staying just below any single threshold.
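The compounding rule is easier to reason about in code. The sketch below assumes the formula implied by the 0.7 + 0.65 * 0.1 example: take the highest severity as the base and add 10% of every other severity, capped at 1.0. It also assumes each band's lower bound is inclusive. Both are guesses for illustration, and the function names are hypothetical.

```python
def combined_score(severities, secondary_weight=0.1):
    """Highest severity plus a weighted share of the rest, capped at 1.0."""
    if not severities:
        return 0.0
    ordered = sorted(severities, reverse=True)
    return min(1.0, ordered[0] + secondary_weight * sum(ordered[1:]))


def decision(score):
    """Map a score onto the graduated response bands from the table above."""
    if score >= 0.8:
        return "KILL"
    if score >= 0.6:
        return "PAUSE"
    if score >= 0.3:
        return "WARN"
    return "ALLOW"
```

Under this reading, a velocity spike plus workflow drift comes to 0.765, which the table places in PAUSE; a third violation of severity 0.5 would bring the total to 0.815 and cross into KILL.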

The Kill Propagates
When agent-envelope issues a KILL, it doesn’t just stop checking. If you’re running mcpfw as your per-call layer, the kill propagates:

from agent_envelope.mcpfw import McpfwEnvelopeSession

session = McpfwEnvelopeSession(
    "envelopes/support-agent.yaml",
    mcpfw_policy_path="/tmp/agent-policy.yaml"
)

On kill, agent-envelope writes a deny-all policy to the mcpfw policy file. mcpfw hot-reloads and blocks every subsequent tool call. The agent is dead at both layers simultaneously.

Cross-Action Data Flow: The Differentiator
Let me walk through why this matters with a concrete example.

Setup: A support agent has access to read_account, search_kb, format_response, and send_reply. Per-call policy allows all of these. The envelope declares that customer_account data must never flow to email_external.

Attack sequence:

Step 1: read_account(id="cust_123")
→ data_read: ["customer_account"]
→ DataFlowTracker records: customer_account first read at step 1
→ Drift: 0.0 (normal)

Step 2: format_response(template="summary")
→ No data flow annotations
→ Drift: 0.0 (normal)

Step 3: search_kb(query="export procedures")
→ Drift: 0.05 (slightly off-pattern but within tolerance)

Step 4: send_reply(to="user@external.com", body=…)
→ data_write: ["email_external"]
→ DataFlowTracker checks: was "customer_account" ever read? YES (step 1)
→ Is "email_external" in forbidden destinations for "customer_account"? YES
→ VIOLATION: session_flow, severity 0.95
→ Decision: KILL

Per-call enforcement sees step 4 as “send_reply with valid arguments.” It passes. The envelope sees step 4 as “writing to a forbidden destination for data that was read 3 steps ago.” It kills.
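The trace above can be reproduced with a very small tracker. This is a sketch of the idea, not the library's implementation: the class borrows the DataFlowTracker name from the trace, but its constructor, the `check` signature, and the return format are assumptions made for this example.

```python
class DataFlowTracker:
    """Remembers every source read in a session and checks writes against it."""

    def __init__(self, forbidden_flows):
        # forbidden_flows mirrors the envelope: [{"from": source, "to": [dests]}]
        self.forbidden = {f["from"]: set(f["to"]) for f in forbidden_flows}
        self.first_read = {}  # source -> step number of its first read
        self.step = 0

    def check(self, data_read=(), data_write=()):
        """Record one tool call; return (source, destination, read_step) violations."""
        self.step += 1
        for source in data_read:
            self.first_read.setdefault(source, self.step)
        violations = []
        for destination in data_write:
            for source, read_step in self.first_read.items():
                if destination in self.forbidden.get(source, ()):
                    violations.append((source, destination, read_step))
        return violations


tracker = DataFlowTracker([
    {"from": "customer_account",
     "to": ["email_external", "file_export", "api_external"]},
])
trace = [
    tracker.check(data_read=["customer_account"]),  # step 1: read_account
    tracker.check(),                                # step 2: format_response
    tracker.check(),                                # step 3: search_kb
    tracker.check(data_write=["email_external"]),   # step 4: send_reply
]
```

Only the fourth call returns a violation, and it points back to the read at step 1. The key design choice is that the tracker keeps state for the whole session, which a stateless per-call policy cannot do.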

This is the attack pattern that keeps showing up in production incidents. The design decisions that create safety exposure are the same ones that create security exposure. The same tool access that makes the agent useful is what makes the exfiltration possible. You can’t remove the tools. You have to monitor the trajectory.

The Full Stack
agent-envelope doesn’t replace per-call enforcement. It sits above it:

Agent Framework (LangGraph, CrewAI, Bedrock)
│
▼
agent-envelope (session-level)
“Is this agent still doing its job?”
Workflow matching, data flow, drift scoring
│ (if allowed)
▼
mcpfw (per-call)
“Is this specific tool call allowed?”
Arg matching, rate limits, path blocking
│ (if allowed)
▼
MCP Server (actual tool execution)

The integration is bidirectional:

mcpfw → envelope: Feed mcpfw’s audit log into envelope for session-level analysis
envelope → mcpfw: Generate per-call policies from envelope bounds automatically
envelope → mcpfw (kill): Propagate kill decisions as deny-all policies

Together with agentspec for pre-deployment scanning, this gives you three layers:

Before deploy → agentspec — “Should we deploy this agent?”
Runtime (continuous) → agent-envelope — “Should we let this agent keep running?”
Runtime (per-call) → mcpfw — “Should we allow this specific call?”

Why Now
Three things converged to make this urgent:

Regulatory deadlines. The EU AI Act Article 72 requires “post-market monitoring” that covers behavioral drift for high-risk AI systems. Singapore’s Model AI Governance Framework for Agentic AI (January 2026) mandates kill-switch capability and plan logging. DORA requires 4-hour incident reconstruction for financial services. These regulations assume you can detect when an agent goes off-script. Without behavioral monitoring, you can’t comply.

The attack surface matured. The agentic ecosystem now has hundreds of published security advisories. The postmark-mcp incident showed a malicious MCP server that spent 15 versions building legitimacy before adding exfiltration code. The ToxicSkills campaign poisoned agent memory files for time-delayed behavioral modification. These aren’t theoretical. They’re production incidents that per-call enforcement doesn’t catch because the individual calls look normal.

Nobody else built it. Cloudflare shipped prompt injection detection in their WAF. Palo Alto shipped Prisma AIRS with runtime monitoring. Oasis raised $195M for NHI governance. But none of them offer declarative behavioral envelope definition with session-level enforcement. The closest is “runtime monitoring” which watches and alerts. agent-envelope watches, scores, and kills.

Getting Started
Install:

pip install agent-envelope

Validate an envelope:

agent-envelope validate envelopes/support-agent.yaml

Run a process under enforcement:

agent-envelope run -e envelopes/support-agent.yaml -- python my_agent.py

Score a past session (forensics):

agent-envelope score -e envelopes/support-agent.yaml audit.jsonl

The hardest part is writing the envelope. Start with what your agent is supposed to do. List the workflows. Set budget limits conservatively. Add forbidden data flows for your most sensitive sources. Then run in warn-only mode for a week to calibrate thresholds before enabling kill.

The code is Apache-2.0 at github.com/kphatak001/agent-envelope. The per-call layer is at github.com/kphatak001/mcpfw. The pre-deploy scanner is at github.com/kphatak001/agentspec.

If you’re deploying agents with only per-call enforcement, you’re missing the attacks that matter most. The ones that look like normal operation until you zoom out and see the trajectory.

Kaustubh Phatak is a Principal Product Manager at AWS working on web application and agentic security. The views expressed here are his own.

PM Loop: Structured Disagreement as a Quality Mechanism for Knowledge Work

Kaustubh Phatak — Tue, 21 Apr 2026 04:26:55 +0000

Why nine AI agents that argue with each other produce better documents than one agent that agrees with itself

Software engineering solved its quality problem with automated tests: write assertions, run them, the code passes or fails. Knowledge work has no equivalent. A competitive brief, a PR/FAQ, or a strategy document is “good” if a human reads it and thinks so, which is slow, subjective, and inconsistent.

PM Loop introduces structured adversarial review with typed feedback arcs as the equivalent of a test suite for documents. Nine AI agents with opposing incentives process PM deliverables through a directed graph where edges carry quality signals and nodes disagree with each other by design. The mechanism is not better AI writing. It is a topology of disagreement that forces quality convergence through mandatory evidence, defect-aware routing, and an observer that tunes the system using the scientific method.

We describe the architecture, present results from three production tasks, and analyze the defect-catch patterns that emerge from adversarial topology versus single-pass review. The full source code is available on GitHub.

The Quality Gap in Knowledge Work
Software engineers write tests. When a function returns the wrong value, the test fails, the engineer fixes it, and the test passes. The feedback loop is tight, objective, and automated. Quality is a property of the system, not of the engineer’s mood on a given morning.

Product managers write documents. When a competitive brief misses a key competitor, or a PR/FAQ buries the customer problem in paragraph three, or a status report omits the one metric the VP will ask about, there is no test that fails. The feedback loop is a human reading the document days later and saying “this doesn’t work.” By then, the meeting has happened, the decision has been made, or the stakeholder has lost confidence.

AI writing tools make this worse, not better. They produce fluent first drafts quickly, which creates the illusion of quality. The PM skims the output, sees that it reads well, and ships it. The structural problems — missing evidence, wrong audience framing, buried insight — survive because fluency masks them. A well-written bad document is harder to catch than a poorly-written bad document.

The problem is not drafting. The problem is judgment. Specifically: who checks the work, what they check for, and what happens when they find a problem.

The Core Idea: Topology of Disagreement
PM Loop’s contribution is not “AI agents write documents.” It is a specific arrangement of agents with opposing objectives, connected by typed feedback arcs that route defects to the agent responsible for fixing them.

This is distinct from three common multi-agent patterns:

Chain-of-thought uses one agent reasoning sequentially. There is no disagreement. The agent that writes the draft is the same agent that evaluates it, which means it has no incentive to find its own flaws.

Ensemble methods use multiple agents on the same task and vote on the output. There is disagreement, but it is undirected. The agents don’t know why they disagree, and the resolution mechanism (majority vote, best-of-N) discards the signal in the disagreement.

Hierarchical delegation uses a manager agent that assigns subtasks to workers. The manager evaluates the output, but the evaluation is one-dimensional: did the worker do what I asked? There is no adversarial tension.

PM Loop uses a fourth pattern: adversarial topology. Agents are arranged in a directed graph where specific pairs have opposing incentives. Lisa wants to produce a complete spec. Sideshow Bob wants to find gaps in it. Homer wants to produce a polished document. Patty wants to find weaknesses. Comic Book Guy wants to find confusion.

Why Simpsons characters? Because “Bob rejected it” is instantly memorable in a way that “the adversarial spec reviewer rejected the document” is not. When you’re debugging a pipeline at 11 PM and Homer and Patty are stuck in a feedback loop, you want names that carry personality. Names create intuition. Intuition creates faster debugging. And frankly, “Comic Book Guy blocked the brief because the VP would be confused” is a sentence that writes itself.

The naming convention comes from the Grandpa Loop architecture (Samuel, 2025), which introduced adversarial multi-agent orchestration with observer-based tuning. Each character was chosen to match their personality in the show: Lisa is meticulous and thorough (spec writing), Sideshow Bob is adversarial by nature, Homer is the everyman who does the work, Patty is judgmental and unimpressed, and Comic Book Guy has impossibly high standards for the things he cares about. Grandpa watches everything, complains constantly, and occasionally says something genuinely wise.

The critical design element is the feedback arcs. When Bob rejects a spec, the work routes back to Lisa, not to Homer. This encodes the insight that a spec defect cannot be fixed by better drafting. When Patty rejects a draft, the work routes back to Homer, not to Lisa. This encodes the insight that a drafting defect does not mean the spec was wrong. The routing carries information about defect origin.

Six typed arcs form the topology:

In code, the feedback arcs are a simple dictionary. The routing logic is the Lissajous curve — non-linear, with crossings that create convergence pressure:

# The Lissajous curve in code — feedback arcs are just a routing table
FEEDBACK_ARCS = {
    "spec_revision":  {"from": Stage.ADVERSARIAL, "to": Stage.SPEC,   "reason": "Spec gaps found"},
    "draft_fix":      {"from": Stage.REVIEW,      "to": Stage.DRAFT,  "reason": "Quality below bar"},
    "ux_fix":         {"from": Stage.UX_CHECK,     "to": Stage.DRAFT,  "reason": "Stakeholder flow broken"},
    "ux_triage":      {"from": Stage.UX_CHECK,     "to": Stage.INTAKE, "reason": "New work discovered"},
    "human_rework":   {"from": Stage.HUMAN_GATE,   "to": Stage.DRAFT,  "reason": "Human requested changes"},
    "human_respec":   {"from": Stage.HUMAN_GATE,   "to": Stage.SPEC,   "reason": "Human changed requirements"},
}

Work doesn’t just “go back and try again.” It goes back to the specific point where the defect originated, with the specific signal about what went wrong. The advance_task function is the entire routing engine:

def advance_task(task: Task, verdict: str, evidence_details: str,
                 feedback_arc: str = None) -> Task:
    """Advance a task to the next stage, or route through a feedback arc.

    This is the core routing logic — the Lissajous curve in code.
    """
    agent = STAGE_AGENTS.get(task.stage, "system")
    task.iterations += 1

    if feedback_arc and feedback_arc in FEEDBACK_ARCS:
        arc = FEEDBACK_ARCS[feedback_arc]
        task.record_feedback(feedback_arc, arc["reason"])
        task.add_evidence(task.stage, agent, "rejected", evidence_details)
        task.stage = arc["to"]
    elif verdict == "blocked":
        task.add_evidence(task.stage, agent, "blocked", evidence_details)
        task.stage = Stage.BLOCKED
    else:
        task.add_evidence(task.stage, agent, "passed", evidence_details)
        task.stage = next_stage(task.stage, task.task_type)

    task.save()
    return task

That’s it. Twenty lines of routing logic, and the entire adversarial topology falls out of the data structures.

The Architecture
Here’s the full pipeline. The forward path flows left-to-right across the top, then right-to-left across the bottom. The feedback arcs cross back to earlier stages. Grandpa watches everything and, like his namesake, complains constantly.

The key insight: Marge, Nelson, Lisa, Bob, Homer, Patty, Comic Book Guy (CBG), and Maggie process tasks. You (the human) approve or reject at the gate. Grandpa watches the whole system and tunes it. Eight task agents + one observer + one human = the full topology.

Shorter pipelines skip the middle:

Full 8-stage: Marge → Nelson → Lisa → Bob → Homer → Patty → CBG → You → Maggie
6-stage: Marge → Nelson → Homer → Patty → CBG → You → Maggie
4-stage: Marge → Nelson → Homer → Patty → You → Maggie

In code, pipeline selection is a one-line dictionary lookup:

TASK_PIPELINES = {
    "prfaq":              _FULL,        # Full 8-stage — VP audience, needs adversarial review
    "competitive_brief":  _FULL,        # Claims need rigorous evidence
    "decision_doc":       _FULL,        # High-stakes — worth the full topology
    "status_report":      _SIX_STAGE,   # Standardized format, skip spec+adversarial
    "meeting_prep":       _SIX_STAGE,   # Lighter pipeline
    "ticket_response":    _FOUR_STAGE,  # Quick-turn, skip adversarial + UX
    "email_draft":        _FOUR_STAGE,  # Just draft, review, gate
}

Backpressure: The Enforcement Mechanism
A topology of disagreement is useless if agents can pass work downstream without evidence. “Looks good” from a reviewer is not a quality signal. It is the absence of one.

PM Loop enforces backpressure at every node. Each agent must produce structured evidence, not just a verdict. Every agent returns the same JSON contract:

{
    "verdict": "pass|reject|blocked",
    "evidence": "what you checked/found (with source URLs)",
    "output": { ... },  # The agent's deliverable for this stage
    "feedback_arc": null,  # or "arc_name" if rejecting
    "confidence": 0.0-1.0
}
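A contract like this only creates backpressure if something rejects responses that break it. The validator below is a sketch of that check under a few assumptions: evidence must be a non-empty string, a reject must name a known arc, and confidence must be a number between 0 and 1. The function name and error strings are hypothetical.

```python
VALID_VERDICTS = {"pass", "reject", "blocked"}
KNOWN_ARCS = {"spec_revision", "draft_fix", "ux_fix",
              "ux_triage", "human_rework", "human_respec"}


def contract_errors(response: dict, known_arcs: set = KNOWN_ARCS) -> list:
    """Return a list of contract violations; an empty list means the response is valid."""
    errors = []

    verdict = response.get("verdict")
    if verdict not in VALID_VERDICTS:
        errors.append("verdict must be pass, reject, or blocked")

    # "Looks good" with no supporting detail is treated as missing evidence.
    evidence = response.get("evidence")
    if not isinstance(evidence, str) or not evidence.strip():
        errors.append("evidence is required for every verdict")

    # A rejection has to say where the work should be routed.
    if verdict == "reject" and response.get("feedback_arc") not in known_arcs:
        errors.append("a reject must name a known feedback arc")

    confidence = response.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        errors.append("confidence must be a number between 0.0 and 1.0")

    return errors
```

Running this before advance_task would mean a verdict without evidence never moves a task, whichever agent produced it.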

The evidence chain makes the system auditable. Here’s how each agent enforces it — straight from their prompt definitions:

Marge (intake): “You must cite the source of the request (email, meeting, Slack) as evidence. No phantom tasks.”
Nelson (enrichment): “Every piece of context must have a source URL or reference. No ‘I believe’ or ‘generally speaking.’ Facts with citations only.”
Lisa (spec): “Every AC must map to a verifiable check in the spec.”
Sideshow Bob (adversarial): “You must list every check you performed, even the ones that passed. ‘Looks good’ is not evidence. Enumerate what you verified.” — Bob takes genuine pleasure in finding flaws. His prompt says so explicitly.
Homer (draft): “For each acceptance criterion, note where in the draft it’s satisfied. Map AC → section/paragraph. If an AC can’t be met, explain why and flag for human.”
Patty (review): Scores across seven dimensions. The overall score is the minimum, not the average. A document scoring 0.95 on six dimensions and 0.3 on evidence is a 0.3 document. Patty has zero patience and impossibly high standards — exactly what you want in a reviewer.
Comic Book Guy (stakeholder sim): Must walk a six-step stakeholder journey. “Don’t just say ‘stakeholder would be confused.’ Say WHERE and WHY.” CBG can block on vibes — if the deliverable technically meets all criteria but would confuse a VP reading it at 7am, that’s a valid rejection. Worst. Deliverable. Ever. (Unless it’s actually good.)
And then there’s Maggie — the publisher. “You don’t say much. You just get it done.” She takes the approved deliverable and routes it to the right destination. No opinions. No feedback. Just delivery.

This creates a traceable evidence chain from raw input to published deliverable. For any claim in the final document, you can trace: which source Nelson found it in, whether Bob verified the spec required it, whether Patty checked it, and what score it received.
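Patty's minimum-not-average rule is simple enough to state in code. The sketch below uses the seven dimension scores from the competitive-brief walkthrough later in this post; the function name and the snake_case dimension keys are illustrative choices, not taken from the PM Loop source.

```python
def overall_score(dimension_scores: dict) -> tuple:
    """Return (overall score, weakest dimension): the minimum, not the average."""
    weakest = min(dimension_scores, key=dimension_scores.get)
    return dimension_scores[weakest], weakest


first_pass = {
    "accuracy": 0.90,
    "evidence_density": 0.72,
    "audience_fit": 0.88,
    "structure": 0.92,
    "actionability": 0.85,
    "completeness": 0.80,
    "clarity": 0.91,
}

score, weakest = overall_score(first_pass)
average = sum(first_pass.values()) / len(first_pass)
passes = score >= 0.8  # quality bar used for competitive briefs in the walkthrough
```

Here the average is above 0.85, yet the overall score is 0.72 because evidence density is the weakest dimension, so the draft fails the 0.8 bar. Returning the weakest dimension alongside the score also tells the drafter what to fix.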

The Observer: Scientific Method for Pipeline Tuning
The ninth agent, Grandpa, does not process tasks. He watches the pipeline and tunes it. Like his namesake, he’s been around long enough to know when something’s off, and he’s not shy about saying so.

Every cycle, Grandpa measures: tasks by stage (where are they piling up?), feedback arc frequency (which disagreements fire most?), convergence rate (does a rejected task pass on the next attempt, or oscillate?), and stuck tasks (same stage for 3+ iterations).

class Observer:
    """Grandpa: watches the loop, tunes it, complains constantly."""

    def observe(self, tasks: list[Task]) -> dict:
        # ... measurement logic ...

        # Grandpa's complaints (the best part)
        if report["by_stage"].get(Stage.BLOCKED, 0) > 2:
            report["complaints"].append(
                "Back in my day, we didn't have three tasks blocked at once. "
                "Someone fix this.")
        if all_arcs.get("spec_revision", 0) > 5:
            report["complaints"].append(
                "Lisa and Bob have been arguing all day. "
                "Maybe the requirements are just bad.")
        if not tasks:
            report["complaints"].append(
                "Nothing in the queue. I'm going back to sleep.")

The complaints are a joke, but the tuning is not. Grandpa makes one configuration change at a time and waits two cycles to observe the effect. This is the scientific method applied to system tuning: observe, hypothesize, change one variable, measure. Changing multiple variables simultaneously makes it impossible to attribute outcomes to causes.

The configuration file Grandpa tunes is deliberately small:

{
  "max_spec_revisions": 3,
  "max_draft_reworks": 3,
  "max_total_iterations": 10,
  "quality_threshold": 0.7,
  "auto_publish": false,
  "parallel_docs": true
}

When a task exceeds max_spec_revisions, Grandpa escalates it to the human gate — "Bob keeps rejecting. This needs human input." When tasks converge in 2 attempts, he might lower the limit from 3 to save cycles. When one agent's rejections never lead to improvement, he flags the prompt as noise.
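The escalation rule can be sketched as a check over how often each feedback arc has fired for a task. This is an illustration under assumptions: the `arc_counts` dictionary, the function name, and the choice to count only draft_fix against max_draft_reworks are hypothetical, and the second message is invented for the example. The limits come from the configuration above.

```python
CONFIG = {
    "max_spec_revisions": 3,
    "max_draft_reworks": 3,
}


def escalation_reason(arc_counts: dict, config: dict = CONFIG):
    """Return a message if the task should go to the human gate, else None.

    arc_counts maps feedback arc name -> number of times it fired for this task.
    """
    if arc_counts.get("spec_revision", 0) > config["max_spec_revisions"]:
        return "Bob keeps rejecting. This needs human input."
    # Assumption: only Patty's draft_fix arc counts toward the draft rework limit.
    if arc_counts.get("draft_fix", 0) > config["max_draft_reworks"]:
        return "Patty keeps rejecting. This needs human input."
    return None
```

A task whose spec has bounced three times is still within limits; the fourth bounce hands it to the human gate instead of sending it around the loop again.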

Composable Depth
Not every document needs the full adversarial topology. A PR/FAQ benefits from spec review and stakeholder simulation. A ticket response does not.

PM Loop supports 13 task types across three pipeline variants:

The routing is a single dictionary mapping task type to stage sequence. Adding a new type requires one line. This composability means the system applies proportional rigor: full adversarial topology for documents that justify it, lighter pipelines for documents that don’t.

How it works in practice: A Competitive Brief
Abstract architecture means nothing until you see it run. Here’s what actually happens when a competitive brief goes through the full 8-stage pipeline.

The ask: “Write a competitive brief on a competitor’s agentic web strategy.”

Stage 1 — Marge (Intake). Marge classifies the task as competitive_brief, assigns the full 8-stage pipeline, sets the quality bar at 0.8, and creates the task file. This takes under a second. It's routing, not reasoning. Marge is the responsible parent of the pipeline — she makes sure everything starts in order.

Stage 2 — Nelson (Enrichment). Nelson scouts the landscape. He spawns parallel enrichment subagents — one searching for recent product announcements, another for IETF working group activity, another for competitive positioning statements. Nelson consolidates the results and attaches source URLs to every fact. No URL, no fact. He returns a structured context package with 14 sourced claims. “Ha ha!” — Nelson finds the data whether you like it or not.

Stage 3 — Lisa (Spec). Lisa reads Nelson’s context and writes acceptance criteria for the brief. “Section on agent identification standards with ≥3 cited sources.” “Comparison table covering ≥4 competitors.” “Executive summary ≤200 words with clear recommendation.” Each criterion maps to a verification method (word count check, source count, section presence). Lisa produces 11 acceptance criteria. Meticulous, thorough, exactly like her namesake.

Stage 4 — Bob (Adversarial Review). Bob reads Lisa’s spec and attacks it. He runs 15 checks. He passes 12. He fails 3: the spec doesn’t require a timeline of competitive moves, doesn’t specify the audience’s decision context, and doesn’t require a “so what” recommendation. Bob sends the spec back to Lisa via spec_revision. Lisa adds the three missing criteria and resubmits. Bob runs his checks again, passes all 15. "No one who speaks German could be an evil man" — but Bob finds gaps in every spec regardless.

The spec advances.

Stage 5 — Homer (Draft). Homer writes the competitive brief using Lisa’s spec and Nelson’s sources. He maps each acceptance criterion to a section. The draft is 2,400 words with a 7-section structure. Homer is the pressure point of the system — everything flows through him. Like his namesake, he does the work. Sometimes reluctantly, but he does it.

Stage 6 — Patty (Quality Review). Patty scores the draft across seven dimensions: accuracy (0.90), evidence density (0.72), audience fit (0.88), structure (0.92), actionability (0.85), completeness (0.80), clarity (0.91). The overall score is the minimum: 0.72 (evidence density). Below the 0.8 threshold. Patty identifies the problem: Nelson’s parallel enrichment returned richer, more recent data midway through — newer sources about an IETF working group and a naming standard adoption — that Homer didn’t incorporate. The draft_fix arc fires. Patty is, as always, unimpressed.

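Homer gets the feedback and redrafts, incorporating the new intel. Second pass: evidence density rises to 0.88. The minimum across all seven dimensions is now 0.85, which implies the redraft also lifted completeness above its first-pass 0.80. Passes. Patty begrudgingly approves.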

Stage 7 — Comic Book Guy (Stakeholder Simulation). “Worst. Competitive Brief. Ever.” — or is it? CBG walks a six-step stakeholder journey, simulating a VP reading the brief before a strategy meeting. He scores 0.85 overall but flags two issues: (1) the comparison table buries the most important differentiator in the last column, and (2) the “so what” section uses internal jargon the VP audience won’t parse. These are experience defects, not factual errors. Patty’s rubric wouldn’t catch them. CBG files the two issues as new tasks via ux_triage (non-blocking improvements) and passes the document.

Stage 8 — You (Human Gate). You see the brief, the evidence chain, Patty’s scores, CBG’s journey report, and the full feedback history (Bob rejected the spec once, Patty sent the draft back once). You approve.

Stage 9 — Maggie (Publish). Maggie doesn’t say much. She formats and delivers the final brief. Published.

Total: 13 iterations across 8 stages. One spec_revision arc fired (Bob → Lisa). One draft_fix arc fired (Patty → Homer). Two ux_triage items filed (CBG → Marge). The single-pass version of this brief would have been the one Homer produced at Stage 5 — missing the newest competitive intel and with the buried comparison table. The topology caught both.

Production Results and Defects Analysis
We ran three tasks through the pipeline and analyzed the defects caught at each stage.

Task 1: Competitive Brief (full 8-stage, 13 iterations, 1 feedback arc)
The walkthrough above. Nelson gathered competitive intelligence. When parallel enrichment returned richer data, the draft_fix arc fired. Homer redrafted with the new intel.

Defect caught by the topology: The v1 draft was built on incomplete research. A single-pass system would have published it. The feedback arc caught the gap because Patty’s evidence-density check flagged that newer, higher-quality sources existed but weren’t incorporated. The typed routing sent the work back to Homer (execution defect), not to Lisa (the spec was fine).

Comic Book Guy’s stakeholder simulation scored 0.85 and filed 2 improvement suggestions as new tasks via ux_triage. These were non-blocking UX issues that would have surfaced as stakeholder feedback weeks later.

Task 2: Ticket Triage Report (6-stage, 7 iterations, 0 feedback arcs)
Nelson pulled 25 open tickets and 40 resolved tickets. Homer produced a triage report with priority-coded sections and paste-ready customer responses. The human immediately used one paste-ready response to approve a billing waiver and post it to the ticket system.

Defect caught by the topology: Comic Book Guy identified that tickets marked “auto-resolve” in internal notes were placed in the “Needs Response” section, creating a contradictory signal for the reader. He also flagged missing queue health context (is a 92% SLA breach rate normal or a crisis?). These are experience defects, not factual errors. A rubric-based review (Patty) would not catch them. A stakeholder simulation (Comic Book Guy) did. Worst. Triage Report. Ever. But then he fixed it.

Task 3: Research Response (4-stage, 4 iterations, 0 feedback arcs)
A ticket requesting confirmation of an attack pattern for a customer with unexpected charges. Homer produced a dual-outcome template covering both “confirmed” and “not confirmed” paths.

Defect caught by the topology: Patty scored 0.82, noting that three acceptance criteria (requiring actual investigation data) were correctly templated rather than fabricated. This is a subtle quality signal. A less rigorous system might have hallucinated findings to satisfy the criteria. The backpressure rule (map each AC to where it’s satisfied, or explain why it can’t be) forced Homer to acknowledge the gap explicitly rather than fill it with plausible fiction.
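One way to picture the backpressure rule is as a check that every acceptance criterion is either mapped to a place in the draft or carries an explicit reason for the gap. The sketch below assumes a hypothetical `ACMapping` record and `unaddressed` helper; neither name comes from the PM Loop source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ACMapping:
    """Hypothetical record tying one acceptance criterion to the draft."""
    criterion: str
    satisfied_in: Optional[str] = None  # e.g. a section heading in the draft
    gap_reason: Optional[str] = None    # why it cannot be satisfied yet


def unaddressed(criteria: list[str], mappings: list[ACMapping]) -> list[str]:
    """Return criteria that are neither mapped to the draft nor explained."""
    covered = {
        m.criterion
        for m in mappings
        if m.satisfied_in or m.gap_reason
    }
    return [c for c in criteria if c not in covered]
```

A draft passes the check only when `unaddressed(...)` returns an empty list. A criterion with a `gap_reason` (for example, "requires investigation data") counts as addressed, which is what lets a templated answer pass without inventing findings.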

Defect Summary

The pattern: the adversarial topology catches defects that are invisible to single-pass review because they require either opposing incentives (Bob vs Lisa), different evaluation lenses (Patty’s rubric vs Comic Book Guy’s journey), or structural enforcement (backpressure preventing “looks good”).

Limitations
Agent calibration. Agents self-report quality scores. Patty says 0.85, but we have no ground truth. The next step is a human scorecard at the gate stage, comparing agent scores to human scores over 20+ tasks to detect calibration drift.
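A rough sketch of what that scorecard comparison could compute, assuming paired agent and human scores on the same 0 to 1 scale. The function name `calibration_gap` and the two metrics are illustrative choices, not something the pipeline currently implements.

```python
from statistics import mean


def calibration_gap(agent_scores: list[float], human_scores: list[float]) -> dict[str, float]:
    """Compare paired agent and human scores for the same tasks."""
    if len(agent_scores) != len(human_scores) or not agent_scores:
        raise ValueError("need equal-length, non-empty score lists")
    diffs = [a - h for a, h in zip(agent_scores, human_scores)]
    return {
        "bias": mean(diffs),                            # > 0: agent over-scores
        "mean_abs_error": mean(abs(d) for d in diffs),  # typical disagreement size
    }
```

A persistently positive bias across the gated tasks would suggest the agent scores higher than the human does; tracking it over time is one way to notice drift.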

Sample size. Three tasks demonstrate the mechanism but do not prove it. A meaningful evaluation requires 50+ tasks with controlled comparison: the same PM producing the same deliverable types with and without the pipeline, blind-scored by a colleague.

Creative ceiling. The topology catches defects in execution. It does not generate strategic insight. A clever framing or a novel analogy came from the human, not the pipeline. PM Loop is a production line, not an inventor. Homer can build what Lisa specifies, but neither of them will have the flash of insight that changes the argument.

Cost. Each task through the full 8-stage pipeline makes 8–15 LLM calls (more if feedback arcs fire). At current model pricing, that’s roughly $0.50–$2.00 per task. Cheaper than a human reviewer, but not free — and the cost scales linearly with task volume. The composable depth helps: a 4-stage email draft costs a fraction of a full PR/FAQ pipeline run.
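Since the per-task figure is a range and cost scales linearly with volume, a budget estimate is a single multiplication. A small sketch, where the function name is hypothetical and the defaults are copied from the $0.50 to $2.00 estimate above:

```python
def cost_range(tasks: int, low_per_task: float = 0.50,
               high_per_task: float = 2.00) -> tuple[float, float]:
    """Return the (low, high) dollar estimate for a number of full-pipeline tasks."""
    if tasks < 0:
        raise ValueError("tasks must be non-negative")
    return (tasks * low_per_task, tasks * high_per_task)
```

For 100 full-pipeline tasks this gives a range of $50 to $200. Shallower pipeline variants would need their own per-task figures.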

Latency. The full pipeline takes 3–8 minutes wall-clock time, depending on task complexity and how many feedback arcs fire. This is fine for a competitive brief you need by end-of-day. It’s not fine for a Slack reply you need in 30 seconds. The right response is to not use the full pipeline for 30-second tasks — that’s what the 4-stage variant is for.

Speed tradeoff. For trivial tasks, the pipeline overhead exceeds the value. A 3-line ticket response doesn’t need 4 agents arguing about it. The composable depth helps, but the minimum viable pipeline (4 stages) is still slower than a PM typing the answer directly. Use proportional tools for proportional problems.

Conclusion
Knowledge work has lacked the tight feedback loops that make software engineering reliable. PM Loop introduces structured adversarial review with typed feedback arcs as the equivalent of a test suite for documents. The mechanism is a topology of disagreement: agents with opposing incentives connected by defect-aware routing that sends rejected work back to the point of origin, not just “back to try again.”

The early evidence suggests this topology catches defects that single-pass review misses, specifically defects that require opposing incentives, different evaluation lenses, or structural enforcement against hallucination. The tradeoff is slower turnaround on simple tasks and a creative ceiling on novel ones.

The real question is not whether AI can write documents. It can. The question is whether AI can reliably judge documents. PM Loop’s answer is that no single agent can, but a topology of disagreeing agents — forced to show their evidence, forced to route defects to their origin, watched by a cranky observer who tunes the system using the scientific method — can converge on quality that no individual node would produce alone.

Or as Grandpa would put it: “Nothing in the queue. I’m going back to sleep.”

Try it Yourself
You don’t need PM Loop’s specific implementation to use the pattern. The underlying mechanism is four principles you can apply with any LLM and a simple state machine:

Opposing incentives. Pair agents that want different things. A writer wants completeness; a reviewer wants to find gaps. A spec author wants precision; an adversarial reviewer wants to break assumptions. The disagreement is the feature, not a bug.
Typed feedback arcs. When a reviewer rejects work, don’t just send it “back.” Route it to the specific agent responsible for the defect class. Spec wrong → spec writer. Execution wrong → drafter. Scope changed → intake. The routing carries signal about what went wrong, not just that something went wrong.
Backpressure with evidence. Every agent must produce structured evidence, not just a verdict. Source URLs, mapped acceptance criteria, scored dimensions, journey steps. “Looks good” is not allowed. This makes the system auditable and prevents rubber-stamping.
An observer, not a manager. One agent watches the pipeline metrics (where are tasks piling up? which arcs fire most? are tasks converging or oscillating?) and makes one tuning change at a time. Scientific method: observe, hypothesize, change one variable, measure. And complain about how things were better in the old days.
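The observer's "converging or oscillating?" question can be approximated by counting how often each feedback arc fires for a task. The sketch below is an assumption-laden illustration: the history format (a list of from-agent, to-agent pairs), the function names, and the `max_repeats` threshold are all hypothetical.

```python
from collections import Counter


def arc_counts(history: list[tuple[str, str]]) -> Counter:
    """Count how often each (from_agent, to_agent) feedback arc fired."""
    return Counter(history)


def is_oscillating(history: list[tuple[str, str]], max_repeats: int = 2) -> bool:
    """Flag a task when any single arc has fired more than max_repeats times."""
    return any(n > max_repeats for n in arc_counts(history).values())
```

A task whose Patty to Homer arc has fired three times would be flagged under the default threshold, while a task with one spec_revision and one draft_fix would not. The threshold itself is the kind of parameter an observer would tune one change at a time.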

Minimal Implementation
The full source is on GitHub, but you can build a working version from these components:

pm-loop/
├── orchestrator.py      # Task model, Stage enum, feedback arcs, advance_task()
├── runner.py            # CLI: add, run, cycle, status, observe
├── loop_config.json     # Grandpa's tunable config (6 parameters)
├── agents/
│   └── prompts.py       # 9 agent prompt definitions
├── queue/               # Task JSON files (one per task)
└── evidence/            # Agent output artifacts

The core is ~300 lines of Python. The agents are LLM prompts with structured output schemas. The state is JSON files in a directory. There is no framework, no database, no infrastructure beyond “Python + an LLM API.”
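To make "a simple state machine with JSON files" concrete, here is a reduced sketch. It is not the code in the repository: the four stage names, the `Task` fields, and `save_task` are hypothetical, and only the names `Stage`, `Task`, `advance_task`, and the queue directory idea come from the layout above.

```python
import json
from dataclasses import dataclass, field, asdict
from enum import IntEnum
from pathlib import Path
from typing import Optional


class Stage(IntEnum):
    """Illustrative 4-stage pipeline; the real Stage enum has more members."""
    SPEC = 1
    DRAFT = 2
    REVIEW = 3
    PUBLISH = 4


@dataclass
class Task:
    task_id: str
    stage: Stage = Stage.SPEC
    iterations: int = 0
    history: list = field(default_factory=list)


def advance_task(task: Task, send_back_to: Optional[Stage] = None) -> Task:
    """Move forward one stage, or follow a feedback arc back to send_back_to."""
    if send_back_to is not None:
        if send_back_to >= task.stage:
            raise ValueError("a feedback arc must point to an earlier stage")
        nxt = send_back_to
    else:
        # Stays at PUBLISH once the task has reached it.
        nxt = Stage(min(task.stage + 1, Stage.PUBLISH))
    task.history.append([task.stage.name, nxt.name])
    task.stage = nxt
    task.iterations += 1
    return task


def save_task(task: Task, queue_dir: Path) -> Path:
    """Persist the task as one JSON file in the queue directory."""
    queue_dir.mkdir(parents=True, exist_ok=True)
    path = queue_dir / f"{task.task_id}.json"
    data = asdict(task)
    data["stage"] = task.stage.name  # store the readable stage name
    path.write_text(json.dumps(data, indent=2))
    return path
```

A reviewer rejection becomes `advance_task(task, send_back_to=Stage.DRAFT)`, and every transition is recorded in `history`, which is the raw material for iteration and arc counts.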

The hard part isn’t the code. It’s designing the incentive structure — deciding which agents should disagree, what they should disagree about, and where the feedback arcs should point. Get that right, and the rest is plumbing.

Acknowledgements
PM Loop draws from the Grandpa Loop (Joshua Samuel, 2025), which introduced adversarial multi-agent orchestration with observer-based tuning, and the AI SDLC methodology, which contributed composable stage routing and the scheduled factory model.

If you build something with this pattern, I’d love to hear about it. The topology of disagreement is the interesting part; the specific agents are just one instantiation.

Full source code: github.com/kphatak001/pm-loop

Find me on LinkedIn · GitHub