canonical_url:
I do a lot of deployment authorization docs. That's the PM version of what an SRE would call a launch checklist. It lists the agent...
For further actions, you may consider blocking this person and/or reporting abuse
honestly the cost-ceiling fix in Gap 1 assumes you can estimate token cost in advance, which breaks the moment your agent loops on a multi-turn tool call with variable input length. our wrapper double-counts on those paths and i don't have a clean answer yet - curious how teams handle the variable-length case.
Ran my own agent fleet for a week and the audit you describe - 45 minutes, 3 columns, brutal is the most useful thing I did the whole sprint. My version found 3 gaps, not 5, but two of them were 'kill switch exists but I've never tested it under load' which is functionally the same as no kill switch. The deployment doc updating against an external rubric isn't paperwork. It's the only way you find out which of your assumptions were vibes.
untested kill switch is a plan, not a control - had that exact gap. agent set up to interrupt on error rate spike, switch had never fired in prod. incident dragged 40 min past where it needed to. what was the third one in your version?
The third gap was rollback granularity. Kill switch fires, but rolling back to a known-good state assumed a checkpoint cadence that didn't exist — running my agent fleet for ~a week, only the orchestrator had checkpoints; individual workers were stateless replays. Recovery time wasn't "switch flipped", it was "rebuild context from scratch." Forty minutes was generous on your incident.
yeah the rollback granularity is the part nobody thinks about until the switch actually fires. I had the same assumption baked in - orchestrator has checkpoints, workers will just replay. until you realize the workers dont know where to replay from. have you figured out a way to checkpoint workers without making every write synchronous?
Yeah, two approaches that worked for me: (1) write-ahead log per worker with async flush - orchestrator reads the WAL position on replay, sync only on WAL ack not the actual write. (2) idempotency-keyed writes so replay re-issues with the same key and the data layer dedups. Option 1 is cheaper, option 2 is safer for non-idempotent ops. Both add 100-300ms per write. Still cheaper than a 40-min incident.
WAL approach is right. option 2 breaks when your data layer doesn't do idempotency natively - which most don't.
This hit close to home. The "second reader" insight is spot on most deployment docs pass internal review because everyone's grading against the same blind spots.
Gap 1 (per-action vs per-month cost ceiling) is something I've seen bite teams hard. A monthly budget feels safe until one runaway agent burns through it in an afternoon.
Gap 4 about retries being in the harness not the agent loop that's a lesson most teams learn the expensive way.
Really solid framework, might steal your 3-column grading exercise for our own docs.
👍 👍 👍
the monthly budget is a trap - feels like control until you hit your first runaway loop. per-action ceiling is the only check that actually fires before the damage is done.
The deployment doc framing is the right instinct and I think the networking layer is consistently the missing item. Most checklists cover secrets, kill switches, and cost ceilings but skip how agents actually reach each other in production across cloud, home networks, and edge. NAT traversal, peer authentication, agent discovery — none of it shows up in typical templates. Pilot Protocol (pilotprotocol.network) handles this at the session layer, each agent gets a cryptographic identity and a virtual address that works from any network. Feels like a gap worth adding to the checklist.
the networking layer gap is real - peer auth was the last thing we documented and the first thing that bit us when running agents across staging and prod simultaneously. discovery is the part most templates pretend away: once you have more than 2 agents talking, the assumption they'll find each other via shared config breaks fast in any multi-cloud setup.
The blast radius column is the one that hits hardest here. It is the kind of thing that feels obvious once you name it but never gets documented because everyone assumes someone else is tracking it.
The "Gaps documented in your own voice are gaps you control the conversation about" line is worth saving. That is exactly the difference between a team that surfaces problems early and one that gets surprised in a meeting by a stakeholder who found it first.
The per-action cost ceiling versus monthly budget distinction is also something I have seen trip up teams repeatedly. Monthly budgets create exactly the wrong incentive — they feel managed until they suddenly are not. Per-action ceilings force the conversation about acceptable cost at the decision level, not the accounting level.
Running a similar exercise on our own deployment docs this week. The retry policy gap is where I expect to find the most debt.
the 'everyone assumes someone else' dynamic is real — and I think it happens because blast radius doesn't show up in standard deployment doc templates. it's not a checkbox. you have to reason about it intentionally, which means it only gets surfaced when someone names it as a required column. that naming step is the whole unlock.
The "agent-written retries as an anti-pattern" point deserves more attention than it usually gets. We had a billing-class incident last quarter where an agent decided that 429s from a vendor API meant it should "be more polite and try smaller batches" — which it implemented by recursively splitting and retrying. By the time anyone noticed, it had multiplied the call volume by ~30x against the very rate limit it was trying to respect. The fix wasn't a smarter prompt, it was moving every retry/backoff decision into the harness so the agent literally cannot see the failure as a signal to retry; it only sees a generic "tool unavailable, here is the cached fallback."
The per-action cost ceiling deserves a sibling control: per-action side-effect ceiling. Cost limits stop financial blowups; side-effect limits stop the agent from sending 800 Slack messages or filing 200 PRs because something in its loop went hot. Same shape, different unit.
One thing I'd add to the production-floor review template: a column for "what's the rollback unit if this agent ships a regression?" — if the answer isn't "this agent, alone, behind a feature flag," it isn't really productionized yet.
the recursive splitting makes it worse than a standard retry storm because it adds a second dimension — the agent isn't just burning time, it's burning per-request quota on progressively smaller transactions that your vendor's overhead eats just as hard. most rate-limit docs assume the retry unit is constant. agent-written retries blow that assumption entirely, and that's gap 1 in the article: the cost ceiling has to account for reasoning branches, not just call counts.