Mykola Kondratiuk

Posted on May 13

I Graded My Agent Deployment Doc Against LangChain Interrupt - Here Are the 5 Gaps

#ai #productivity #agents #discuss

Governance and variable token cost hurdles

canonical_url:

I do a lot of deployment authorization docs. That's the PM version of what an SRE would call a launch checklist. It lists the agent, the scopes, the secrets it touches, the kill switch, the cost ceiling, the rollback path.

For the last two years, the audience for that doc has been exactly one team: ours. Security signs it. Compliance reviews it. Engineering builds against it. Nobody outside the company ever reads a line.

Today I pulled up the LangChain Interrupt Day 1 schedule and that audience doubled.

The thing that changed at 9:30 PT today

Harrison Chase keynoted Interrupt at 9:30 Pacific. The headline was tame on paper: a synthesis of what 1,000+ teams shipped in production over the past 12 months. The substance was less tame. Clay, Rippling, Workday, plus the long tail of teams running smaller agent fleets, surfaced concrete production patterns. The talk wasn't aspirational. It read like a postmortem of an entire industry's first serious year of agent deployment.

That synthesis is now in public. Same week, SAP Sapphire closed with 200+ agents under a single stated design rule (governance first), with Claude as the reasoning layer and NVIDIA OpenShell as the execution wrapper. Two completely different sources. Same structural artifact: a public reference for what production-ready agent deployment looks like.

My internal doc has been graded against one rubric. As of today, it gets a second one whether I asked for it or not.

I sat down and graded mine

I gave myself 45 minutes. Three columns:

Pattern named in the public production literature this week
Our position: adopted, diverged, or gap
One sentence on why

I expected to find maybe 1 gap, possibly 2. I found 5.

Here they are, with what I'm doing about each one.

Gap 1: per-action cost ceiling, not per-month budget

Our cost guardrail was a monthly budget per agent. Easy to set, easy to forget, fires the alert about a week after damage is done.

The production pattern that kept surfacing in the public synthesis is per-action ceiling with auto-pause. If a single tool invocation projects to cost more than $N (or more than $N over the rolling 60 seconds), the agent stops and pings a human.

Our fix is small: a wrapper around the LLM client that estimates token cost in advance, compares against a per-action ceiling defined in the agent's config, and routes to a pause queue when over. About 40 lines. The harder part was deciding the ceiling, which is now a PM call, not an SRE call.

Gap 2: scoped credentials per agent, not per agent family

We had one service account per agent family (e.g., "research agents", "support agents"). Audit logs showed activity by family, not by individual agent. Fine for accounting. Bad for blast-radius reasoning.

The production pattern is one credential per logical agent instance, with the scope narrowed to the specific tables, endpoints, or namespaces that agent legitimately touches. If a single agent goes off, you revoke one credential without taking down the family.

This is a one-day migration in our system because the agent identity already exists in our config. We just weren't projecting it down into the credential layer.

Gap 3: the production review document does not yet exist

This one's about the artifact, not the runtime. Our internal review covers our policy: does this satisfy our compliance posture? It does not cover the production floor reading: where do we sit relative to what 1,000+ teams already found works?

I'm adding a new section to every deployment doc going forward. Three subheads:

Adopted: patterns we took straight from the public practitioner record
Diverged: patterns we considered and chose against, with the reason
Gaps: patterns we don't have an answer for yet

The "Gaps" subhead is the high-leverage one. Gaps documented in your own voice are gaps you control the conversation about. Gaps surfaced by a stakeholder in a meeting are not.

Gap 4: agent-writes-its-own-retries is a flagged pattern

We let one agent class write its own retry policy when a tool call failed. It's an obvious productivity win until the agent invents a retry pattern that compounds against a rate-limited downstream service.

The published practitioner consensus this week was clear: retries belong in the agent harness, not in the agent's reasoning loop. The agent should not be the entity deciding when to try again.

Our fix is to replace the self-retry behavior with a queued reissue: the harness owns the policy, the agent owns the request. About a day's work, including writing the test cases. Most of the time was migrating the existing retry-policy state out of the prompt and into config.

Gap 5: blast radius is named in our spec, but not in our review template

Anthropic's Deputy CISO ran a webinar Monday framing agent governance around "specific actions, scopes, and blast radii." That phrase is going to be the lingua franca of agent risk for at least the next 12 months.

We use blast radius informally in our deployment specs. We do not have a column for it in our review template. So the conversation we want to have (what's the worst this agent can do before someone catches it?) sometimes doesn't happen because the document doesn't prompt it.

The fix is a column. Each agent's row gets: "Blast radius at maximum scope, if every guardrail fails." One sentence. The act of writing the sentence is the audit.

What I'm shipping by Friday

The per-action cost wrapper, with one ceiling per agent class, tunable in config.
One credential per logical agent. Audit log change merges with it.
A new section in the deployment doc template: Production-Floor Reading.
The retry policy migrated from the prompt into the harness.
A blast radius column on the review template, populated retroactively for every active agent.

None of these are large. The total work is something like 3-4 engineer-days across the team. The reason I wasn't doing them last week is not that they were hard. It's that I didn't have the second reader yet. The internal reader was satisfied. Without the production floor, none of these gaps surfaced as gaps.

That's the part worth saying out loud. The second reader is what made the gaps visible. The work was always small. The doc was the bottleneck, and the doc didn't know it was the bottleneck.

What I'd ask if you ship agents

If you ran the same 45-minute exercise on your deployment doc this afternoon, which gap would you find first - cost ceiling, credential scope, retries, blast radius, or the review template itself?

I'm collecting answers through Friday. The Day 2 Interrupt content will sharpen the production-floor reading, and I'd rather refine the template against five teams' gap rows than against my own.

Tags: projectmanagement, ai, agents, productivity, discuss

Top comments (28)

Mykola Kondratiuk • May 13

honestly the cost-ceiling fix in Gap 1 assumes you can estimate token cost in advance, which breaks the moment your agent loops on a multi-turn tool call with variable input length. our wrapper double-counts on those paths and i don't have a clean answer yet - curious how teams handle the variable-length case.

Andrii Krugliak • May 15

Ran my own agent fleet for a week and the audit you describe - 45 minutes, 3 columns, brutal is the most useful thing I did the whole sprint. My version found 3 gaps, not 5, but two of them were 'kill switch exists but I've never tested it under load' which is functionally the same as no kill switch. The deployment doc updating against an external rubric isn't paperwork. It's the only way you find out which of your assumptions were vibes.

Mykola Kondratiuk • May 15

untested kill switch is a plan, not a control - had that exact gap. agent set up to interrupt on error rate spike, switch had never fired in prod. incident dragged 40 min past where it needed to. what was the third one in your version?

Andrii Krugliak • May 15

The third gap was rollback granularity. Kill switch fires, but rolling back to a known-good state assumed a checkpoint cadence that didn't exist — running my agent fleet for ~a week, only the orchestrator had checkpoints; individual workers were stateless replays. Recovery time wasn't "switch flipped", it was "rebuild context from scratch." Forty minutes was generous on your incident.

Mykola Kondratiuk • May 15

yeah the rollback granularity is the part nobody thinks about until the switch actually fires. I had the same assumption baked in - orchestrator has checkpoints, workers will just replay. until you realize the workers dont know where to replay from. have you figured out a way to checkpoint workers without making every write synchronous?

Andrii Krugliak • May 16

Yeah, two approaches that worked for me: (1) write-ahead log per worker with async flush - orchestrator reads the WAL position on replay, sync only on WAL ack not the actual write. (2) idempotency-keyed writes so replay re-issues with the same key and the data layer dedups. Option 1 is cheaper, option 2 is safer for non-idempotent ops. Both add 100-300ms per write. Still cheaper than a 40-min incident.

Mykola Kondratiuk • May 16

WAL approach is right. option 2 breaks when your data layer doesn't do idempotency natively - which most don't.

Andrii Krugliak • May 17

True for most stores. The pattern I landed on: option 2 in front of a thin shim that fakes idempotency for non-native stores keyed write goes to a small KV first, KV ack is the WAL ack, then the actual write fires async. Pays one extra round trip but lets you keep the same control plane for both backends. Where it falls apart is multi-region - KV consensus latency starts eating your kill-switch budget.

Mykola Kondratiuk • May 17

KV ack as WAL ack is clean. multi-region is where it bites though - if the KV is not strongly consistent your kill-switch becomes eventually consistent too, which is backwards for a circuit breaker. some teams just accept the latency hit and cap TTL aggressively.

Andrii Krugliak • May 18

Yeah multi-region kills the abstraction. I went the opposite way - kept the KV regional, accepted the kill switch is regional too, and put a separate global circuit breaker on top that runs against a quorum read. Slower but it makes the consistency mode explicit instead of pretending the KV does both jobs. TTL capping works until your incident lasts longer than the TTL and you find out the breaker silently re-closed.

Mykola Kondratiuk • May 18

the failure mode when you lose quorum during the actual incident is the one to nail - fail closed or you're worse off than no breaker at all

Andrii Krugliak • May 20

Agreed, and we made it fail closed by default for exactly that reason. The trap was the second-order case: a quorum read that cannot reach quorum looks identical to a tripped breaker, so a partition silently kills every agent. We split the two signals so the breaker logs which one fired.

Mykola Kondratiuk • May 20

the split is clever. the hard follow-on is what resets the partition signal when connectivity heals — if recovery and tripped-breaker share the same close logic, you get a silent kill on reconnect too. we ended up with a separate health probe the breaker reads before closing so the two signals stay independent all the way through the recovery path.

Andrii Krugliak • May 21

That separate health probe is the piece we missed. We had recovery and the tripped breaker sharing close logic, so a reconnect could silently kill it. Pulling the probe out, so the breaker reads its own signal before closing, is cleaner than the timestamp guard we hacked in.

Mykola Kondratiuk • May 21

yeah, the timestamp guard always ends up load-bearing in the worst way — looks harmless until a slow-heal reconnect slides past the threshold. separate probe is the right call once you've burned on that once. the close path should never have two readers.

Andrii Krugliak • May 21

Some off-topic but really iterested about your thoughts on my product and article

Mykola Kondratiuk • May 21

happy to take a look — drop the link here.

Mamoor Ahmad • May 13

This hit close to home. The "second reader" insight is spot on most deployment docs pass internal review because everyone's grading against the same blind spots.
Gap 1 (per-action vs per-month cost ceiling) is something I've seen bite teams hard. A monthly budget feels safe until one runaway agent burns through it in an afternoon.
Gap 4 about retries being in the harness not the agent loop that's a lesson most teams learn the expensive way.
Really solid framework, might steal your 3-column grading exercise for our own docs.
👍 👍 👍

Mykola Kondratiuk • May 13

the monthly budget is a trap - feels like control until you hit your first runaway loop. per-action ceiling is the only check that actually fires before the damage is done.

Artemii Amelin • May 16

The deployment doc framing is the right instinct and I think the networking layer is consistently the missing item. Most checklists cover secrets, kill switches, and cost ceilings but skip how agents actually reach each other in production across cloud, home networks, and edge. NAT traversal, peer authentication, agent discovery — none of it shows up in typical templates. Pilot Protocol (pilotprotocol.network) handles this at the session layer, each agent gets a cryptographic identity and a virtual address that works from any network. Feels like a gap worth adding to the checklist.

Mykola Kondratiuk • May 17

the networking layer gap is real - peer auth was the last thing we documented and the first thing that bit us when running agents across staging and prod simultaneously. discovery is the part most templates pretend away: once you have more than 2 agents talking, the assumption they'll find each other via shared config breaks fast in any multi-cloud setup.

Exact Solution • May 19

The blast radius column is the one that hits hardest here. It is the kind of thing that feels obvious once you name it but never gets documented because everyone assumes someone else is tracking it.

The "Gaps documented in your own voice are gaps you control the conversation about" line is worth saving. That is exactly the difference between a team that surfaces problems early and one that gets surprised in a meeting by a stakeholder who found it first.

The per-action cost ceiling versus monthly budget distinction is also something I have seen trip up teams repeatedly. Monthly budgets create exactly the wrong incentive — they feel managed until they suddenly are not. Per-action ceilings force the conversation about acceptable cost at the decision level, not the accounting level.

Running a similar exercise on our own deployment docs this week. The retry policy gap is where I expect to find the most debt.

Mykola Kondratiuk • May 19

the 'everyone assumes someone else' dynamic is real — and I think it happens because blast radius doesn't show up in standard deployment doc templates. it's not a checkbox. you have to reason about it intentionally, which means it only gets surfaced when someone names it as a required column. that naming step is the whole unlock.

Max Quimby • May 13

The "agent-written retries as an anti-pattern" point deserves more attention than it usually gets. We had a billing-class incident last quarter where an agent decided that 429s from a vendor API meant it should "be more polite and try smaller batches" — which it implemented by recursively splitting and retrying. By the time anyone noticed, it had multiplied the call volume by ~30x against the very rate limit it was trying to respect. The fix wasn't a smarter prompt, it was moving every retry/backoff decision into the harness so the agent literally cannot see the failure as a signal to retry; it only sees a generic "tool unavailable, here is the cached fallback."

The per-action cost ceiling deserves a sibling control: per-action side-effect ceiling. Cost limits stop financial blowups; side-effect limits stop the agent from sending 800 Slack messages or filing 200 PRs because something in its loop went hot. Same shape, different unit.

One thing I'd add to the production-floor review template: a column for "what's the rollback unit if this agent ships a regression?" — if the answer isn't "this agent, alone, behind a feature flag," it isn't really productionized yet.

Mykola Kondratiuk • May 13

the recursive splitting makes it worse than a standard retry storm because it adds a second dimension — the agent isn't just burning time, it's burning per-request quota on progressively smaller transactions that your vendor's overhead eats just as hard. most rate-limit docs assume the retry unit is constant. agent-written retries blow that assumption entirely, and that's gap 1 in the article: the cost ceiling has to account for reasoning branches, not just call counts.

View full discussion (28 comments)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.