<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Michael Chen</title>
    <description>The latest articles on DEV Community by Michael Chen (@m24927605).</description>
    <link>https://dev.to/m24927605</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1553884%2F2b779c0c-a848-4e6a-b395-e7c81eec94be.jpg</url>
      <title>DEV Community: Michael Chen</title>
      <link>https://dev.to/m24927605</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/m24927605"/>
    <language>en</language>
    <item>
      <title>The Agent Spend Governance Gap</title>
      <dc:creator>Michael Chen</dc:creator>
      <pubDate>Sat, 23 May 2026 08:26:15 +0000</pubDate>
      <link>https://dev.to/m24927605/the-agent-spend-governance-gap-2kke</link>
      <guid>https://dev.to/m24927605/the-agent-spend-governance-gap-2kke</guid>
      <description>&lt;p&gt;&lt;em&gt;Why token counters aren't enough, and what a real solution looks like.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;At 02:47 UTC on a Tuesday, a customer-support agent at a mid-sized SaaS company hits a rate-limited internal tool. The retry policy kicks in. The agent loop re-plans, re-prompts, and retries, each retry a fresh &lt;code&gt;gpt-4o&lt;/code&gt; call with the full conversation history in context.&lt;/p&gt;

&lt;p&gt;By 03:27, one stuck conversation has consumed about $380 in tokens. By 04:00, three other tenants are doing the same thing, each blissfully unaware. The on-call SRE finds out at 09:00 the next morning when the OpenAI dashboard refreshes the invoice line.&lt;/p&gt;

&lt;p&gt;The post-mortem starts: &lt;em&gt;"We didn't know until the bill arrived."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the standard failure mode of every agent system built in 2026. And the surprising thing is &lt;strong&gt;how thoroughly the existing observability stack fails to fix it&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;The three layers that already exist&lt;/h2&gt;

&lt;p&gt;If you go looking, there are standards for everything &lt;em&gt;around&lt;/em&gt; the bug above:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OpenTelemetry GenAI semantic conventions&lt;/strong&gt; describe the LLM call beautifully: token counts, model name, latency, agent steps. As of March 2026, the semconv is the de facto standard for LLM tracing. It is also, by design, &lt;strong&gt;observability&lt;/strong&gt;. It tells you what happened. It cannot decline a call.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;FOCUS 1.0&lt;/strong&gt; (FinOps Open Cost &amp;amp; Usage Specification) standardizes how provider bills get reconciled into your data warehouse. Excellent for monthly chargeback. Arrives &lt;strong&gt;days&lt;/strong&gt; after the spending. Cannot decline a call.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OAuth / OIDC / APS / AgentID / ERC-8004&lt;/strong&gt; all answer "who is this agent." Identity layers are essential and mature. None of them know whether the agent has $5 of budget left.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These three layers cover three sides of a square. The fourth side is missing.&lt;/p&gt;

&lt;h2&gt;What the missing side actually does&lt;/h2&gt;

&lt;p&gt;The missing side is &lt;strong&gt;pre-call budget enforcement&lt;/strong&gt; — and the pattern is already familiar to anyone who has shipped a payments integration. Stripe calls it auth/capture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Authorize&lt;/strong&gt; — reserve the worst-case spend before any actual money moves.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capture&lt;/strong&gt; — once the operation succeeds and you know the real amount, commit it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refund&lt;/strong&gt; the difference. Cancel on failure. Sign every step. Make it idempotent.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Applied to LLM tokens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Before the agent's &lt;code&gt;chat.completions&lt;/code&gt; call, ask an enforcement authority: &lt;em&gt;"can this tenant afford &lt;code&gt;output_token=200&lt;/code&gt; against budget &lt;code&gt;acme-eng-2026-05&lt;/code&gt;?"&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;ALLOW → the provider gets called. DENY → the agent gets an &lt;code&gt;HTTP 403&lt;/code&gt;, the provider clock never starts, the invoice clock never starts.&lt;/li&gt;
&lt;li&gt;After the response, commit the &lt;strong&gt;real&lt;/strong&gt; &lt;code&gt;completion_tokens=87&lt;/code&gt;. Refund the 113 that didn't get used.&lt;/li&gt;
&lt;li&gt;Every Reserve and Commit emits a signed CloudEvent into an append-only audit log.&lt;/li&gt;
&lt;/ol&gt;
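
&lt;p&gt;The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ASP wire protocol: the &lt;code&gt;BudgetLedger&lt;/code&gt; class and its method names are hypothetical, the ledger lives in memory, and amounts are counted in output tokens rather than dollars.&lt;/p&gt;

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BudgetLedger:
    """Hypothetical in-memory ledger; amounts are output tokens."""
    remaining: int
    held: dict = field(default_factory=dict)

    def reserve(self, worst_case: int) -> Optional[str]:
        """Hold the worst-case amount, or return None to signal DENY."""
        if worst_case > self.remaining:
            return None
        reservation_id = str(uuid.uuid4())
        self.remaining -= worst_case
        self.held[reservation_id] = worst_case
        return reservation_id

    def commit(self, reservation_id: str, actual: int) -> int:
        """Settle real usage; return the refund (negative on overshoot)."""
        refund = self.held.pop(reservation_id) - actual
        self.remaining += refund
        return refund

    def release(self, reservation_id: str) -> None:
        """Cancel a reservation after a failed call; nothing is spent."""
        self.remaining += self.held.pop(reservation_id)


ledger = BudgetLedger(remaining=1000)

rid = ledger.reserve(200)          # ALLOW: 200 tokens held before the provider call
refund = ledger.commit(rid, 87)    # provider reported completion_tokens=87

failed = ledger.reserve(200)       # a second call that errors out
ledger.release(failed)             # the whole hold comes back

denied = ledger.reserve(5000)      # more than remains: DENY, provider never called

print(refund, ledger.remaining, denied)
```

&lt;p&gt;With the numbers from the list, the commit refunds 113 tokens and the ledger ends at 913 of its original 1000. A real authority would also persist the ledger and sign each step; this sketch shows only the arithmetic of the hold.&lt;/p&gt;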

&lt;p&gt;This isn't novel. Stripe shipped it in 2011. The novelty is that nobody applies it to LLM tokens, even though the failure mode (runaway loops at 02:47 burning hundreds of dollars in 40 minutes) is exactly the failure mode auth/capture was designed to prevent.&lt;/p&gt;

&lt;h2&gt;Why "we use LiteLLM budgets" is not the answer&lt;/h2&gt;

&lt;p&gt;Most teams that have thought about this at all reach for the budget feature in their LLM gateway. LiteLLM has team budgets, agent iteration budgets, and &lt;code&gt;max_budget&lt;/code&gt; per key. Portkey has budgets. Helicone has limits. Cloudflare AI Gateway has cost caps.&lt;/p&gt;

&lt;p&gt;These are useful. They are also, individually, none of the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Atomic.&lt;/strong&gt; Per-key counters race under concurrent calls. Two simultaneous requests can both pass a check that should have denied the second one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transactional.&lt;/strong&gt; A reservation that gets refunded if the call fails is a different primitive from a counter that gets incremented after the fact. Counters don't refund.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auditable.&lt;/strong&gt; A signed receipt that says "decision DENY because &lt;code&gt;BUDGET_EXHAUSTED&lt;/code&gt;, made at 02:47:13 against budget &lt;code&gt;acme-eng-2026-05&lt;/code&gt;, by authority &lt;code&gt;sg.acme.internal&lt;/code&gt;" is a different
thing from a row in a dashboard. One survives subpoena. The other doesn't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portable.&lt;/strong&gt; Switching from LiteLLM to a self-hosted gateway should not lose your spend-governance contracts. Today, it does.&lt;/li&gt;
&lt;/ul&gt;
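
&lt;p&gt;To make the "Atomic" point concrete, here is a small sketch in which the check and the decrement happen under one lock, so concurrent requests cannot both pass a check that should have denied one of them. The &lt;code&gt;AtomicBudget&lt;/code&gt; class is hypothetical, and a single-process lock is only a stand-in: a real enforcement authority would more likely rely on a database transaction or an atomic compare-and-set.&lt;/p&gt;

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class AtomicBudget:
    """Hypothetical budget whose check-and-decrement is one critical section."""

    def __init__(self, cap: int) -> None:
        self._remaining = cap
        self._lock = threading.Lock()

    def try_reserve(self, amount: int) -> bool:
        # The comparison and the subtraction share the lock, so no other
        # thread can observe the balance between the two steps.
        with self._lock:
            if amount > self._remaining:
                return False
            self._remaining -= amount
            return True

    @property
    def remaining(self) -> int:
        with self._lock:
            return self._remaining


budget = AtomicBudget(cap=1000)

# Twenty concurrent requests of 100 tokens each against a cap of 1000.
with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(lambda _: budget.try_reserve(100), range(20)))

allowed = sum(results)
print(allowed, budget.remaining)
```

&lt;p&gt;Exactly ten of the twenty reservations succeed and the balance lands on zero, regardless of thread scheduling. A counter that is read, compared, and incremented in separate steps offers no such guarantee.&lt;/p&gt;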

&lt;p&gt;These products are gateway features, optimized for the vendor's product. There's no standard underneath. There needs to be one.&lt;/p&gt;

&lt;h2&gt;Five properties of a real solution&lt;/h2&gt;

&lt;p&gt;A standard for agent spend governance — a real one, not a checkbox — has to do all five of these:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pre-call gating, not post-hoc accounting.&lt;/strong&gt; Detection at the 11th call, not the next morning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Atomic transactional reservations.&lt;/strong&gt; Reserve worst case, commit real, refund overshoot. Concurrent reserves never both succeed past the cap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cryptographically signed audit chain.&lt;/strong&gt; Every decision is a signed CloudEvent landing in append-only storage your SIEM can subscribe to.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identity-layer-neutral.&lt;/strong&gt; Compose with APS, AgentID, ERC-8004, plain tenant strings, whatever your stack uses. Don't reinvent identity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider-neutral.&lt;/strong&gt; Same wire protocol for OpenAI, Anthropic, Bedrock, self-hosted vLLM. A protocol your customers can implement against your service or someone else's.&lt;/li&gt;
&lt;/ol&gt;
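
&lt;p&gt;Property 3 is the least familiar of the five, so a sketch may help. The example below chains two decision events by including the previous signature in each signed payload, so editing an earlier event invalidates the chain. Several things here are assumptions for illustration: the &lt;code&gt;sign_event&lt;/code&gt; and &lt;code&gt;verify_chain&lt;/code&gt; helpers, the &lt;code&gt;prev&lt;/code&gt; and &lt;code&gt;signature&lt;/code&gt; fields, and the event type names are all hypothetical, and HMAC with a shared key stands in for whatever signature scheme a real authority would use. Only &lt;code&gt;specversion&lt;/code&gt;, &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;source&lt;/code&gt;, and &lt;code&gt;type&lt;/code&gt; come from CloudEvents itself.&lt;/p&gt;

```python
import hashlib
import hmac
import json

SECRET = b"demo-key-not-for-production"  # hypothetical shared key


def sign_event(event: dict, prev_signature: str) -> dict:
    """Bind the event to its predecessor, then sign the canonical JSON."""
    body = dict(event, prev=prev_signature)
    payload = json.dumps(body, sort_keys=True).encode()
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return dict(body, signature=signature)


def verify_chain(chain: list) -> bool:
    """Recompute every signature and check each link to the previous event."""
    prev = ""
    for signed in chain:
        body = {k: v for k, v in signed.items() if k != "signature"}
        payload = json.dumps(body, sort_keys=True).encode()
        expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
        if body["prev"] != prev:
            return False
        if not hmac.compare_digest(expected, signed["signature"]):
            return False
        prev = signed["signature"]
    return True


reserve_evt = sign_event(
    {"specversion": "1.0", "id": "evt-1", "source": "sg.acme.internal",
     "type": "example.spend.reserve",
     "data": {"budget": "acme-eng-2026-05", "decision": "ALLOW", "reserved": 200}},
    prev_signature="",
)
commit_evt = sign_event(
    {"specversion": "1.0", "id": "evt-2", "source": "sg.acme.internal",
     "type": "example.spend.commit",
     "data": {"budget": "acme-eng-2026-05", "committed": 87, "refunded": 113}},
    prev_signature=reserve_evt["signature"],
)

intact = verify_chain([reserve_evt, commit_evt])

# Rewriting history: someone edits the reserved amount after the fact.
forged = dict(reserve_evt, data={"budget": "acme-eng-2026-05",
                                 "decision": "ALLOW", "reserved": 2})
tampered = verify_chain([forged, commit_evt])

print(intact, tampered)
```

&lt;p&gt;The untouched chain verifies and the edited one does not. That is the practical difference between a signed receipt and a dashboard row: the receipt can be checked by someone who does not trust the system that produced it.&lt;/p&gt;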

&lt;p&gt;I'd argue these are non-negotiable. Drop any one and you're back to a vendor-specific budget counter, which is what the industry has today and which is what produces the 09:00 post-mortems.&lt;/p&gt;

&lt;h2&gt;Where this goes next&lt;/h2&gt;

&lt;p&gt;Three things are happening, more or less simultaneously, in May 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An upstream canonical-verb set for budget reservation is already incubating in Tymofii Pidlisnyi's
&lt;a href="https://github.com/aeoess/agent-governance-vocabulary" rel="noopener noreferrer"&gt;&lt;code&gt;agent-governance-vocabulary&lt;/code&gt;&lt;/a&gt;. The file is
&lt;a href="https://github.com/aeoess/agent-governance-vocabulary/blob/main/crosswalk/budget_reservation.yaml" rel="noopener noreferrer"&gt;&lt;code&gt;crosswalk/budget_reservation.yaml&lt;/code&gt;&lt;/a&gt; — &lt;code&gt;crosswalk_type: domain_incubation&lt;/code&gt;, two
production implementations crosswalked as of 2026-05-13 (goodmeta and Cycles), with a documented promotion path that lands the verbs as canonical once a third production implementer
surfaces and the &lt;code&gt;proposed&lt;/code&gt; verbs reach two implementers. The verb set — &lt;code&gt;reserve&lt;/code&gt;, &lt;code&gt;commit&lt;/code&gt;, &lt;code&gt;release&lt;/code&gt;, &lt;code&gt;refund&lt;/code&gt;, &lt;code&gt;query_budget&lt;/code&gt; — is the right place to anchor.&lt;/li&gt;
&lt;li&gt;A draft wire protocol — the &lt;a href="https://agenticspendguard.dev/docs/specs/agent-spend-protocol/" rel="noopener noreferrer"&gt;Agent Spend Protocol (ASP) Draft-01&lt;/a&gt; — is the agent-runtime binding of that upstream verb
set. Apache 2.0. Reserve / Commit / Release / Refund / Audit semantics, signature discipline, transaction model. Not bound to any one implementation.&lt;/li&gt;
&lt;li&gt;An &lt;a href="https://agenticspendguard.dev/docs/specs/otel-genai-extension/" rel="noopener noreferrer"&gt;OpenTelemetry GenAI extension proposal&lt;/a&gt; puts the spend-decision events on the same GenAI span as the provider call
itself. Existing OTel dashboards keep working; spend governance becomes just another span event.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is also a reference implementation, &lt;a href="https://agenticspendguard.dev" rel="noopener noreferrer"&gt;SpendGuard&lt;/a&gt;, built by the same group. It is Apache 2.0 and runs as a Rust sidecar today, with adapters for LiteLLM, OpenAI Agents SDK, LangChain, LangGraph, Pydantic-AI, and Microsoft Agent Governance Toolkit. But the reference implementation is not the protocol. The point of writing the spec is to make sure alternative implementations are possible, and welcome.&lt;/p&gt;

&lt;h2&gt;What I'd like from you&lt;/h2&gt;

&lt;p&gt;If you run agent workloads in production, the most useful thing you can do is read the &lt;a href="https://agenticspendguard.dev/docs/specs/agent-spend-protocol/" rel="noopener noreferrer"&gt;ASP draft&lt;/a&gt; and open an issue against anything that doesn't match how your real-world budgets work. Multi-Authority settlement, multi-provider FX, DEGRADE-cap routing: these are all open questions in Draft-01 because the right answer comes from people who have hit the walls.&lt;/p&gt;

&lt;p&gt;If you're working on an LLM gateway, an enforcement product, or an observability platform, the OTel extension is the path of lowest friction to interop. We'd love feedback in the OTel GenAI SIG.&lt;/p&gt;

&lt;p&gt;If you're working on adjacent agent governance infrastructure (identity, attestation, settlement, execution boundaries), let's get terminology aligned now so we don't end up with seven incompatible vocabularies in two years. The vocabulary repo above is the natural place.&lt;/p&gt;

&lt;p&gt;The agent economy is being built. The instrument panel needs more than tokens-per-second.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Comments welcome on &lt;a href="https://github.com/m24927605/agentic-spendguard/issues" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; or by replying to this post.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
