You sized your agent for a five-step task. Step three needed a clarification. Step five was a tool retry. By step nine you have a context window the size of a small novel and a per-call cost that has tripled because cache writes keep happening on every turn. The bill for one conversation is no longer "a few cents." It is an order of magnitude over what you priced the feature at, and you only know that because someone happened to read the dashboard.
Someone has already written the postmortem-shaped version of this story (four agents, eleven days, $47K). This post is the version that runs before the postmortem. A hard token budget per conversation. Checked before every model call, it aborts the agent the moment the next request would push it over the line. Eighty lines of Python wrapped around the Anthropic SDK. The unit of accounting is the conversation, not the request, because the conversation is the thing that gets out of control.
## What you actually need to count
Anthropic's `messages.create` returns a `usage` object on every response. As of 2026 it has four fields you care about for cost:

- `input_tokens`: the prompt you just sent that was not read from cache.
- `output_tokens`: what the model generated.
- `cache_creation_input_tokens`: input tokens written into the prompt cache.
- `cache_read_input_tokens`: input tokens served from the cache.
Those four numbers are not interchangeable. Cache reads are cheaper than fresh input, and cache writes are more expensive than fresh input. If you sum them under one `tokens` counter, your USD math will be wrong by a factor that depends on how cache-friendly your prompt structure is. The Anthropic prompt caching docs cover the rates and the field names; check them before you ship because the multipliers move.
```python
from dataclasses import dataclass

@dataclass
class Usage:
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_input_tokens: int = 0
    cache_creation_input_tokens: int = 0

    def add(self, other: "Usage") -> None:
        self.input_tokens += other.input_tokens
        self.output_tokens += other.output_tokens
        self.cache_read_input_tokens += other.cache_read_input_tokens
        self.cache_creation_input_tokens += other.cache_creation_input_tokens
```
That is the running tally for one conversation. Reset it when the conversation ends; never share it across conversations.
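If it helps to see the four classes move separately, here is a two-turn tally with made-up numbers: turn one writes a big system prompt into the cache, turn two reads it back.

```python
tally = Usage()
tally.add(Usage(input_tokens=200, output_tokens=350,
                cache_creation_input_tokens=1_000))  # turn 1: cache write
tally.add(Usage(input_tokens=250, output_tokens=400,
                cache_read_input_tokens=1_000))      # turn 2: cache read
assert tally.cache_creation_input_tokens == 1_000
assert tally.cache_read_input_tokens == 1_000
```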
## The pricing table is a config value
Hard-coding rates works for a prototype. It breaks in production. Pricing changes, models get added, and the multipliers shift. Treat the table as configuration, load it from a file or environment, and put real numbers in only when you have read the Anthropic pricing page that morning.
```python
@dataclass
class ModelRates:
    input_per_m: float
    output_per_m: float
    cache_read_per_m: float
    cache_write_per_m: float

def usd_for(usage: Usage, rates: ModelRates) -> float:
    return (
        usage.input_tokens * rates.input_per_m
        + usage.output_tokens * rates.output_per_m
        + usage.cache_read_input_tokens * rates.cache_read_per_m
        + usage.cache_creation_input_tokens * rates.cache_write_per_m
    ) / 1_000_000
```
One function, one source of truth for cost. Every gate downstream calls `usd_for` and compares to a ceiling. If pricing changes, you change `ModelRates`, not the agent loop.
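A minimal sketch of the loading side, assuming a hand-maintained JSON file. The filename, the env var, and the shape of the file are this post's conventions, not anything the SDK knows about, and the zeros are placeholders, not pricing:

```python
import json
import os

# model_rates.json, maintained by hand against the pricing page:
# {"claude-sonnet-4-5": {"input_per_m": 0.0, "output_per_m": 0.0,
#                        "cache_read_per_m": 0.0, "cache_write_per_m": 0.0}}
def load_rates(model: str) -> ModelRates:
    path = os.environ.get("MODEL_RATES_FILE", "model_rates.json")
    with open(path) as f:
        table = json.load(f)
    return ModelRates(**table[model])
```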
## The pre-call check is the whole point
A budget gate that checks the running total after each call has already overspent. The check has to be predictive: would the next call, in the worst case, push us over? For a Claude call the worst case is `tokens_already_in_messages + max_tokens`, because `max_tokens` is the upper bound on the response length you are about to authorize.
```python
class BudgetExceeded(Exception):
    pass

class TokenBudget:
    def __init__(self, ceiling_usd: float, rates: ModelRates):
        self.ceiling_usd = ceiling_usd
        self.rates = rates
        self.usage = Usage()

    def remaining_usd(self) -> float:
        return self.ceiling_usd - usd_for(self.usage, self.rates)

    def check(
        self,
        prompt_tokens_estimate: int,
        max_tokens: int,
    ) -> None:
        # Worst case prices the whole prompt at the fresh-input rate.
        # Cache writes bill higher than fresh input, so if your prompts
        # are cache-write heavy, pad the estimate.
        worst_case = Usage(
            input_tokens=prompt_tokens_estimate,
            output_tokens=max_tokens,
        )
        worst_usd = usd_for(worst_case, self.rates)
        if worst_usd > self.remaining_usd():
            raise BudgetExceeded(
                f"next call could cost ${worst_usd:.4f}, "
                f"only ${self.remaining_usd():.4f} left"
            )

    def record(self, usage: Usage) -> None:
        self.usage.add(usage)
```
Two surfaces. `check` raises before the API request goes out, and `record` updates the tally after the response comes back. Order matters.
The `prompt_tokens_estimate` is whatever you can compute cheaply. The conservative version is `len(serialised_messages) // 3` (a deliberate over-estimate: JSON-encoded messages have more punctuation than English prose, and you want the gate to err toward refusing the call). The accurate version uses Anthropic's `client.messages.count_tokens` if your latency budget allows the extra round trip. Pick one. If you pick the rough estimate, leave a comment that says so.
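The accurate version looks roughly like this; check the exact signature against the SDK version you pin, since `count_tokens` has moved namespaces between releases:

```python
def accurate_prompt_tokens(client, model, messages, tools) -> int:
    # One extra round trip per step, in exchange for the server-side
    # count of the exact request you are about to send.
    count = client.messages.count_tokens(
        model=model,
        messages=messages,
        tools=tools,
    )
    return count.input_tokens
```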
What "estimating output" actually means
You do not know how long the model's reply will be. You do know `max_tokens`, because you set it. Use it. A model asked for `max_tokens=4096` might return 80 tokens of "no, that table doesn't exist" or 4096 tokens of citations and reasoning. Budget for the upper bound. If that makes the gate too pessimistic, the fix is to lower `max_tokens` for routine turns and raise it only for the final summary turn. Lying to yourself about how long the reply might be is not the fix.
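If you want that split explicit, a two-constant sketch; the cap values are placeholders, not recommendations:

```python
ROUTINE_MAX_TOKENS = 1024  # tool-calling turns: short replies expected
SUMMARY_MAX_TOKENS = 4096  # the one turn that writes the final answer

def max_tokens_for(final_turn: bool) -> int:
    # Budget the bound you actually authorise, per turn type.
    return SUMMARY_MAX_TOKENS if final_turn else ROUTINE_MAX_TOKENS
```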
## The agent loop with the gate wired in
```python
import json

import anthropic

MODEL = "claude-sonnet-4-5"
MAX_STEPS = 10
PER_CONVO_USD = 0.50
PER_CALL_MAX_TOKENS = 1024

client = anthropic.Anthropic()

def run_agent(prompt, tools, dispatch, rates):
    budget = TokenBudget(PER_CONVO_USD, rates)
    messages = [{"role": "user", "content": prompt}]

    for step in range(1, MAX_STEPS + 1):
        # Gate first: raises BudgetExceeded before any money is spent.
        prompt_estimate = _rough_tokens(messages)
        budget.check(prompt_estimate, PER_CALL_MAX_TOKENS)

        resp = client.messages.create(
            model=MODEL,
            max_tokens=PER_CALL_MAX_TOKENS,
            tools=tools,
            messages=messages,
        )
        budget.record(_usage_from(resp))

        if resp.stop_reason == "end_turn":
            return _final_text(resp), budget.usage
        if resp.stop_reason != "tool_use":
            raise RuntimeError(f"unexpected stop: {resp.stop_reason}")

        messages.append({"role": "assistant", "content": resp.content})

        results = []
        for block in resp.content:
            if block.type != "tool_use":
                continue
            out = dispatch[block.name](**block.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(out),
            })
        messages.append({"role": "user", "content": results})

    raise RuntimeError("step cap hit")
```
`budget.check` is the first thing in the loop body. A raise stops the request before it leaves the process; a pass lets the request through, and `budget.record` updates the tally with all four token classes from `resp.usage`.
The helpers are a few lines each:
```python
def _usage_from(resp) -> Usage:
    u = resp.usage
    return Usage(
        input_tokens=u.input_tokens,
        output_tokens=u.output_tokens,
        cache_read_input_tokens=getattr(u, "cache_read_input_tokens", 0) or 0,
        cache_creation_input_tokens=getattr(u, "cache_creation_input_tokens", 0) or 0,
    )

def _rough_tokens(messages) -> int:
    # default=str keeps SDK content blocks serialisable; the estimate
    # only needs to be rough, and rough-high is the safe direction.
    return len(json.dumps(messages, default=str)) // 3

def _final_text(resp) -> str:
    return "".join(
        block.text for block in resp.content if block.type == "text"
    )
```
`getattr` with a default keeps the code working against SDK versions that have not added the cache fields yet, and the `or 0` guards against versions that report them as `None`. Replace with the real SDK attribute access once you pin a version.
## When the cap hits mid-tool-loop
The agent has called two tools, the third response is in flight, the budget gate trips. What do you do?
Three sane policies, pick one and write it down:
- **Hard abort.** Raise `BudgetExceeded` to the caller. The conversation ends, the user gets a "this question was too expensive to answer" message. Best for batch jobs and background work where partial output is worse than no output.
- **Graceful summarise.** Catch `BudgetExceeded` once, reserve a small buffer (say $0.05) for one final non-tool call, ask the model to summarise what it has so far, return that. Best for user-facing chat where a partial answer is still useful; a sketch follows this list.
- **Escalate.** Catch the exception, log everything, ask a human (or a richer-budget agent) whether to extend the cap. Best for support workflows where a human operator can authorise more spend.
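A sketch of the middle policy. It assumes you extend `BudgetExceeded` to carry the partial transcript as `exc.messages` (the class above does not do this yet), and it sizes the wrap-up call to fit the reserved buffer:

```python
def run_with_summary(prompt, tools, dispatch, rates):
    try:
        return run_agent(prompt, tools, dispatch, rates)
    except BudgetExceeded as exc:
        # One final non-tool call, paid for out of the reserved buffer.
        wrap_up = list(exc.messages) + [{
            "role": "user",
            "content": "Stop. Summarise what you have found so far.",
        }]
        resp = client.messages.create(
            model=MODEL,
            max_tokens=512,  # sized to the buffer, not the ceiling
            messages=wrap_up,
        )
        return _final_text(resp), None
```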
Do not silently lower `max_tokens` and retry until the call fits. That hides the symptom and produces truncated output the model has not been told is truncated.
## Three lines you can drop in tonight
If the rest of this post is too much, the smallest version is:
```python
budget.check(prompt_tokens_estimate, max_tokens)
resp = client.messages.create(...)
budget.record(_usage_from(resp))
```
Pre-call check. The call. Post-call record. A conversation that walks past the check has been authorised to spend the money; one that fails the check ends with a clean, named exception and a tally the on-call can read.
A conversation without a gate is one that spends whatever the model decides to spend. The model does not know what your unit economics are. Tell it.
## If this was useful
The *AI Agents Pocket Guide* covers per-conversation budgets, the four token classes Anthropic actually bills against, the cache-write trap, and how to wire a graceful `BudgetExceeded` into a tool-using loop without leaving partial state behind. The chapter on resource gates pairs directly with the code above.
