Your ReAct agent runs 15 turns. By turn 10, input_tokens is 87K. You're re-sending the entire conversation history every single iteration.
That's not generation cost. That's re-reading cost. And no observability tool shows you the trajectory.
We built a metric for it. Then we built a guard that stops the bleed before it kills your budget. Here's the problem, the math, and the fix.
The invisible cost of agent loops
Here's how a typical ReAct agent works:
Turn 1: system prompt + user query → 1,200 input tokens
Turn 2: + assistant response + tool result → 3,800 input tokens
Turn 5: + three more rounds of think/act/observe → 15,000 input tokens
Turn 10: the entire conversation so far → 87,000 input tokens
Turn 15: approaching the context limit → 122,000 input tokens
Every turn re-sends everything. The system prompt. The user's question. Every assistant response. Every tool result. The LLM has no memory between calls — you're paying to "remind" it what happened.
On GPT-4o ($2.50/M input tokens):
| Turn | Input tokens | Cumulative input cost | New generation cost |
|---|---|---|---|
| 1 | 1,200 | $0.003 | $0.002 |
| 5 | 15,000 | $0.07 | $0.002 |
| 10 | 87,000 | $0.29 | $0.003 |
| 15 | 122,000 | $0.60 | $0.003 |
The generation column barely moves. You're paying $0.60 to re-read context and about $0.03 for the model to actually think. That's roughly 95% overhead.
Switch to Claude Opus ($15/M input) and those numbers are 6x worse: a 15-turn agent run costs about $3.60 in re-reads alone.
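The per-call arithmetic behind that table is just tokens times rate. A quick sketch, with GPT-4o's $2.50/M input price hard-coded:

```typescript
// Re-read cost of a single call: input tokens times the per-token rate.
// $2.50 per million input tokens (GPT-4o's published input rate).
const INPUT_PRICE_PER_M = 2.5;

function inputCostUSD(inputTokens: number): number {
  return (inputTokens / 1_000_000) * INPUT_PRICE_PER_M;
}

// Re-sending 87K tokens of history on one turn costs about 22 cents:
console.log(inputCostUSD(87_000).toFixed(2)); // "0.22"
```

That 22 cents buys you nothing new; it pays for the model to re-read what it already produced.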
Why your dashboard doesn't show this
Open your observability tool. You'll see total tokens per request, cost per request, latency per request. All per-request. All in isolation.
None of these tell you:
- What percentage of the context window is used at each turn
- How fast utilization is growing
- When you'll hit the model's limit
- That 80% of your input tokens are the same conversation sent again
Your dashboard shows snapshots. It doesn't show the trajectory — the runaway growth curve eating your budget across a multi-turn session.
This is the metric that was missing.
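To make "trajectory" concrete: given two consecutive snapshots, you can already estimate when you'll hit the wall. A rough sketch (not a toad-eye API, just linear extrapolation from the last two turns):

```typescript
// Estimate how many more turns fit in the context window, assuming input
// tokens keep growing at the same rate as the last turn-over-turn step.
function turnsUntilLimit(
  prevTokens: number,
  currTokens: number,
  maxContextTokens: number,
): number {
  const growthPerTurn = currTokens - prevTokens;
  if (growthPerTurn <= 0) return Infinity; // flat or shrinking: no ETA
  return Math.floor((maxContextTokens - currTokens) / growthPerTurn);
}

// 72K last turn, 87K now, 128K window: roughly 2 turns of headroom left.
console.log(turnsUntilLimit(72_000, 87_000, 128_000)); // 2
```

Linear extrapolation undershoots when tool results vary in size, but it's enough to alert on.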
Context utilization: one ratio that changes everything
utilization = input_tokens / max_context_tokens
Input tokens divided by the model's maximum context window. A number between 0 and 1.
Turn 1: utilization = 0.01 — plenty of room
Turn 5: utilization = 0.12 — still fine
Turn 10: utilization = 0.68 — growing fast
Turn 13: utilization = 0.85 — danger zone
Turn 15: utilization = 0.95 — one tool result away from hitting the wall
Plot this on a chart and you see the growth curve before it becomes a cost problem. At a glance you know: how close you are to the limit, how fast you're approaching it, and which agent is at risk.
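The ratio in code, as a minimal standalone sketch. The context sizes here are hard-coded for illustration (toad-eye looks them up from its pricing table instead):

```typescript
// Maximum context windows, hard-coded for this example.
const MAX_CONTEXT: Record<string, number> = {
  "gpt-4o": 128_000,
  "claude-opus-4": 200_000,
};

function contextUtilization(inputTokens: number, model: string): number {
  const max = MAX_CONTEXT[model];
  if (!max) throw new Error(`unknown model: ${model}`);
  return inputTokens / max;
}

// Turn 10 from the example run: 87K tokens on a 128K window.
console.log(contextUtilization(87_000, "gpt-4o").toFixed(2)); // "0.68"
```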
How we record it
In toad-eye, context utilization is calculated automatically after every LLM call. If the model is in the pricing table, the metric is emitted — you don't have to do anything:
const pricing = getModelPricing(model);
if (pricing?.maxContextTokens && output.inputTokens > 0) {
  const utilization = output.inputTokens / pricing.maxContextTokens;

  // On the span — queryable in Jaeger
  span.setAttribute("gen_ai.toad_eye.context_utilization", utilization);

  // As a histogram — P95/P99 in Grafana
  recordContextUtilization(utilization, provider, model);
}
The pricing table knows every major model's context window:
"gpt-4o": { maxContextTokens: 128_000 }
"gpt-4.1": { maxContextTokens: 1_047_576 }
"claude-opus-4": { maxContextTokens: 200_000 }
"claude-sonnet-4": { maxContextTokens: 200_000 }
"gemini-2.5-pro": { maxContextTokens: 1_048_576 }
Custom model? Override it:
setCustomPricing({
  "my-finetuned-gpt4": {
    inputPer1M: 3.0,
    outputPer1M: 12.0,
    maxContextTokens: 32_768,
  },
});
Context guard: warn before it's too late
A metric tells you what happened. A guard stops it from happening.
initObservability({
  serviceName: "my-agent",
  contextGuard: {
    warnAt: 0.8,   // console.warn at 80%
    blockAt: 0.95, // throw before the LLM call at 95%
  },
});
At 80%, you get a warning:
toad-eye: context window 82% full for gpt-4o (104,960 / 128,000 tokens)
At 95%, toad-eye throws a ToadContextExceededError — before the call, not after. Your agent catches it and can act:
try {
  const response = await traceLLMCall(input, () => llm.chat(messages));
} catch (err) {
  if (err instanceof ToadContextExceededError) {
    // err.utilization: 0.96
    // err.inputTokens: 122,880
    // err.maxContextTokens: 128,000
    messages = await compressHistory(messages);
    // retry with compressed context
  }
}
The error carries everything you need: current utilization, the threshold, the model, token counts. No guessing.
When the block fires, toad-eye records it in three places: a span event in Jaeger, a counter metric in Grafana, and a warning in your application logs. One event, full visibility.
What to do when utilization is high
The metric and guard tell you there's a problem. Three practical fixes:
Summarize old turns. After N turns, replace the conversation history with an LLM-generated summary. Trade 50K tokens of history for a 2K summary. The agent loses some detail but stays under budget.
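The compressHistory call in the guard example above is your code, not a toad-eye built-in. A minimal sketch of the summarize-old-turns shape, with the summarization call left as a parameter for whatever LLM client you use:

```typescript
type Role = "system" | "user" | "assistant" | "tool";
interface Message { role: Role; content: string }

// Keep the system prompt and the most recent turns; replace everything
// in between with an LLM-generated summary. `summarize` is a placeholder
// for your own summarization call.
async function compressHistory(
  messages: Message[],
  summarize: (text: string) => Promise<string>,
  keepRecent = 4,
): Promise<Message[]> {
  if (messages.length <= keepRecent + 1) return messages; // nothing to compress
  const [systemPrompt, ...rest] = messages;
  const old = rest.slice(0, -keepRecent);
  const recent = rest.slice(-keepRecent);
  const summary = await summarize(
    old.map((m) => `${m.role}: ${m.content}`).join("\n"),
  );
  return [
    systemPrompt,
    { role: "assistant", content: `Summary of earlier turns: ${summary}` },
    ...recent,
  ];
}
```

Keeping the last few turns verbatim matters: the agent usually needs exact tool results from the immediate past, while older turns compress well.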
Compress tool results. Tool results are the biggest token hogs — a web search returning 10K tokens of HTML. Summarize tool results before adding them to context. Or store full results externally and put just a reference in context.
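One sketch of the store-externally pattern, using an in-memory map as the external store (swap in Redis, S3, or a database in practice):

```typescript
// Full tool results live outside the context; only a capped preview plus a
// reference goes into the conversation.
const fullResults = new Map<string, string>();

function capToolResult(result: string, maxChars = 2_000): string {
  if (result.length <= maxChars) return result;
  const id = `toolresult-${fullResults.size + 1}`;
  fullResults.set(id, result); // stash the full payload for later retrieval
  return `${result.slice(0, maxChars)}\n[truncated; full result stored as ${id}]`;
}
```

If the agent later needs the rest, give it a tool that fetches by reference id; most of the time the preview is enough.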
Route to a bigger model. When utilization crosses a threshold, switch models. Running on gpt-4o (128K)? Route to gpt-4.1 (1M) for the final turns:
if (err instanceof ToadContextExceededError && err.model === "gpt-4o") {
  return callWithModel("gpt-4.1", messages);
}
And sometimes the right answer is to stop. If your agent hasn't converged in 10 turns, adding 5 more turns of context won't help — it'll just cost more.
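A turn budget pairs naturally with the context guard. A sketch of the combined stop condition (the 10-turn and 0.95 numbers are the ones used in this article, not universal constants):

```typescript
// Stop when the agent runs out of turns OR the context is nearly full,
// whichever comes first.
function shouldStop(turn: number, utilization: number, maxTurns = 10): boolean {
  return turn >= maxTurns || utilization >= 0.95;
}

console.log(shouldStop(11, 0.4)); // true — out of turns
console.log(shouldStop(5, 0.96)); // true — context nearly full
console.log(shouldStop(5, 0.4));  // false — keep going
```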
Where this came from
After publishing article #3 on OTel semantic conventions, a reader named @jidong left a comment:
"Context window usage per turn matters more than total tokens in agent loops."
They were right. Total tokens is a number. Context utilization is a trajectory. The first tells you what happened. The second tells you what's about to happen.
We built context_utilization the next week. Here's the tracking issue — shaped directly by that comment.
Quick checklist
If you're running agents in production:
- Monitor input_tokens per turn, not just per session
- Calculate context utilization: input_tokens / max_context_tokens
- Alert when P95 utilization crosses 0.7
- Guard at 80% (warn) and 95% (block)
- Have a compression strategy ready before you hit the limit
- Test with 10+ turn runs — the problem only shows up at scale
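To see the checklist in action without burning real tokens, you can simulate a long run. A sketch that replays linear token growth against the 0.8/0.95 thresholds (128K window and 10K/turn growth assumed):

```typescript
// Simulate a multi-turn run and record where warn/block would fire.
function simulateRun(turns: number, perTurnGrowth: number, maxContext = 128_000): string[] {
  const events: string[] = [];
  let tokens = 1_200; // turn-1 input from the example above
  for (let turn = 1; turn <= turns; turn++) {
    const u = tokens / maxContext;
    if (u >= 0.95) { events.push(`block@${turn}`); break; }
    if (u >= 0.8) events.push(`warn@${turn}`);
    tokens += perTurnGrowth;
  }
  return events;
}

console.log(simulateRun(15, 10_000)); // [ 'warn@12', 'warn@13', 'block@14' ]
```

Ten turns in, nothing has fired; the problem only surfaces in the last third of the run, which is exactly why short test runs miss it.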
The metric is simple. The insight it gives you is not.
Previous articles:
- #1: My AI bot burned through my API budget overnight
- #2: I audited my tool, fixed 44 bugs — and it still didn't work
- #3: OpenTelemetry just standardized LLM tracing
- #4: Your LLM streaming traces are lying to you
toad-eye — open-source LLM observability, OTel-native: GitHub · npm
🐸👁️