Albert Alov


Your AI agent re-sends 80% of your budget every loop

Your ReAct agent runs 15 turns. By turn 10, input_tokens is 87K. You're re-sending the entire conversation history every single iteration.

That's not generation cost. That's re-reading cost. And no observability tool shows you the trajectory.

We built a metric for it. Then we built a guard that stops the bleed before it kills your budget. Here's the problem, the math, and the fix.


The invisible cost of agent loops

Here's how a typical ReAct agent works:

Turn  1: system prompt + user query                       →    1,200 input tokens
Turn  2: + assistant response + tool result                →    3,800 input tokens
Turn  5: + three more rounds of think/act/observe          →   15,000 input tokens
Turn 10: the entire conversation so far                    →   87,000 input tokens
Turn 15: approaching the context limit                     →  152,000 input tokens

Every turn re-sends everything. The system prompt. The user's question. Every assistant response. Every tool result. The LLM has no memory between calls — you're paying to "remind" it what happened.

On GPT-4o ($2.50/M input tokens):

Turn   Input tokens   Cumulative input cost   New generation cost
  1          1,200                  $0.003                $0.002
  5         15,000                  $0.07                 $0.002
 10         87,000                  $0.29                 $0.003
 15        152,000                  $0.67                 $0.003

The generation column barely moves. You're paying $0.67 to re-read context, and $0.03 for the model to actually think. That's 96% overhead.

Switch to Claude Opus ($15/M input) and those numbers are 6x worse. A 15-turn agent run costs $4 in re-reads alone.
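The blow-up is easy to model. Here's a toy sketch of the loop's token accounting (the 10K tokens added per turn is an assumption, roughly one assistant response plus one tool result):

```typescript
// Toy model of a ReAct loop: every call re-sends the whole history, so the
// input tokens you pay for grow roughly quadratically with turn count.
const SYSTEM_AND_QUERY = 1_200; // tokens in turn 1
const TOKENS_ADDED_PER_TURN = 10_000; // response + tool result (assumed)

function totalInputTokens(turns: number): number {
  let history = SYSTEM_AND_QUERY;
  let paid = 0;
  for (let t = 1; t <= turns; t++) {
    paid += history; // the entire history is re-sent this turn
    history += TOKENS_ADDED_PER_TURN; // then the new turn is appended
  }
  return paid;
}

console.log(totalInputTokens(15)); // 1068000
```

Run it to 15 turns and you've paid for over a million input tokens, even though the largest single context you ever sent is about 141K. That gap is the re-read tax.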

Why your dashboard doesn't show this

Open your observability tool. You'll see total tokens per request, cost per request, latency per request. All per-request. All in isolation.

None of these tell you:

  • What percentage of the context window is used at each turn
  • How fast utilization is growing
  • When you'll hit the model's limit
  • That 80% of your input tokens are the same conversation sent again

Your dashboard shows snapshots. It doesn't show the trajectory — the runaway growth curve eating your budget across a multi-turn session.

This is the metric that was missing.

Context utilization: one ratio that changes everything

utilization = input_tokens / max_context_tokens

Input tokens divided by the model's maximum context window. A number between 0 and 1.

Turn  1: utilization = 0.01   — plenty of room
Turn  5: utilization = 0.12   — still fine
Turn 10: utilization = 0.68   — growing fast
Turn 13: utilization = 0.85   — danger zone
Turn 15: utilization = 0.95   — one tool result away from hitting the wall

Plot this on a chart and you see the growth curve before it becomes a cost problem. At a glance you know: how close you are to the limit, how fast you're approaching it, and which agent is at risk.
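If you want the number before wiring in a library, it's one line in your own loop. A minimal sketch, assuming gpt-4o's documented 128K window and the example turn sizes from above:

```typescript
// Compute utilization after each call and log it per turn, so you get a
// trajectory instead of isolated snapshots.
const MAX_CONTEXT_TOKENS = 128_000; // gpt-4o's context window

function contextUtilization(inputTokens: number): number {
  return inputTokens / MAX_CONTEXT_TOKENS;
}

for (const [turn, inputTokens] of [
  [1, 1_200],
  [10, 87_000],
  [15, 122_000],
] as const) {
  console.log(`turn ${turn}: ${contextUtilization(inputTokens).toFixed(2)}`);
}
// turn 1: 0.01
// turn 10: 0.68
// turn 15: 0.95
```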

How we record it

In toad-eye, context utilization is calculated automatically after every LLM call. If the model is in the pricing table, the metric is emitted — you don't do anything:

const pricing = getModelPricing(model);
if (pricing?.maxContextTokens && output.inputTokens > 0) {
  const utilization = output.inputTokens / pricing.maxContextTokens;

  // On the span — queryable in Jaeger
  span.setAttribute("gen_ai.toad_eye.context_utilization", utilization);

  // As a histogram — P95/P99 in Grafana
  recordContextUtilization(utilization, provider, model);
}

The pricing table knows every major model's context window:

"gpt-4o":           { maxContextTokens: 128_000 }
"gpt-4.1":          { maxContextTokens: 1_047_576 }
"claude-opus-4":    { maxContextTokens: 200_000 }
"claude-sonnet-4":  { maxContextTokens: 200_000 }
"gemini-2.5-pro":   { maxContextTokens: 1_048_576 }

Custom model? Override it:

setCustomPricing({
  "my-finetuned-gpt4": {
    inputPer1M: 3.0,
    outputPer1M: 12.0,
    maxContextTokens: 32_768,
  },
});

Context guard: warn before it's too late

A metric tells you what happened. A guard stops it from happening.

initObservability({
  serviceName: "my-agent",
  contextGuard: {
    warnAt: 0.8,    // console.warn at 80%
    blockAt: 0.95,  // throw before the LLM call at 95%
  },
});

At 80%, you get a warning:

toad-eye: context window 82% full for gpt-4o (104,960 / 128,000 tokens)

At 95%, toad-eye throws a ToadContextExceededError — before the call, not after. Your agent catches it and can act:

try {
  const response = await traceLLMCall(input, () => llm.chat(messages));
} catch (err) {
  if (err instanceof ToadContextExceededError) {
    // err.utilization: 0.96
    // err.inputTokens: 122,880
    // err.maxContextTokens: 128,000
    messages = await compressHistory(messages);
    // retry with compressed context
  }
}

The error carries everything you need: current utilization, the threshold, the model, token counts. No guessing.

When the block fires, toad-eye records it in three places: a span event in Jaeger, a counter metric in Grafana, and a warning in your application logs. One event, full visibility.

What to do when utilization is high

The metric and guard tell you there's a problem. Three practical fixes:

Summarize old turns. After N turns, replace the conversation history with an LLM-generated summary. Trade 50K tokens of history for a 2K summary. The agent loses some detail but stays under budget.
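A minimal sketch of what such a `compressHistory` helper might look like. The `summarize` callback is hypothetical, in practice a call to a cheap model (so async in real code); the message shape is the usual role/content pair:

```typescript
type Role = "system" | "user" | "assistant" | "tool";
type Message = { role: Role; content: string };

// Keep the system prompt and the last KEEP_RECENT messages verbatim,
// and fold everything in between into a single summary message.
const KEEP_RECENT = 4;

function compressHistory(
  messages: Message[],
  summarize: (text: string) => string, // hypothetical: an LLM call in practice
): Message[] {
  if (messages.length <= KEEP_RECENT + 1) return messages; // nothing to fold
  const [system, ...rest] = messages;
  const old = rest.slice(0, -KEEP_RECENT);
  const recent = rest.slice(-KEEP_RECENT);
  const summary = summarize(
    old.map((m) => `${m.role}: ${m.content}`).join("\n"),
  );
  return [
    system,
    { role: "assistant", content: `Summary of earlier turns: ${summary}` },
    ...recent,
  ];
}
```

The agent keeps its most recent working set intact, which is usually where the next decision's relevant detail lives.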

Compress tool results. Tool results are the biggest token hogs — a web search returning 10K tokens of HTML. Summarize tool results before adding them to context. Or store full results externally and put just a reference in context.
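A sketch of the truncate-and-reference approach. The 4-chars-per-token heuristic and the in-memory store are assumptions; production code would use a real tokenizer and durable storage:

```typescript
// Cap tool results before appending them to context. Full payloads go to an
// external store; the context carries only a short head plus a reference.
const MAX_TOOL_RESULT_TOKENS = 500;
const APPROX_CHARS_PER_TOKEN = 4; // rough heuristic for English text

const resultStore = new Map<string, string>(); // stand-in for real storage

function compressToolResult(id: string, raw: string): string {
  const maxChars = MAX_TOOL_RESULT_TOKENS * APPROX_CHARS_PER_TOKEN;
  if (raw.length <= maxChars) return raw;
  resultStore.set(id, raw); // full payload stays retrievable by id
  return `${raw.slice(0, maxChars)}\n[truncated -- full result stored as ${id}]`;
}
```

If the agent later needs the full payload, it can fetch it by id with a dedicated tool instead of carrying 10K tokens of HTML on every subsequent turn.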

Route to a bigger model. When utilization crosses a threshold, switch models. Running on gpt-4o (128K)? Route to gpt-4.1 (1M) for the final turns:

if (err instanceof ToadContextExceededError && err.model === "gpt-4o") {
  return callWithModel("gpt-4.1", messages);
}

And sometimes the right answer is to stop. If your agent hasn't converged in 10 turns, adding 5 more turns of context won't help — it'll just cost more.

Where this came from

After publishing article #3 on OTel semantic conventions, a reader named @jidong left a comment:

"Context window usage per turn matters more than total tokens in agent loops."

They were right. Total tokens is a number. Context utilization is a trajectory. The first tells you what happened. The second tells you what's about to happen.

We built context_utilization the next week. Here's the tracking issue — shaped directly by that comment.


Quick checklist

If you're running agents in production:

  • Monitor input_tokens per turn, not just per session
  • Calculate context utilization: input_tokens / max_context_tokens
  • Alert when P95 utilization crosses 0.7
  • Guard at 80% (warn) and 95% (block)
  • Have a compression strategy ready before you hit the limit
  • Test with 10+ turn runs — the problem only shows up at scale

The metric is simple. The insight it gives you is not.


Previous articles:

toad-eye — open-source LLM observability, OTel-native: GitHub · npm

🐸👁️
