Albert Alov


Your AI agent re-sends 80% of your budget every loop

Your ReAct agent runs 15 turns. By turn 10, input_tokens is 87K. You're re-sending the entire conversation history every single iteration.

That's not generation cost. That's re-reading cost. And no observability tool shows you the trajectory.

We built a metric for it. Then we built a guard that stops the bleed before it kills your budget. Here's the problem, the math, and the fix.


The invisible cost of agent loops

Here's how a typical ReAct agent works:

Turn  1: system prompt + user query                       →    1,200 input tokens
Turn  2: + assistant response + tool result                →    3,800 input tokens
Turn  5: + three more rounds of think/act/observe          →   15,000 input tokens
Turn 10: the entire conversation so far                    →   87,000 input tokens
Turn 15: approaching the context limit                     →  152,000 input tokens

Every turn re-sends everything. The system prompt. The user's question. Every assistant response. Every tool result. The LLM has no memory between calls — you're paying to "remind" it what happened.

On GPT-4o ($2.50/M input tokens):

Turn   Input tokens   Cumulative input cost   New generation cost
  1          1,200                  $0.003                $0.002
  5         15,000                  $0.07                 $0.002
 10         87,000                  $0.29                 $0.003
 15        152,000                  $0.67                 $0.003

The generation column barely moves. You're paying $0.67 to re-read context, and $0.03 for the model to actually think. That's 96% overhead.

Switch to Claude Opus ($15/M input) and those numbers are 6x worse. A 15-turn agent run costs $4 in re-reads alone.
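The blow-up is easy to model. Here's a toy sketch of the loop's token accounting (the 10K tokens added per turn is an assumption, roughly one assistant response plus one tool result):

```typescript
// Toy model of a ReAct loop: every call re-sends the whole history, so the
// input tokens you pay for grow roughly quadratically with turn count.
const SYSTEM_AND_QUERY = 1_200; // tokens in turn 1
const TOKENS_ADDED_PER_TURN = 10_000; // response + tool result (assumed)

function totalInputTokens(turns: number): number {
  let history = SYSTEM_AND_QUERY;
  let paid = 0;
  for (let t = 1; t <= turns; t++) {
    paid += history; // the entire history is re-sent this turn
    history += TOKENS_ADDED_PER_TURN; // then the new turn is appended
  }
  return paid;
}

console.log(totalInputTokens(15)); // 1068000
```

Run it to 15 turns and you've paid for over a million input tokens, even though the largest single context you ever sent is about 141K. That gap is the re-read tax.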

Why your dashboard doesn't show this

Open your observability tool. You'll see total tokens per request, cost per request, latency per request. All per-request. All in isolation.

None of these tell you:

  • What percentage of the context window is used at each turn
  • How fast utilization is growing
  • When you'll hit the model's limit
  • That 80% of your input tokens are the same conversation sent again

Your dashboard shows snapshots. It doesn't show the trajectory — the runaway growth curve eating your budget across a multi-turn session.

This is the metric that was missing.

Context utilization: one ratio that changes everything

utilization = input_tokens / max_context_tokens

Input tokens divided by the model's maximum context window. A number between 0 and 1.

Turn  1: utilization = 0.01   — plenty of room
Turn  5: utilization = 0.12   — still fine
Turn 10: utilization = 0.68   — growing fast
Turn 13: utilization = 0.85   — danger zone
Turn 15: utilization = 0.95   — one tool result away from hitting the wall

Plot this on a chart and you see the growth curve before it becomes a cost problem. At a glance you know: how close you are to the limit, how fast you're approaching it, and which agent is at risk.
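If you want the number before wiring in a library, it's one line in your own loop. A minimal sketch, assuming gpt-4o's documented 128K window and the example turn sizes from above:

```typescript
// Compute utilization after each call and log it per turn, so you get a
// trajectory instead of isolated snapshots.
const MAX_CONTEXT_TOKENS = 128_000; // gpt-4o's context window

function contextUtilization(inputTokens: number): number {
  return inputTokens / MAX_CONTEXT_TOKENS;
}

for (const [turn, inputTokens] of [
  [1, 1_200],
  [10, 87_000],
  [15, 122_000],
] as const) {
  console.log(`turn ${turn}: ${contextUtilization(inputTokens).toFixed(2)}`);
}
// turn 1: 0.01
// turn 10: 0.68
// turn 15: 0.95
```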

How we record it

In toad-eye, context utilization is calculated automatically after every LLM call. If the model is in the pricing table, the metric is emitted — you don't do anything:

const pricing = getModelPricing(model);
if (pricing?.maxContextTokens && output.inputTokens > 0) {
  const utilization = output.inputTokens / pricing.maxContextTokens;

  // On the span — queryable in Jaeger
  span.setAttribute("gen_ai.toad_eye.context_utilization", utilization);

  // As a histogram — P95/P99 in Grafana
  recordContextUtilization(utilization, provider, model);
}

The pricing table knows every major model's context window:

"gpt-4o":           { maxContextTokens: 128_000 }
"gpt-4.1":          { maxContextTokens: 1_047_576 }
"claude-opus-4":    { maxContextTokens: 200_000 }
"claude-sonnet-4":  { maxContextTokens: 200_000 }
"gemini-2.5-pro":   { maxContextTokens: 1_048_576 }

Custom model? Override it:

setCustomPricing({
  "my-finetuned-gpt4": {
    inputPer1M: 3.0,
    outputPer1M: 12.0,
    maxContextTokens: 32_768,
  },
});

Context guard: warn before it's too late

A metric tells you what happened. A guard stops it from happening.

initObservability({
  serviceName: "my-agent",
  contextGuard: {
    warnAt: 0.8,    // console.warn at 80%
    blockAt: 0.95,  // throw before the LLM call at 95%
  },
});

At 80%, you get a warning:

toad-eye: context window 82% full for gpt-4o (104,960 / 128,000 tokens)

At 95%, toad-eye throws a ToadContextExceededError — before the call, not after. Your agent catches it and can act:

try {
  const response = await traceLLMCall(input, () => llm.chat(messages));
} catch (err) {
  if (err instanceof ToadContextExceededError) {
    // err.utilization: 0.96
    // err.inputTokens: 122,880
    // err.maxContextTokens: 128,000
    messages = await compressHistory(messages);
    // retry with compressed context
  }
}

The error carries everything you need: current utilization, the threshold, the model, token counts. No guessing.

When the block fires, toad-eye records it in three places: a span event in Jaeger, a counter metric in Grafana, and a warning in your application logs. One event, full visibility.

What to do when utilization is high

The metric and guard tell you there's a problem. Three practical fixes:

Summarize old turns. After N turns, replace the conversation history with an LLM-generated summary. Trade 50K tokens of history for a 2K summary. The agent loses some detail but stays under budget.
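A minimal sketch of what such a `compressHistory` helper might look like. The `summarize` callback is hypothetical, in practice a call to a cheap model (so async in real code); the message shape is the usual role/content pair:

```typescript
type Role = "system" | "user" | "assistant" | "tool";
type Message = { role: Role; content: string };

// Keep the system prompt and the last KEEP_RECENT messages verbatim,
// and fold everything in between into a single summary message.
const KEEP_RECENT = 4;

function compressHistory(
  messages: Message[],
  summarize: (text: string) => string, // hypothetical: an LLM call in practice
): Message[] {
  if (messages.length <= KEEP_RECENT + 1) return messages; // nothing to fold
  const [system, ...rest] = messages;
  const old = rest.slice(0, -KEEP_RECENT);
  const recent = rest.slice(-KEEP_RECENT);
  const summary = summarize(
    old.map((m) => `${m.role}: ${m.content}`).join("\n"),
  );
  return [
    system,
    { role: "assistant", content: `Summary of earlier turns: ${summary}` },
    ...recent,
  ];
}
```

The agent keeps its most recent working set intact, which is usually where the next decision's relevant detail lives.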

Compress tool results. Tool results are the biggest token hogs — a web search returning 10K tokens of HTML. Summarize tool results before adding them to context. Or store full results externally and put just a reference in context.
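A sketch of the truncate-and-reference approach. The 4-chars-per-token heuristic and the in-memory store are assumptions; production code would use a real tokenizer and durable storage:

```typescript
// Cap tool results before appending them to context. Full payloads go to an
// external store; the context carries only a short head plus a reference.
const MAX_TOOL_RESULT_TOKENS = 500;
const APPROX_CHARS_PER_TOKEN = 4; // rough heuristic for English text

const resultStore = new Map<string, string>(); // stand-in for real storage

function compressToolResult(id: string, raw: string): string {
  const maxChars = MAX_TOOL_RESULT_TOKENS * APPROX_CHARS_PER_TOKEN;
  if (raw.length <= maxChars) return raw;
  resultStore.set(id, raw); // full payload stays retrievable by id
  return `${raw.slice(0, maxChars)}\n[truncated -- full result stored as ${id}]`;
}
```

If the agent later needs the full payload, it can fetch it by id with a dedicated tool instead of carrying 10K tokens of HTML on every subsequent turn.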

Route to a bigger model. When utilization crosses a threshold, switch models. Running on gpt-4o (128K)? Route to gpt-4.1 (1M) for the final turns:

if (err instanceof ToadContextExceededError && err.model === "gpt-4o") {
  return callWithModel("gpt-4.1", messages);
}

And sometimes the right answer is to stop. If your agent hasn't converged in 10 turns, adding 5 more turns of context won't help — it'll just cost more.

Where this came from

After publishing article #3 on OTel semantic conventions, a reader named @jidong left a comment:

"Context window usage per turn matters more than total tokens in agent loops."

They were right. Total tokens is a number. Context utilization is a trajectory. The first tells you what happened. The second tells you what's about to happen.

We built context_utilization the next week. Here's the tracking issue — shaped directly by that comment.


Quick checklist

If you're running agents in production:

  • Monitor input_tokens per turn, not just per session
  • Calculate context utilization: input_tokens / max_context_tokens
  • Alert when P95 utilization crosses 0.7
  • Guard at 80% (warn) and 95% (block)
  • Have a compression strategy ready before you hit the limit
  • Test with 10+ turn runs — the problem only shows up at scale

The metric is simple. The insight it gives you is not.


Previous articles:

toad-eye — open-source LLM observability, OTel-native: GitHub · npm

🐸👁️
