Mukunda Rao Katta

Posted on May 21

A 3-step agent cost me $4.20. agenttrace showed me the O(n²) tool call hiding in plain sight.

#hermesagent #ai #llm #observability

I ran a small agent. Three steps. One web search, one summarize, one cite-check. I had budgeted maybe 12 cents.

The bill at the end of the run was $4.20.

I knew something was off, but the per-call invoice line items told me nothing useful. They were just a flat list of messages.create calls. I needed to group them by the run that produced them and look at the shape of the cost.

That is the gap agenttrace-rs fills. It is a Rust crate that aggregates LLM calls into runs and gives you cost, latency, and a by-model breakdown.

The breakdown that surfaced the bug

use agenttrace::{Trace, Run};

let mut trace = Trace::new();

let run = trace.start_run("cite-check-agent");

run.record_call(claude_cost::estimate(&req1, &resp1));
run.record_call(claude_cost::estimate(&req2, &resp2));
run.record_call(claude_cost::estimate(&req3, &resp3));
// ... and so on for every tool result/follow-up step

let summary = run.finish();
println!("{}", summary.report());

The report it printed for the $4.20 run:

run: cite-check-agent  duration: 38.4s  total_cost_usd: 4.2031
calls: 11
p50_latency_ms: 2710
p95_latency_ms: 4920

by-model:
  claude-opus-4-7:    9 calls  $4.1880  avg_input_tok: 18,420  avg_output_tok: 540
  claude-haiku-4:     2 calls  $0.0151  avg_input_tok: 1,200   avg_output_tok: 180

by-step:
  step_1_search:       1 call   $0.0184  1,800 in   220 out
  step_2_summarize:    1 call   $0.0312  3,100 in   280 out
  step_3_cite_check:   9 calls  $4.1535  avg 22,400 in   avg 510 out

Step 3 was supposed to be one call. It was nine, averaging 22,400 input tokens each. That is the smoking gun.

What was actually happening

The cite-check step had a tool the model could call to fetch a source URL. When the model called the tool, I appended the tool result to the messages list and re-called messages.create. Standard pattern.

What I missed: every iteration re-attached the full prior history, including the search results from step 1 and the summary from step 2. So call 4 had everything from calls 1-3 in its input, call 5 had everything from calls 1-4, and so on. Input tokens per call grew linearly with the iteration count, so total input tokens over the step grew quadratically.

The model kept calling the tool again because the prompt was structured ambiguously. The loop had no iteration cap, and this run happened to stop after nine tool turns. That is O(n²) input tokens for n iterations.

The fix was small. I stopped re-attaching the full history on each tool turn and used a sliding window. Re-ran the same run cold:

run: cite-check-agent  duration: 11.2s  total_cost_usd: 0.1432
calls: 5
p50_latency_ms: 2200
p95_latency_ms: 3050

by-model:
  claude-opus-4-7:    3 calls  $0.1290
  claude-haiku-4:     2 calls  $0.0142

by-step:
  step_1_search:      1 call   $0.0181
  step_2_summarize:   1 call   $0.0308
  step_3_cite_check:  3 calls  $0.0943

14 cents. About 30x cheaper. I would not have found the bug without the by-step grouping.
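A sliding window over the message list can be sketched like this. It is a minimal illustration, not the exact code from the agent: Message, sliding_window, and demo_history are hypothetical names, and the messages are plain strings rather than real API payloads. The idea is to pin the task prompt and keep only the most recent few messages after it.

```rust
#[derive(Clone, Debug, PartialEq)]
struct Message {
    role: &'static str,
    content: String,
}

/// Keep the first `pinned` messages (for example the task prompt) and only
/// the most recent `window` messages of everything after them.
fn sliding_window(history: &[Message], pinned: usize, window: usize) -> Vec<Message> {
    let pinned = pinned.min(history.len());
    let (head, tail) = history.split_at(pinned);
    let start = tail.len().saturating_sub(window);
    head.iter().chain(&tail[start..]).cloned().collect()
}

/// Hypothetical history: one task prompt followed by call/result pairs.
fn demo_history(tool_turns: usize) -> Vec<Message> {
    let mut h = vec![Message { role: "user", content: "task prompt".to_string() }];
    for i in 1..=tool_turns {
        h.push(Message { role: "assistant", content: format!("tool call {i}") });
        h.push(Message { role: "user", content: format!("tool result {i}") });
    }
    h
}

fn main() {
    let history = demo_history(8); // 1 prompt + 16 tool messages
    // An even window keeps each tool call next to its result here,
    // because the tail is made of call/result pairs.
    let trimmed = sliding_window(&history, 1, 4);
    println!("{} -> {} messages", history.len(), trimmed.len());
    for m in &trimmed {
        println!("{:>9}: {}", m.role, m.content);
    }
}
```

One caveat, as far as I understand the tool-use message format: a tool result has to stay with the tool call that produced it, so a real trim should cut on pair boundaries rather than at an arbitrary message.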

What agenttrace actually does

use agenttrace::{Trace, Tag};

let mut trace = Trace::new();
let run = trace.start_run("my-agent");

run.tag("user_id", "u_8821");
run.tag("step", "search");

// for each LLM call
run.record(agenttrace::CallRecord {
    model: "claude-opus-4-7".into(),
    input_tokens: 1800,
    output_tokens: 220,
    cache_read_tokens: 0,
    cache_write_tokens: 0,
    latency_ms: 2710,
    cost_usd: 0.0184,
    tags: vec![Tag::step("search")],
});

let summary = run.finish();
trace.append(summary);

// serialize all runs
let json = serde_json::to_string(&trace.runs())?;

It is a thin aggregator. It does not call the API. It does not make pricing decisions. You feed it call records (typically computed from claude-cost or your own pricing function) and it composes them into a run with cost, p50/p95, and per-tag breakdowns.
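To show what a per-tag breakdown amounts to, here is a standalone sketch using only the standard library. It does not use agenttrace and is not how the crate is implemented: Call, StepTotals, by_step, and sample_calls are hypothetical names, and the three cite_check records are made-up numbers. The grouping is just a map keyed by the step tag.

```rust
use std::collections::BTreeMap;

/// Hypothetical, trimmed-down call record: one step tag, tokens, cost.
struct Call {
    step: &'static str,
    input_tokens: u64,
    cost_usd: f64,
}

#[derive(Debug, Default)]
struct StepTotals {
    calls: u32,
    input_tokens: u64,
    cost_usd: f64,
}

/// Group call records by step tag and sum calls, input tokens, and cost.
fn by_step(calls: &[Call]) -> BTreeMap<&'static str, StepTotals> {
    let mut out: BTreeMap<&'static str, StepTotals> = BTreeMap::new();
    for c in calls {
        let t = out.entry(c.step).or_default();
        t.calls += 1;
        t.input_tokens += c.input_tokens;
        t.cost_usd += c.cost_usd;
    }
    out
}

/// Sample data; the cite_check rows are invented for illustration.
fn sample_calls() -> Vec<Call> {
    vec![
        Call { step: "search", input_tokens: 1_800, cost_usd: 0.0184 },
        Call { step: "summarize", input_tokens: 3_100, cost_usd: 0.0312 },
        Call { step: "cite_check", input_tokens: 9_000, cost_usd: 0.14 },
        Call { step: "cite_check", input_tokens: 13_000, cost_usd: 0.20 },
        Call { step: "cite_check", input_tokens: 17_000, cost_usd: 0.26 },
    ]
}

fn main() {
    for (step, t) in by_step(&sample_calls()) {
        let avg_in = t.input_tokens / u64::from(t.calls);
        println!("{step:<12} {} calls  ${:.4}  avg {} in", t.calls, t.cost_usd, avg_in);
    }
}
```

A step that shows more calls than expected, with a rising average input, is the pattern to look for.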

Why p95 matters more than mean

avg_latency_ms lies. A five-call run with one slow call (the model thought for 12 seconds, the rest returned in 2) shows a mean of 4 seconds, a latency that no call in the run actually had. The p95 shows the actual tail. For agents this is the number that tells you whether your user-facing experience is going to feel snappy or laggy. agenttrace exposes p50, p95, and p99 by default.
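A quick worked example of the difference, using the nearest-rank percentile definition. That definition is an assumption on my part for illustration; I am not claiming it is the exact method agenttrace uses, and percentile and mean are hypothetical helper names.

```rust
/// Nearest-rank percentile: the smallest sample such that at least
/// `p` percent of all samples are less than or equal to it.
fn percentile(samples: &[u64], p: f64) -> Option<u64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Rank is 1-based; clamp so p = 0 maps to the first sample.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

/// Arithmetic mean of the samples, or None for an empty slice.
fn mean(samples: &[u64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    Some(samples.iter().sum::<u64>() as f64 / samples.len() as f64)
}

fn main() {
    // Four calls at 2 s and one at 12 s, in milliseconds.
    let latencies_ms = [2_000, 2_000, 12_000, 2_000, 2_000];
    println!("mean: {:?}", mean(&latencies_ms));
    println!("p50:  {:?}", percentile(&latencies_ms, 50.0));
    println!("p95:  {:?}", percentile(&latencies_ms, 95.0));
}
```

The mean lands at 4,000 ms, the p50 at 2,000 ms, and the p95 at 12,000 ms, so only the percentile pair describes both the typical call and the slow one.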

Composing with other crates

claude-cost for the per-call cost estimate (cache-aware).
cachebench to see the cache hit ratio across the run.
llm-circuit-breaker to short-circuit a run when an upstream is degraded so you do not pay $4.20 to discover that.

A typical pipeline in our service looks like: cachebench records hit/miss → claude-cost computes cost given hits → agenttrace aggregates into a run summary.

What this does not solve

It does not store traces durably. Trace is in-memory. You serialize to disk or to a remote sink yourself. I do that with a one-line serde_json::to_writer to a sqlite blob.
It does not visualize. There is no UI. You get JSON or text reports. If you want a flamegraph, pipe to your own viewer.
It does not capture the request bodies. Pair with agenttap for that. agenttrace is the cost/latency layer, not the wire layer.
The tagging system is flat. There is no nested-span model. If you need that, OpenTelemetry is the right tool and otel-genai-bridge-rs can translate between conventions.

The crate is about 600 lines of pure Rust. No async lock-in.

Repo: https://github.com/MukundaKatta/agenttrace-rs
crates.io: agenttrace = { package = "agenttrace-rs", version = "0.1" }

Part of a small Rust stack I publish for AI agent plumbing: cost, retry, breakers, repair, trace. Built piece by piece from real incidents.
