Mukunda Rao Katta

Posted on May 25

agenttrace-rs: Group LLM Calls into Named Runs and Get Cost Breakdowns in Rust

#hermeschallenge #ai #rust #agents

The $180 mystery

Ran 200 eval cases overnight. Woke up to a $180 bill.

No breakdown by eval category, so no idea which categories drove the cost. The dashboard showed only a total, and the code had no instrumentation beyond a log line per call. To find the expensive category, I would have had to re-run everything with manual logging added.

That is when I wrote agenttrace-rs.

Shape of the fix

Add it to Cargo.toml:

[dependencies]
agenttrace-rs = "0.1"
claude-cost = "0.1"

Wrap your LLM calls with a RunTracer:

use agenttrace_rs::{RunTracer, CallRecord};

let mut tracer = RunTracer::new();

// Before the call, note the start time.
let start = std::time::Instant::now();

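// Make your LLM call however you normally do.
// `my_llm_client` and `prompt` are placeholders for your own client and input;
// `.await?` assumes this runs inside an async fn that returns a Result.
let response = my_llm_client.complete(&prompt).await?;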

// Record it.
tracer.record(CallRecord {
    run_id: "eval:triage:batch-42".to_string(),
    model: "claude-sonnet-4-7".to_string(),
    input_tokens: response.usage.input_tokens,
    output_tokens: response.usage.output_tokens,
    cache_creation_tokens: response.usage.cache_creation_input_tokens.unwrap_or(0),
    cache_read_tokens: response.usage.cache_read_input_tokens.unwrap_or(0),
    latency_ms: start.elapsed().as_millis() as u64,
    cost_usd: None, // or pass a pre-computed value
});

At the end of the run, get a report:

let report = tracer.report("eval:triage:batch-42");

println!("total cost: ${:.4}", report.total_cost_usd);
println!("p50 latency: {}ms", report.p50_latency_ms);
println!("p95 latency: {}ms", report.p95_latency_ms);

for (model, stats) in &report.by_model {
    println!("  {}: {} calls, ${:.4}", model, stats.call_count, stats.cost_usd);
}

Or get reports for all runs at once:

let all_reports = tracer.all_reports();
for (run_id, report) in &all_reports {
    println!("{}: {} calls, ${:.4}", run_id, report.call_count, report.total_cost_usd);
}

What it does NOT do

No HTTP calls. No external service. Everything is in process.
No persistent storage. If your process exits, the data is gone. This is intentional; plug in your own store if you need persistence.
No automatic cost calculation. You can pass cost_usd in the CallRecord, or integrate with claude-cost / bedrock-cost to compute it before recording.
No distributed tracing. This is single-process instrumentation. For distributed systems, emit the records to your existing trace backend.
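Since cost is not computed for you, here is a minimal sketch of the arithmetic you would do before filling in cost_usd. The function name and the per-million-token price parameters are my own placeholders, not part of agenttrace-rs or claude-cost, and the sketch ignores cache tokens for brevity.

```rust
/// Computes a call's cost in USD from token counts and per-million-token
/// prices. The prices are caller-supplied placeholders, not real rates.
/// Hypothetical helper: not part of agenttrace-rs or claude-cost.
fn call_cost_usd(
    input_tokens: u64,
    output_tokens: u64,
    input_usd_per_mtok: f64,
    output_usd_per_mtok: f64,
) -> f64 {
    let per_million = 1_000_000.0;
    // Each side is (tokens / 1M) * price per 1M tokens.
    (input_tokens as f64 / per_million) * input_usd_per_mtok
        + (output_tokens as f64 / per_million) * output_usd_per_mtok
}
```

The result would go into cost_usd as Some(value). In practice a pricing crate is the better source, because it also covers cache reads and writes.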

Inside the lib

The run ID is a caller-provided string. Not auto-generated. Not a UUID.

This was a deliberate call. When the run ID is "eval:triage:batch-42", the report label is immediately readable. You can split on ":" to group by category in post-processing. You can grep logs. You can match it to a CI job name.
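The split-and-group step could look roughly like this. It is a sketch under one assumption: the category is the second colon-separated segment, as in "eval:triage:batch-42". The function names and the (run_id, cost) input shape are hypothetical; in practice you would feed it the pairs you get from your reports.

```rust
use std::collections::BTreeMap;

/// Extracts the category from a run ID shaped like "eval:triage:batch-42".
/// Assumption: the category is the second colon-separated segment.
/// IDs without a second segment fall back to the whole string.
fn category_of(run_id: &str) -> &str {
    run_id.split(':').nth(1).unwrap_or(run_id)
}

/// Sums per-run costs into per-category totals.
/// `runs` is a hypothetical list of (run_id, total_cost_usd) pairs.
fn cost_by_category(runs: &[(&str, f64)]) -> BTreeMap<String, f64> {
    let mut totals = BTreeMap::new();
    for (run_id, cost) in runs {
        // Runs that share a category accumulate into the same entry.
        *totals.entry(category_of(run_id).to_string()).or_insert(0.0) += cost;
    }
    totals
}
```

A BTreeMap keeps the categories sorted by name, which makes the morning report stable from one run to the next.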

If it were a UUID, every analysis step would require a separate metadata lookup to translate the UUID back to something meaningful. That lookup is not always available after the fact.

The tradeoff: you have to pass in something sensible. If two callers accidentally use the same run ID, their records will be merged. That is a real risk. The alternative would be to auto-generate IDs and force you to carry them. For eval pipelines where run identity is already known from the caller context, the string approach wins.
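If merged records worry you, one way to catch the collision early is a small guard in your own code. This is a sketch only: RunIdRegistry is a hypothetical helper, not something agenttrace-rs ships, and it only sees run IDs claimed within a single process.

```rust
use std::collections::HashSet;

/// Hypothetical guard, not part of agenttrace-rs: tracks run IDs that
/// have already been handed out in this process.
#[derive(Default)]
struct RunIdRegistry {
    claimed: HashSet<String>,
}

impl RunIdRegistry {
    /// Returns the ID if it is new, or an error if another caller
    /// already claimed it (which would lead to merged records).
    fn claim(&mut self, run_id: &str) -> Result<String, String> {
        // HashSet::insert returns false when the value was already present.
        if self.claimed.insert(run_id.to_string()) {
            Ok(run_id.to_string())
        } else {
            Err(format!("run ID already in use: {run_id}"))
        }
    }
}
```

Each caller would claim its run ID once at startup and use the returned string for every record, so a duplicate fails loudly before any calls are recorded.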

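The latency percentiles are computed over the full set of records for that run ID. p50 and p95 use a sort-and-index approach: sort the latencies, then index into the sorted list. That costs O(n log n) time per report and keeps every latency in memory, which is fine up to a few thousand calls per run. For much larger runs, you may want reservoir sampling; that is out of scope for v0.1.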
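For reference, a sort-and-index percentile can be sketched as below. This uses the nearest-rank definition, which is my assumption; the crate may round or interpolate differently, so treat it as an illustration of the technique rather than the library's exact code.

```rust
/// Nearest-rank percentile over a set of latencies in milliseconds.
/// Returns None for an empty input. `pct` is expected in 0.0..=100.0.
/// Illustrative only: agenttrace-rs may use a different rank definition.
fn percentile(latencies_ms: &[u64], pct: f64) -> Option<u64> {
    if latencies_ms.is_empty() {
        return None;
    }
    // Sort a copy so the caller's record order is left untouched.
    let mut sorted = latencies_ms.to_vec();
    sorted.sort_unstable();
    // Nearest-rank: rank = ceil(pct / 100 * n), clamped to 1..=n.
    let n = sorted.len();
    let rank = ((pct / 100.0) * n as f64).ceil() as usize;
    let rank = rank.clamp(1, n);
    Some(sorted[rank - 1])
}
```

Calling percentile(&latencies, 50.0) and percentile(&latencies, 95.0) gives the two figures a report would show. The sort is the dominant cost, which is why this approach stops being attractive on very large runs.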

Composing with claude-cost:

use claude_cost::{compute_cost, ModelId};

let cost = compute_cost(
    ModelId::ClaudeSonnet47,
    response.usage.input_tokens,
    response.usage.output_tokens,
    response.usage.cache_creation_input_tokens.unwrap_or(0),
    response.usage.cache_read_input_tokens.unwrap_or(0),
)?;

tracer.record(CallRecord {
    run_id: "eval:triage:batch-42".to_string(),
    model: "claude-sonnet-4-7".to_string(),
    input_tokens: response.usage.input_tokens,
    output_tokens: response.usage.output_tokens,
    cache_creation_tokens: response.usage.cache_creation_input_tokens.unwrap_or(0),
    cache_read_tokens: response.usage.cache_read_input_tokens.unwrap_or(0),
    latency_ms: start.elapsed().as_millis() as u64,
    cost_usd: Some(cost.total_usd),
});

When useful

Eval pipelines where you run many cases and want a cost breakdown by category or batch.
Multi-model experiments where you want to compare cost and latency across model variants.
Agent loops where individual tool calls use LLMs and you want to see which tools are expensive.
Any overnight job where you need a cost report waiting for you in the morning rather than a surprise bill.

When NOT

If you need persistent traces across process restarts, add a serialization layer and write to disk or a database.
If you are already using OpenTelemetry or a distributed tracing backend, emit records there directly. Do not add a second in-process aggregation layer.
If you only run a handful of LLM calls per request and the cost is predictable, this is overkill. Use it when you have many calls and need aggregate visibility.

Install

[dependencies]
agenttrace-rs = "0.1"

Crates.io: agenttrace-rs
GitHub: MukundaKatta/agenttrace-rs

Siblings

Lib	Boundary	Repo
claude-cost	Compute cost from token counts for Anthropic + Bedrock	MukundaKatta/claude-cost
bedrock-cost	Same as claude-cost, for cross-vendor Bedrock pricing	MukundaKatta/bedrock-cost
cachebench	Measure prompt cache hit rates	MukundaKatta/cachebench
llm-budget-window	Time-windowed token/USD budget caps	MukundaKatta/llm-budget-window
token-budget-pool	Shared concurrent token/USD budget pool	MukundaKatta/token-budget-pool

What is next

The main thing missing in v0.1 is a way to export records as structured JSON for offline analysis. Right now you get the RunReport struct in memory. A JSONL writer that appends one record per call would close the loop for persistent overnight jobs.
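Until that lands, a JSONL writer is small enough to sketch by hand. Everything here is an assumption on my part: TraceLine is a hypothetical struct that mirrors some of the CallRecord fields shown earlier (with u64 token counts), and the JSON is formatted manually to keep the sketch dependency-free. With serde_json available you would derive Serialize instead.

```rust
use std::io::{self, Write};

/// Hypothetical record mirroring some CallRecord fields shown earlier;
/// this is not the crate's own type.
struct TraceLine {
    run_id: String,
    model: String,
    input_tokens: u64,
    output_tokens: u64,
    latency_ms: u64,
    cost_usd: Option<f64>,
}

/// Escapes a string for use inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Writes one JSON object per record, one record per line.
fn write_jsonl<W: Write>(mut out: W, records: &[TraceLine]) -> io::Result<()> {
    for r in records {
        // A missing or non-finite cost is written as JSON null.
        let cost = match r.cost_usd {
            Some(c) if c.is_finite() => format!("{c}"),
            _ => "null".to_string(),
        };
        writeln!(
            out,
            "{{\"run_id\":\"{}\",\"model\":\"{}\",\"input_tokens\":{},\"output_tokens\":{},\"latency_ms\":{},\"cost_usd\":{}}}",
            json_escape(&r.run_id),
            json_escape(&r.model),
            r.input_tokens,
            r.output_tokens,
            r.latency_ms,
            cost
        )?;
    }
    Ok(())
}
```

For an overnight job, `out` would be a file opened with std::fs::OpenOptions in append mode, so each batch adds lines instead of overwriting the previous ones.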

A reservoir sampling option for p50/p95 on very large runs would also be worth adding if the eval scale grows past a few thousand calls per run.
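For a sense of what that option might look like, here is a sketch of reservoir sampling (Algorithm R) over latencies. It is not in the crate today. LatencyReservoir and its methods are hypothetical names, and the built-in xorshift generator is only there to keep the sketch dependency-free; a real version would more likely use the rand crate.

```rust
/// Fixed-size uniform sample of latencies using Algorithm R.
/// Hypothetical type: not part of agenttrace-rs v0.1.
struct LatencyReservoir {
    capacity: usize,
    seen: u64,
    samples: Vec<u64>,
    rng_state: u64,
}

impl LatencyReservoir {
    fn new(capacity: usize, seed: u64) -> Self {
        Self {
            capacity,
            seen: 0,
            samples: Vec::with_capacity(capacity),
            // xorshift must not start from a zero state.
            rng_state: seed.max(1),
        }
    }

    /// Small xorshift64 PRNG; fine for a sketch, not for cryptography.
    fn next_u64(&mut self) -> u64 {
        let mut x = self.rng_state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.rng_state = x;
        x
    }

    /// Offers one latency value to the reservoir.
    fn observe(&mut self, latency_ms: u64) {
        self.seen += 1;
        // Fill phase: keep everything until the reservoir is full.
        if self.samples.len() < self.capacity {
            self.samples.push(latency_ms);
            return;
        }
        // Replacement phase: pick j in 0..seen; the new value is kept
        // only if j lands inside the reservoir (probability capacity / seen).
        // The modulo introduces a negligible bias for a sketch like this.
        let j = self.next_u64() % self.seen;
        if j < self.capacity as u64 {
            self.samples[j as usize] = latency_ms;
        }
    }

    /// The current sample, in no particular order.
    fn samples(&self) -> &[u64] {
        &self.samples
    }
}
```

Percentiles would then be computed over samples() instead of the full record set, trading exactness for a fixed memory bound. The estimate gets noisier toward the tail, so p95 needs a larger reservoir than p50 for the same accuracy.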

Part of the Hermes Agent Challenge sprint. All crates shipped on crates.io.
