The invoice that surprised everyone
Our team deployed an agent in a small pilot. A few hundred users. The agent ran multiple LLM steps per session: classify the incoming message, retrieve context, generate a reply, then do a quality check pass on the reply. Four model calls per session on average, sometimes more.
The bill at the end of the first month was four times what we had estimated. We went back through the logs. The quality check step was using the most expensive model we had access to, the same one we used for the generation step, because someone had copy-pasted the configuration when they added the quality check and never updated the model name. The quality check was doing simple binary scoring. It did not need a heavy model. A small model would have done the same job at roughly one-tenth the cost. We had made that decision during development, forgotten to verify it was in the config, and never caught it because nobody was tracking per-step costs.
The second thing we found: the context retrieval step was taking three to four seconds on average. End users were experiencing that as noticeable lag between sending a message and getting a reply. We had not noticed because the aggregate p95 latency for the whole agent looked tolerable. The slow step was diluted inside the total. It only became visible when we broke latency down per step.
Both of these were problems that per-step measurement would have surfaced immediately. Getting that measurement set up usually means either reaching for a full observability platform (reasonable for production, but heavy when you are still in development) or scattering print statements through the code (which you will forget to remove and which do not aggregate). agenttrace is the middle option: a thin wrapper around individual LLM calls that groups them into named steps, tallies cost and latency per step, and gives you a report at the end of a run.
Shape of the fix
import { AgentTrace } from "@mukundakatta/agenttrace";
const trace = new AgentTrace();
// Wrap each LLM call with a step name:
const classifyResult = await trace.record("classify", () =>
anthropic.messages.create({
model: "claude-haiku-4-5",
messages: classifyMessages,
max_tokens: 256,
})
);
const summarizeResult = await trace.record("summarize", () =>
anthropic.messages.create({
model: "claude-sonnet-4-6",
messages: summarizeMessages,
max_tokens: 1024,
})
);
// Get the report at the end of the run:
const report = trace.report();
console.log(report);
// {
// totalCostUsd: 0.0042,
// totalCalls: 2,
// p50LatencyMs: 812,
// p95LatencyMs: 1940,
// byStep: {
// classify: { calls: 1, costUsd: 0.0003, p50Ms: 320 },
// summarize: { calls: 1, costUsd: 0.0039, p50Ms: 1304 },
// },
// byModel: {
// "claude-haiku-4-5": { calls: 1, costUsd: 0.0003 },
// "claude-sonnet-4-6": { calls: 1, costUsd: 0.0039 },
// }
// }
trace.record(name, fn) wraps any async function that returns an Anthropic or OpenAI response object. It measures start and end time, extracts token counts from the response's usage field, looks up the per-token price for the model, and stores the result under the step name you provided. The return value of trace.record is the original response object, unchanged. Your downstream code does not need to change.
What it does NOT do
agenttrace is not a distributed tracing system. It does not send data anywhere. It does not integrate with Datadog, Honeycomb, OpenTelemetry, or any other observability backend out of the box. It is in-process only. Every AgentTrace instance accumulates data for the calls that went through it. If your agent runs across multiple processes or machines, each instance is independent. There is no automatic aggregation across instances.
It also does not capture request or response payloads. It captures metadata: step name, model identifier, input token count, output token count, latency in milliseconds, and computed cost. If you need to log the actual prompts and responses for debugging, wire that separately alongside the trace records.
The pricing data is a bundled snapshot. Model prices change. The library ships with a pricing table for common Anthropic and OpenAI models, but if a provider changes their prices after a given version of agentrace ships, you will see stale cost figures unless you update the library or pass custom prices. There is a prices constructor option for overriding built-in prices per model, which is the recommended path for production deployments where cost accuracy matters.
Inside the lib
Each call to trace.record returns the original function's return value unchanged. The wrapper is transparent. The result you get back is exactly what your LLM client returned. Nothing is added to or removed from the response. The trace data is accumulated internally in the AgentTrace instance and only surfaced when you call report() or reset().
The latency percentiles in report() are computed over all calls grouped by step name. If you call a step once, p50 and p95 are the same number, which is the latency of that single call. Percentiles become meaningful when you accumulate multiple calls under the same step name, either within a single agent run that calls a step multiple times, or by reusing the same AgentTrace instance across multiple runs.
The cost calculation uses the token counts from the response's usage object. For Anthropic that is input_tokens and output_tokens. For OpenAI that is prompt_tokens and completion_tokens. The pricing table maps model string identifiers to input and output price per million tokens. For models not in the table, cost is stored as null rather than zero. This distinction matters: null means "we do not have pricing for this model," not "this call was free."
Call trace.reset() to clear accumulated data between runs if you are reusing the same instance across multiple agent sessions. Or create a fresh new AgentTrace() per run. Both work identically.
When useful
- You want to know how much each step in your agent pipeline costs without setting up a full observability stack
- You are profiling agent latency to find which specific step is slow
- You are comparing two agent configurations and want a cost and speed breakdown per step to help decide which to ship
- You want a cost and latency summary in your logs or test output at the end of each agent run
When not useful
- You need distributed tracing that spans multiple services or machines
- You need request and response payload capture for compliance or debugging
- You need real-time cost alerting or dashboards
- Your agent uses a model provider not in the built-in pricing table and you cannot find the per-token prices to pass as a custom override
Install
agenttrace is currently available on GitHub at MukundaKatta/agenttrace. npm publication is in progress.
# Install from GitHub (current):
npm install github:MukundaKatta/agenttrace
# npm (coming soon):
# npm install @mukundakatta/agenttrace
Requires Node 18+. Zero runtime dependencies. Bring your own Anthropic or OpenAI client.
Siblings
| Library | What it does | Registry |
|---|---|---|
| agenttrace-rs | Rust port of the same cost and latency tracking concept | crates.io |
| @mukundakatta/agentsnap | Snapshot tests for agent tool call sequences | npm |
| cachebench | Prompt-cache observability for Anthropic API calls | PyPI |
| llm-budget-window | Time-windowed token and USD budget enforcement | crates.io |
| agent-event-bus | In-process pub/sub for agent lifecycle events | crates.io / PyPI |
What is next
The most immediate gap is OpenTelemetry export. An exportToOtel(exporter) method would let you push trace data to any OTel-compatible backend (Jaeger, Honeycomb, Grafana, etc.) without changing how you record individual calls. This would make agenttrace useful in production environments that already have an observability stack, not just during development where printing a report to the console is enough.
Persistent accumulation across runs is the other piece. Writing trace data to a JSONL file means you can track cost trends over time, or diff cost profiles between two versions of the agent, without standing up a database. That feature is straightforward to add and would make the tool useful for cost regression tracking between deploys.
Top comments (0)