Mukunda Rao Katta

Posted on May 25

Which Agent Feature Costs the Most? Here's How to Find Out.

#hermeschallenge #ai #python #agents

The $500 Monthly Bill Problem

You know your monthly bill is $500. Your manager wants to know where it goes.

Is it the summarization feature? The search feature? The agent that handles document uploads? The cheap model you use for classification, or the expensive one for reasoning?

Most teams cannot answer this question. They have one API key. They see one total on the billing dashboard. Everything blurs together.

Cost attribution splits that total by feature, by user, by model, and by prompt template. It turns a mystery number into a breakdown you can act on.

This post shows how to do it with three libraries: agenttrace for tagging runs, cachebench for measuring how much caching helps, and llm-cost-cap for pre-flight cost estimation.

The Attribution Model

Attribution has three steps.

Step 1: Tag every run. Before calling the model, attach metadata to the run: which feature triggered it, which user requested it, which prompt template version was used, which model. This is the agenttrace layer.

Step 2: Measure cache effectiveness. Prompt caching can cut the input-token cost of a repeated system prompt by as much as 70-90%, depending on the provider's discount for cached tokens. cachebench measures actual cache hit ratios so you can see whether your caching strategy is working for each feature.

Step 3: Estimate before you commit. For expensive calls, llm-cost-cap estimates the cost from token counts before the call happens. Log the estimates alongside the actuals to find where the estimates are systematically off.
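To make the arithmetic behind a pre-flight estimate concrete, here is a minimal standard-library sketch. It is not the llm-cost-cap implementation; `estimate_cost_usd` and `rough_token_count` are hypothetical helpers, and the prices are the illustrative per-million-token figures used in this post.

```python
# Prices in USD per million tokens (illustrative values from this post).
PRICING = {
    "claude-sonnet-4-6": {"input": 3.0, "output": 15.0},
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate one call's cost from token counts and per-million-token prices."""
    price = PRICING[model]
    input_cost = input_tokens / 1_000_000 * price["input"]
    output_cost = output_tokens / 1_000_000 * price["output"]
    return input_cost + output_cost


def rough_token_count(text: str) -> int:
    """Crude heuristic: about 1.3 tokens per whitespace-separated word."""
    return int(len(text.split()) * 1.3)


# 1,200 input tokens -> $0.0036, 300 output tokens -> $0.0045
cost = estimate_cost_usd("claude-sonnet-4-6", input_tokens=1200, output_tokens=300)
print(f"${cost:.4f}")  # $0.0081
```

The word-count heuristic is only a rough stand-in. If your provider ships a tokenizer or a token-counting endpoint, using it should shrink the gap between estimate and actual.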

Main Code Example

import asyncio
import json
from agenttrace import Tracer, RunRecord
from cachebench import CacheBench, CacheSession
from llm_cost_cap import CostCap, CostEstimate

# One tracer, multiple features share it
tracer = Tracer(
    tag="production",
    output_path="./traces/runs.jsonl",  # append-only JSONL
)

# CacheBench: track cache hit ratio per feature
bench = CacheBench(session_path="./traces/cache.jsonl")

# CostCap: estimate cost before each call, block if over threshold
cost_cap = CostCap(
    model_pricing={
        "claude-sonnet-4-6": {"input": 3.0, "output": 15.0},  # per million tokens
        "gpt-5.4": {"input": 2.5, "output": 10.0},
    },
    hard_cap_usd=0.05,   # refuse any single call estimated over $0.05
    warn_usd=0.02,        # log a warning if estimated over $0.02
)


async def traced_call(
    feature: str,
    user_id: str,
    template_version: str,
    model: str,
    messages: list[dict],
    expected_output_tokens: int = 500,
) -> object:
    """
    Make an LLM call with full cost attribution attached.
    """
    run_id = tracer.start_run(
        metadata={
            "feature": feature,
            "user_id": user_id,
            "template_version": template_version,
            "model": model,
        }
    )

    # Pre-flight cost estimate
    input_token_estimate = sum(
        len(str(m.get("content", "")).split()) * 1.3
        for m in messages
    )
    estimate: CostEstimate = cost_cap.estimate(
        model=model,
        input_tokens=int(input_token_estimate),
        output_tokens=expected_output_tokens,
    )
    tracer.annotate(run_id, estimated_cost_usd=estimate.total_usd)

    if estimate.blocked:
        tracer.end_run(run_id, error="cost_cap_blocked")
        raise RuntimeError(
            f"Call blocked: estimated ${estimate.total_usd:.4f} exceeds cap."
        )

    # CacheBench: open a session to track cache headers
    with bench.session(feature=feature, model=model) as cache_session:
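        # your_llm_client is a placeholder for your own async client wrapper;
        # pass the model through so the call matches the tagged metadata
        response = await your_llm_client(model=model, messages=messages)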

        # Record actual cache headers from the response
        cache_session.record(
            cache_read_tokens=getattr(response.usage, "cache_read_input_tokens", 0),
            cache_write_tokens=getattr(response.usage, "cache_creation_input_tokens", 0),
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
        )

    # Record actual cost in the trace
    run_record: RunRecord = tracer.end_run(
        run_id,
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
        model=model,
    )

    # Log estimate vs actual for drift analysis
    if abs(run_record.cost_usd - estimate.total_usd) / max(estimate.total_usd, 0.0001) > 0.3:
        print(
            f"[COST_DRIFT] feature={feature} "
            f"estimated=${estimate.total_usd:.4f} "
            f"actual=${run_record.cost_usd:.4f}"
        )

    return response


def aggregate_costs(trace_path: str) -> dict:
    """
    Load the trace JSONL and aggregate cost by feature.
    Returns a dict: {feature: {total_usd, run_count, avg_usd_per_run}}.
    """
    by_feature: dict[str, list[float]] = {}

    with open(trace_path) as f:
        for line in f:
            record = json.loads(line)
            feature = record.get("metadata", {}).get("feature", "unknown")
            cost = record.get("cost_usd", 0.0)
            by_feature.setdefault(feature, []).append(cost)

    result = {}
    for feature, costs in by_feature.items():
        result[feature] = {
            "total_usd": sum(costs),
            "run_count": len(costs),
            "avg_usd_per_run": sum(costs) / len(costs),
            "p95_usd": sorted(costs)[int(len(costs) * 0.95)] if costs else 0,
        }

    return result


async def main():
    # Simulate different features making calls
    await traced_call(
        feature="summarization",
        user_id="user-42",
        template_version="v2.1",
        model="claude-sonnet-4-6",
        messages=[{"role": "user", "content": "Summarize this document: ..."}],
    )

    await traced_call(
        feature="search",
        user_id="user-17",
        template_version="v1.0",
        model="gpt-5.4",
        messages=[{"role": "user", "content": "Search for Python async patterns."}],
    )

    # After a period, analyze
    costs = aggregate_costs("./traces/runs.jsonl")
    for feature, stats in sorted(costs.items(), key=lambda x: -x[1]["total_usd"]):
        print(
            f"{feature:20s}  "
            f"total=${stats['total_usd']:.4f}  "
            f"runs={stats['run_count']}  "
            f"avg=${stats['avg_usd_per_run']:.4f}  "
            f"p95=${stats['p95_usd']:.4f}"
        )

    # Cache effectiveness by feature
    cache_report = bench.report()
    for feature, metrics in cache_report.items():
        savings = metrics["cache_savings_usd"]
        hit_ratio = metrics["hit_ratio"]
        print(f"Cache [{feature}]: hit={hit_ratio:.1%}, saved=${savings:.4f}")


if __name__ == "__main__":
    asyncio.run(main())

The output after a day of traffic:

summarization         total=$1.2340  runs=312  avg=$0.0040  p95=$0.0089
search                total=$0.8820  runs=441  avg=$0.0020  p95=$0.0031
document-upload       total=$0.3100  runs=44   avg=$0.0070  p95=$0.0190

Cache [summarization]: hit=82.0%, saved=$0.3210
Cache [search]: hit=14.0%, saved=$0.0041

Now you know that summarization costs about 40% more in total than search ($1.23 vs. $0.88), while search gets far less out of caching (a 14% hit ratio vs. 82%). Those point to two different optimizations: reduce the summarization run count, and restructure the search prompt so that more of it sits in a shared, cacheable prefix.
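The aggregation above groups by feature only, but the same records can be sliced by any tagged dimension. The sketch below is one way to do that with the standard library; `total_cost_by` is a hypothetical helper, and it assumes each record has the `metadata` and `cost_usd` keys that `aggregate_costs` reads. The sample records are made up.

```python
from collections import defaultdict


def total_cost_by(records: list[dict], dimension: str) -> dict[str, float]:
    """Sum cost_usd per value of one metadata key (feature, user_id, model, ...)."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        key = record.get("metadata", {}).get(dimension, "unknown")
        totals[key] += record.get("cost_usd", 0.0)
    return dict(totals)


# Made-up records in the assumed trace shape.
records = [
    {"metadata": {"feature": "summarization", "user_id": "user-42",
                  "model": "claude-sonnet-4-6"}, "cost_usd": 0.004},
    {"metadata": {"feature": "search", "user_id": "user-17",
                  "model": "gpt-5.4"}, "cost_usd": 0.002},
    {"metadata": {"feature": "search", "user_id": "user-42",
                  "model": "gpt-5.4"}, "cost_usd": 0.003},
]

for dimension in ("feature", "user_id", "model", "template_version"):
    print(dimension, total_cost_by(records, dimension))
```

Records that lack a given tag fall into an "unknown" bucket, which doubles as a quick check for call sites that forgot to tag.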

What This Does NOT Do

This does not give you real-time dashboards. The trace output is a JSONL file. You analyze it after the fact. For real-time dashboards, you need to push these records to a time-series database or an analytics platform.

It does not split cost by line of code. If your summarization feature calls the model three times per user request, all three calls are tagged with the same feature name. The split is at the feature level, not at the code-path level.

It does not account for embedding costs. If you use a separate embedding model for retrieval, that cost is not tracked here unless you add a separate tracer call for it.

The cost cap does not replace your provider's hard limit. It works from a pre-flight estimate, so if the estimate is 30% low, the actual call can cost more than the cap was meant to allow. Set your cap conservatively.
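One way to set the cap conservatively is to inflate the estimate by a safety margin before comparing it to the cap. This is a sketch of the idea, not a feature of llm-cost-cap; `should_block` and its `safety_margin` parameter are hypothetical.

```python
def should_block(estimated_usd: float, cap_usd: float, safety_margin: float = 0.3) -> bool:
    """Block when the estimate, padded by a safety margin, exceeds the cap."""
    padded_usd = estimated_usd * (1 + safety_margin)
    return padded_usd > cap_usd


# $0.040 padded by 30% is $0.052, which is over a $0.05 cap.
print(should_block(0.040, cap_usd=0.05))  # True
# $0.030 padded by 30% is $0.039, which is under the cap.
print(should_block(0.030, cap_usd=0.05))  # False
```

A reasonable starting margin is the largest drift you have observed between estimated and actual cost for that feature.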

Design Reasoning

Tag at run creation time, not at analysis time. Some teams add metadata to their logs after the fact using request IDs matched across services. That is fragile. If you forget to log the feature name at call time, you cannot reconstruct it later. agenttrace takes the metadata at start_run and writes it to every record for that run.

Keep estimate vs actual in the same record. Comparing estimate drift tells you whether your token counting strategy is reliable. If your estimates are consistently 50% low for a specific feature, something about that feature's prompt structure is different from what your estimator assumes.
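A drift report per feature could look roughly like this. It is a sketch under one assumption worth checking against your own trace files: that each record carries `estimated_cost_usd` and `cost_usd` as top-level keys. `drift_by_feature` is a hypothetical helper and the records are made up.

```python
from collections import defaultdict
from statistics import mean


def drift_by_feature(records: list[dict]) -> dict[str, float]:
    """Mean of actual / estimated cost per feature; 1.0 means estimates are on target."""
    ratios: dict[str, list[float]] = defaultdict(list)
    for record in records:
        estimated = record.get("estimated_cost_usd")
        actual = record.get("cost_usd")
        if not estimated or actual is None:
            continue  # skip runs with no usable estimate (e.g. blocked or untagged)
        feature = record.get("metadata", {}).get("feature", "unknown")
        ratios[feature].append(actual / estimated)
    return {feature: mean(values) for feature, values in ratios.items()}


records = [
    {"metadata": {"feature": "summarization"}, "estimated_cost_usd": 0.004, "cost_usd": 0.0040},
    {"metadata": {"feature": "summarization"}, "estimated_cost_usd": 0.004, "cost_usd": 0.0044},
    {"metadata": {"feature": "search"}, "estimated_cost_usd": 0.002, "cost_usd": 0.004},
    {"metadata": {"feature": "search"}, "estimated_cost_usd": 0.001, "cost_usd": 0.002},
    {"metadata": {"feature": "search"}, "cost_usd": 0.003},  # no estimate recorded
]

drift = drift_by_feature(records)
for feature, ratio in drift.items():
    print(f"{feature}: actual is {ratio:.2f}x the estimate")
```

In this made-up data, search comes out at 2.0x, which is the "estimates are 50% low" case: the estimator is seeing only half of what the feature actually spends.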

Cache hit ratio per feature is not the same as cache hit ratio overall. A feature with a large, stable system prompt will have high hit ratios. A feature that changes the system prompt per request will have near-zero hit ratios. You need per-feature breakdown to see this.
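A small worked example shows how an overall ratio can hide a weak feature. The numbers are made up to mirror the 82% and 14% figures above, and the definition used here (cached prompt tokens divided by all prompt tokens) is an assumption; cachebench may define its hit ratio differently.

```python
def hit_ratio(cached_tokens: int, prompt_tokens: int) -> float:
    """Share of prompt tokens that were served from the cache."""
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0


# Hypothetical daily token counts per feature.
usage = {
    "summarization": {"cached_tokens": 820_000, "prompt_tokens": 1_000_000},
    "search": {"cached_tokens": 42_000, "prompt_tokens": 300_000},
}

per_feature = {feature: hit_ratio(**counts) for feature, counts in usage.items()}
overall = hit_ratio(
    sum(counts["cached_tokens"] for counts in usage.values()),
    sum(counts["prompt_tokens"] for counts in usage.values()),
)

for feature, ratio in per_feature.items():
    print(f"{feature}: {ratio:.1%}")
print(f"overall: {overall:.1%}")  # overall: 66.3%
```

The overall figure of roughly 66% looks healthy, yet search is at 14%. Because the overall ratio is weighted by token volume, the feature with the most traffic dominates it.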

When This Applies

Any agent or LLM-backed service with more than one feature or user type. The moment you have two features, you need attribution. Otherwise you cannot optimize.

Multi-tenant SaaS where you charge users based on usage. You need per-user cost data to calculate your margin and set pricing. agenttrace metadata tagging gives you the raw data.

Teams that want to make a cost vs. quality tradeoff decision. "If we downgrade search to a cheaper model, what do we save?" Attribution tells you what search currently costs. You can estimate the savings from a model swap before making the change.
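A back-of-the-envelope model swap estimate might look like the sketch below. It assumes each trace record stores `input_tokens` and `output_tokens` at the top level, and it assumes the cheaper model would use the same token counts, which is only approximately true because tokenizers differ. "small-model" and its prices are hypothetical, as are the helper names.

```python
# Prices in USD per million tokens. "small-model" and its prices are hypothetical.
PRICING = {
    "gpt-5.4": {"input": 2.5, "output": 10.0},
    "small-model": {"input": 0.5, "output": 2.0},
}


def call_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICING[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def projected_savings(records: list[dict], feature: str, new_model: str) -> float:
    """Current cost of a feature minus its cost if every call used new_model."""
    current = projected = 0.0
    for record in records:
        meta = record.get("metadata", {})
        if meta.get("feature") != feature:
            continue
        tokens = (record["input_tokens"], record["output_tokens"])
        current += call_cost_usd(meta["model"], *tokens)
        projected += call_cost_usd(new_model, *tokens)
    return current - projected


records = [
    {"metadata": {"feature": "search", "model": "gpt-5.4"},
     "input_tokens": 400, "output_tokens": 100},
    {"metadata": {"feature": "search", "model": "gpt-5.4"},
     "input_tokens": 600, "output_tokens": 100},
    {"metadata": {"feature": "summarization", "model": "gpt-5.4"},
     "input_tokens": 5000, "output_tokens": 800},
]

savings = projected_savings(records, feature="search", new_model="small-model")
print(f"${savings:.4f}")  # $0.0036
```

This only answers the cost half of the tradeoff. The quality half still needs an evaluation of the cheaper model on real search traffic.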

This does NOT apply to single-feature, single-model setups where there is nothing to attribute against. If everything runs on one model for one purpose, the total bill is the attribution. You do not need extra infrastructure.

Quick-Start Snippet

pip install agenttrace cachebench llm-cost-cap

from agenttrace import Tracer

tracer = Tracer(tag="my-service", output_path="./traces/runs.jsonl")

run_id = tracer.start_run(metadata={"feature": "summarization", "user_id": "u-123"})
# ... make your LLM call ...
tracer.end_run(run_id, input_tokens=1200, output_tokens=300, model="claude-sonnet-4-6")

That is the minimal attribution setup. Add cachebench and llm-cost-cap when you need cache metrics and pre-flight guards.

Siblings

| Library | What it does | When to reach for it |
| --- | --- | --- |
| `agenttrace` | Tag and record per-run cost, tokens, latency | Any production agent with multiple features |
| `cachebench` | Measure cache hit ratio and savings per feature | When you use prompt caching and want to prove ROI |
| `llm-cost-cap` | Estimate and block calls over a cost threshold | Stop runaway single calls before they happen |
| `token-budget-py` | Session-level token cap | Hard ceiling on total session spend |
| `claude-cost` | Model pricing calculator | Feed accurate prices into cost_cap and tracer |
| `agentsnap` | Per-call tool snapshot | Pair with agenttrace for tool-level cost breakdown |

What's Next

The next analysis layer is outlier detection. Most runs cluster around the average cost. A small number of outliers can account for a large share of total spend. Sort the JSONL records by cost_usd descending. Inspect the top 10. Usually, outliers share a structural cause: a prompt that grew too large, a tool that returned too much data, or a loop that ran more turns than expected.
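The sort-and-inspect step can be sketched in a few lines. This assumes the JSONL records carry a `cost_usd` key, as `aggregate_costs` does; `load_records` and `top_outliers` are hypothetical helpers, and the sample data is made up.

```python
import json


def load_records(trace_path: str) -> list[dict]:
    """Read one JSON object per non-empty line from a JSONL trace file."""
    with open(trace_path) as f:
        return [json.loads(line) for line in f if line.strip()]


def top_outliers(records: list[dict], n: int = 10) -> tuple[list[dict], float]:
    """Return the n most expensive runs and their share of total spend."""
    ranked = sorted(records, key=lambda r: r.get("cost_usd", 0.0), reverse=True)
    top = ranked[:n]
    total = sum(r.get("cost_usd", 0.0) for r in records)
    top_total = sum(r.get("cost_usd", 0.0) for r in top)
    share = top_total / total if total else 0.0
    return top, share


# Made-up data: 98 typical runs and 2 expensive ones.
records = [{"run": i, "cost_usd": 0.002} for i in range(98)]
records += [{"run": 98, "cost_usd": 0.05}, {"run": 99, "cost_usd": 0.05}]

top, share = top_outliers(records, n=2)
print(f"top {len(top)} runs account for {share:.0%} of spend")
```

In this example, 2% of the runs account for about a third of the spend, which is the pattern worth looking for in real traces.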

For teams that need real-time cost tracking, emit a lightweight event for each run using agent-event-bus. Subscribe to the event stream from a Prometheus exporter or a Datadog custom metric. The trace file stays as the archival record. The event stream feeds dashboards.

For per-user billing in a SaaS product, the user_id in metadata is the key. Aggregate monthly totals per user. Compare to your pricing tiers. If a user in your "Starter" tier consistently costs more than they pay, you have a pricing problem that attribution makes visible.
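Once monthly totals per user exist, the comparison against pricing tiers is a short filter. The sketch below assumes those totals have already been aggregated from the trace by `user_id`; the tier names, prices, and the `unprofitable_users` helper are all hypothetical.

```python
# Hypothetical monthly price per tier, in USD.
TIER_PRICE_USD = {"starter": 10.0, "pro": 50.0}


def unprofitable_users(
    monthly_cost_by_user: dict[str, float],
    tier_by_user: dict[str, str],
) -> list[str]:
    """Users whose monthly LLM cost exceeds what their tier pays."""
    return sorted(
        user
        for user, cost in monthly_cost_by_user.items()
        if cost > TIER_PRICE_USD[tier_by_user[user]]
    )


# Made-up monthly totals and tier assignments.
monthly_cost_by_user = {"u-123": 14.20, "u-456": 3.10, "u-789": 22.00}
tier_by_user = {"u-123": "starter", "u-456": "starter", "u-789": "pro"}

print(unprofitable_users(monthly_cost_by_user, tier_by_user))  # ['u-123']
```

LLM spend is only one part of the cost of serving a user, so a real margin calculation would add infrastructure and support costs on top.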

DEV Community