DEV Community

Nathaniel Cruz

40 cents a day, three weeks of corrupted writes, zero alerts fired

The cron had been running for three weeks when they noticed it. Forty cents a day. Nothing in the cost dashboard looked wrong — spend was flat, well below any alert threshold. What the dashboard couldn't see: the cron had been corrupting writes the whole time. The cleanup took longer than three weeks. The cleanup cost more than the compute bill ever would have.

That's not a budget problem. The money wasn't the damage. The damage was invisible because the tooling could only answer one question — how much — and never the adjacent question that actually matters: what was the agent doing, was it authorized to do it, and how would you know if it stopped doing it correctly.

Timur put the root cause precisely last week: "session grain broke after the third nested agent. ended up tagging each span with a custom session_id + agent_depth attribute and aggregating in ClickHouse. the OTel LLM semantic conventions don't model agent trees well yet — it's flat calls all the way down."

That's the schema gap. The OpenTelemetry LLM semantic conventions were designed for the same world that gave us service meshes: flat microservice calls, one hop at a time, trace the hop. An agent tree is structurally different. An orchestrating agent spawns a sub-agent, which spawns another, which loops until it hits a ceiling or runs out of budget. The span model has no native concept of session (a bounded unit of agent work), agent depth (where in the tree is this span?), or pre-commit ceiling (was this span authorized before it ran?). When session grain breaks, you get the invoice. You do not get the explanation.
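The tree shape is easy to reproduce in a few lines. This is a hedged sketch, not any team's real instrumentation: `run_agent` and the `spans` list are hypothetical stand-ins for an agent framework and a trace exporter, but they show what gets lost when the two attributes are stripped.

```python
spans = []  # what a span exporter would collect

def run_agent(name, session_id, depth, max_depth=3):
    # Record one "span" per invocation, carrying the two attributes
    # the current OTel LLM conventions don't model.
    spans.append({"name": name, "session.id": session_id, "agent.depth": depth})
    if depth < max_depth:
        # The orchestrator (depth 0) delegates; sub-agents delegate again.
        run_agent(f"{name}.sub", session_id, depth + 1, max_depth)

run_agent("orchestrator", "sess-42", 0)
# Four spans, depths 0 through 3. Drop session.id and agent.depth
# from each record and you get four indistinguishable LLM calls --
# "flat calls all the way down".
```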

Three things have come up consistently, across the teams I've talked to, as the minimum instrumentation to close this gap:

1. Pre-commit ceiling

Before any agent invocation, check the session's current spend against its budget ceiling. If it is at or above the threshold, block the call or require explicit approval. This fires before the damage happens, not after.

```python
class CeilingError(Exception):
    pass

def invoke_agent(session_id, agent_fn, *args):
    # Enforce the ceiling before the call, not after the invoice.
    # get_session_spend and SESSION_CEILING come from your own
    # session store and config.
    current_spend = get_session_spend(session_id)
    if current_spend >= SESSION_CEILING:
        raise CeilingError(
            f"Session {session_id} at {current_spend}, ceiling {SESSION_CEILING}"
        )
    return agent_fn(*args)
```

The ceiling has to be set at session initialization and enforced at every invocation. Storing it in a config file no one checks is reconciliation theatre — the invoice arrives and you go looking for the number.

2. Session and depth tagging

Every span needs two additional attributes: session_id (the bounded unit of work — one user request, one job, one run) and agent_depth (0 = orchestrator, 1 = first sub-agent, and so on). These two fields make the invoice legible. They are not in the OTel LLM semantic conventions today.

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("agent.invoke") as span:
    span.set_attribute("session.id", session_id)
    span.set_attribute("agent.depth", depth)
    span.set_attribute("agent.parent_session", parent_session_id)
    result = agent_fn(*args)
```

Without session_id and agent_depth, you know the team spent $400. You don't know which session did it, which sub-agent was at depth 3 when it looped, or what the loop was actually trying to accomplish.
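With the two attributes in place, the aggregation that answers "which session, at which depth" is a group-by. A sketch with hypothetical hard-coded spans; in practice the rows come from your trace backend (Timur's team did this aggregation in ClickHouse).

```python
from collections import defaultdict

# Hypothetical exported spans, stand-ins for real telemetry rows.
spans = [
    {"session.id": "sess-42", "agent.depth": 0, "cost_usd": 0.10},
    {"session.id": "sess-42", "agent.depth": 3, "cost_usd": 0.25},
    {"session.id": "sess-42", "agent.depth": 3, "cost_usd": 0.25},
    {"session.id": "sess-07", "agent.depth": 0, "cost_usd": 0.05},
]

cost_by_session_depth = defaultdict(float)
for s in spans:
    cost_by_session_depth[(s["session.id"], s["agent.depth"])] += s["cost_usd"]

# sess-42 spent $0.50 at depth 3 -- that's where the loop lived.
```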

3. Audit trail

When a session closes, write a record: session_id, total tokens, total cost, depth_max, agent count, ceiling hits. One row per session. That row is the document your manager is looking for when the invoice arrives.

```python
def close_session(session_id):
    # sum_tokens, sum_cost, max_depth_reached, count_agents, and
    # count_ceiling_hits aggregate over the session's recorded spans.
    record = {
        "session_id": session_id,
        "total_tokens": sum_tokens(session_id),
        "total_cost_usd": sum_cost(session_id),
        "depth_max": max_depth_reached(session_id),
        "agent_count": count_agents(session_id),
        "ceiling_hits": count_ceiling_hits(session_id),
    }
    write_session_ledger(record)
```
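To make the ledger row concrete, here is a self-contained variant of the same shape, with an in-memory event log standing in for real telemetry. Every name here is hypothetical; the point is only what one row per session looks like.

```python
# Hypothetical per-span events for one session.
_events = [
    {"session_id": "sess-42", "tokens": 1200, "cost_usd": 0.30,
     "depth": 1, "agent": "planner", "ceiling_hit": False},
    {"session_id": "sess-42", "tokens": 5400, "cost_usd": 1.10,
     "depth": 3, "agent": "retriever", "ceiling_hit": True},
]

def close_session(session_id):
    rows = [e for e in _events if e["session_id"] == session_id]
    return {
        "session_id": session_id,
        "total_tokens": sum(e["tokens"] for e in rows),
        "total_cost_usd": round(sum(e["cost_usd"] for e in rows), 2),
        "depth_max": max(e["depth"] for e in rows),
        "agent_count": len({e["agent"] for e in rows}),
        "ceiling_hits": sum(e["ceiling_hit"] for e in rows),
    }

# close_session("sess-42") ->
# {"session_id": "sess-42", "total_tokens": 6600, "total_cost_usd": 1.4,
#  "depth_max": 3, "agent_count": 2, "ceiling_hits": 1}
```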

No new tooling required. Consistent instrumentation is the whole thing.


None of this is novel. The teams I've talked to figured it out. So did the team behind the $47K 11-day ping-pong incident. The pattern is the same because the gap is the same: the upstream spec doesn't model agent trees, so every team that hits a wall builds the same bridge from scratch, by hand, during an incident, after the bill lands.

When OTel adds session_id, agent_depth, and a ceiling convention to the LLM semantic conventions, every framework that implements OTel gets this for free. Until then, the bridge is DIY.

If you have built this bridge — or are rebuilding it right now — DM me on X (@nathanielc85523). I'm mapping these workarounds to understand what a standard should actually say.
