DEV Community

George Belsky

Your AI Agent Spent $500 Overnight and Nobody Noticed

Friday 5 PM. You deploy a research agent that processes customer tickets. It calls GPT-4 for each one. Expected load: 200 tickets a day, about $8 in API costs.

Friday 11 PM. A bug in ticket deduplication. The agent reprocesses the same tickets in a loop. Each iteration makes 4 LLM calls at $0.03 each, and the loop runs about 70 times per hour: roughly $8.40 an hour, every hour.

Sunday 3 AM. The agent has made nearly 8,000 LLM calls. Cost so far: $235. Nobody is watching.

Monday 9 AM. An OpenAI billing alert fires at the $500 threshold you set months ago. Total damage: $487. No logs showing which agent caused it, which task triggered the loop, or when it started.

This is not hypothetical. Every team running AI agents in production has a version of this story.

Why Standard Monitoring Doesn't Help

OpenAI gives you total organization spend. Not per-agent. Not per-task. Not in real time.

If you have 5 agents calling GPT-4, and one goes haywire, your OpenAI dashboard shows a line going up. Which agent? You don't know. Which task caused the spike? You don't know. When did it start? You can guess from the slope of the graph.

Infrastructure monitoring tools (Datadog, Grafana) track CPU and memory. They don't know about LLM tokens. You could instrument it yourself - custom metrics, Prometheus counters, StatsD gauges - but now you're building a cost monitoring system instead of building your product.

Billing alerts are too late and too coarse. A $500 alert tells you the money is already gone. A per-API-key alert doesn't map to individual agents.

What you actually need:

  • Cost per agent, in real time
  • Cost per task (not just per agent)
  • Budget limits that actually stop the agent
  • Alerts before the damage is done

Tracking Cost Through the Agent Heartbeat

AXME agents send heartbeats every 30 seconds - standard health reporting. The insight is that cost is just another metric in that heartbeat.

import os

import openai
from axme import AxmeClient, AxmeClientConfig

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))

def call_llm(prompt: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )

    # Calculate cost from token usage (GPT-4: $0.03/1K input, $0.06/1K output)
    tokens_in = response.usage.prompt_tokens
    tokens_out = response.usage.completion_tokens
    cost_usd = (tokens_in * 0.03 + tokens_out * 0.06) / 1000

    # Report cost alongside the regular heartbeat.
    # current_intent_id is the ID of the intent this agent is currently processing.
    client.mesh.report_metric(
        agent="agent://myorg/production/research-agent",
        intent_id=current_intent_id,
        cost_usd=cost_usd,
        metadata={"model": "gpt-4", "tokens_in": tokens_in, "tokens_out": tokens_out},
    )

    return response.choices[0].message.content

Two lines of actual logic: calculate the cost, report it. The gateway accumulates it per agent, per intent, per time window. No Prometheus setup. No custom Datadog metrics. No StatsD.
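The pricing arithmetic is worth pulling into its own helper so it can be unit-tested independently of any API call. A minimal sketch (the `estimate_cost` function and its constants are illustrative, not part of the AXME SDK):

```python
# Illustrative GPT-4 rates, per 1K tokens, matching the snippet above.
GPT4_INPUT_PER_1K = 0.03
GPT4_OUTPUT_PER_1K = 0.06

def estimate_cost(tokens_in: int, tokens_out: int) -> float:
    """Estimate the USD cost of one GPT-4 call from its token usage."""
    return (tokens_in * GPT4_INPUT_PER_1K + tokens_out * GPT4_OUTPUT_PER_1K) / 1000

# A typical ticket: 1,200 prompt tokens, 400 completion tokens.
# 1200 * 0.03 / 1000 + 400 * 0.06 / 1000 = 0.036 + 0.024
print(round(estimate_cost(1200, 400), 4))  # 0.06
```

Keeping the rates in named constants also gives you one place to update when pricing changes.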

Budget Limits That Actually Stop Agents

Reporting cost is useful. Enforcing limits is essential.

client.mesh.set_cost_policy(
    agent="agent://myorg/production/research-agent",
    rules=[
        {
            "period": "day",
            "limit_usd": 50.00,
            "action": "pause",
            "notify": ["ops@company.com"],
        },
        {
            "period": "intent",
            "limit_usd": 5.00,
            "action": "alert",
            "notify": ["ops@company.com"],
        },
    ],
)

Three actions: alert sends a notification and the agent continues. pause stops delivering new intents to the agent. kill terminates active intents immediately.

The gateway checks policies on every heartbeat. When the research agent hits $50 for the day, it pauses. Not after $500. Not after the invoice. At $50, in real time.

The per-intent limit catches the more subtle problem. If a single customer ticket costs $5 in LLM calls, something is wrong with that ticket or with how the agent is processing it. Alert before it becomes a pattern.
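Gateway-side, the policy check is conceptually simple: accumulate reported cost per agent and per intent, compare against each rule on every heartbeat, and emit the configured action when a limit is crossed. A rough sketch of that evaluation loop (my guess at the logic, not AXME's actual implementation):

```python
from dataclasses import dataclass, field

@dataclass
class CostRule:
    period: str        # "day" or "intent"
    limit_usd: float
    action: str        # "alert", "pause", or "kill"

@dataclass
class CostTracker:
    rules: list
    day_total: float = 0.0
    intent_totals: dict = field(default_factory=dict)

    def record(self, intent_id: str, cost_usd: float) -> list:
        """Accumulate cost and return the actions triggered by any exceeded rule."""
        self.day_total += cost_usd
        self.intent_totals[intent_id] = self.intent_totals.get(intent_id, 0.0) + cost_usd
        triggered = []
        for rule in self.rules:
            spent = self.day_total if rule.period == "day" else self.intent_totals[intent_id]
            if spent > rule.limit_usd:
                triggered.append(rule.action)
        return triggered

tracker = CostTracker(rules=[
    CostRule("day", 50.00, "pause"),
    CostRule("intent", 5.00, "alert"),
])
tracker.record("ticket-41", 4.80)          # under both limits -> []
print(tracker.record("ticket-41", 0.50))   # intent now $5.30 -> ['alert']
```

The point of checking on every heartbeat rather than on a billing cycle is that the loss is bounded by one reporting interval, not by however long it takes an invoice to arrive.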

What the Dashboard Shows

The AXME mesh dashboard shows cost alongside agent status:

+---------------------+----------+----------+-----------+--------+
| Agent               | Today    | This Wk  | This Mo   | Status |
+---------------------+----------+----------+-----------+--------+
| research-agent      | $12.47   | $47.82   | $189.30   | OK     |
| support-agent       | $8.23    | $31.05   | $142.18   | OK     |
| onboarding-agent    | $3.10    | $14.22   | $58.91    | OK     |
| data-pipeline       | $0.00    | $22.40   | $89.60    | PAUSED |
+---------------------+----------+----------+-----------+--------+

Drill into any agent and you see cost per intent, per model, per hour. The data-pipeline agent is paused because it hit its daily limit yesterday. No surprises.
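The today/week/month columns are the same metric stream rolled up over different trailing windows. A sketch of that aggregation, assuming each reported metric carries a timestamp (a simplification of whatever the gateway actually stores):

```python
from datetime import datetime, timedelta

def rollup(metrics, now):
    """Sum cost_usd over trailing 1-day, 7-day, and 30-day windows.

    `metrics` is a list of (timestamp, cost_usd) pairs for one agent.
    """
    windows = {"today": timedelta(days=1), "week": timedelta(days=7), "month": timedelta(days=30)}
    return {
        name: round(sum(cost for ts, cost in metrics if now - ts <= span), 2)
        for name, span in windows.items()
    }

now = datetime(2025, 1, 20, 9, 0)
metrics = [
    (now - timedelta(hours=2), 12.47),   # this morning
    (now - timedelta(days=3), 35.35),    # earlier this week
    (now - timedelta(days=14), 141.48),  # earlier this month
]
print(rollup(metrics, now))
# {'today': 12.47, 'week': 47.82, 'month': 189.3}
```

The example totals match the research-agent row above: $12.47 today, $47.82 this week, $189.30 this month.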

Multi-Model Visibility

Most agents don't use just one model. They use GPT-4 for complex reasoning, GPT-4o for routine tasks, GPT-4o-mini for classification. Each has different pricing. Each needs separate tracking.

MODEL_COSTS = {
    "gpt-4":       {"input": 0.03,  "output": 0.06},
    "gpt-4o":      {"input": 0.005, "output": 0.015},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
}

def call_llm(model: str, prompt: str) -> str:
    response = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )

    # Look up the per-1K-token rates for whichever model handled this call
    rates = MODEL_COSTS[model]
    cost = (
        response.usage.prompt_tokens * rates["input"]
        + response.usage.completion_tokens * rates["output"]
    ) / 1000

    client.mesh.report_metric(
        agent="agent://myorg/production/research-agent",
        cost_usd=cost,
        intent_id=current_intent_id,
        metadata={"model": model},
    )
    return response.choices[0].message.content

The dashboard breaks it down:

research-agent cost breakdown (today):
  gpt-4:          $9.20  (73.8%)
  gpt-4o:         $2.85  (22.8%)
  gpt-4o-mini:    $0.42  (3.4%)

Now you can ask the right question: can we move some of those GPT-4 calls to GPT-4o-mini and cut costs by 60%?
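That question is answerable with arithmetic: GPT-4o-mini's input rate is 200x cheaper than GPT-4's and its output rate 100x cheaper, so migrating even part of the GPT-4 traffic dominates the total. A back-of-envelope sketch using today's breakdown (the 80% migratable share and the blended 1/150 cost ratio are assumptions for illustration):

```python
# Today's spend per model, from the breakdown above.
spend = {"gpt-4": 9.20, "gpt-4o": 2.85, "gpt-4o-mini": 0.42}

# Assume (illustratively) 80% of GPT-4 spend covers calls simple enough to migrate,
# and that those calls cost ~1/150 as much on gpt-4o-mini (blended input/output ratio).
MIGRATABLE_SHARE = 0.80
MINI_COST_RATIO = 1 / 150

migrated = spend["gpt-4"] * MIGRATABLE_SHARE
new_total = (
    (spend["gpt-4"] - migrated)        # GPT-4 calls that must stay
    + migrated * MINI_COST_RATIO       # migrated calls, now at mini rates
    + spend["gpt-4o"]
    + spend["gpt-4o-mini"]
)
old_total = sum(spend.values())
savings_pct = (1 - new_total / old_total) * 100
print(f"${old_total:.2f} -> ${new_total:.2f} ({savings_pct:.0f}% saved)")
# $12.47 -> $5.16 (59% saved)
```

Under these assumptions the answer is roughly yes: about 60% off the daily bill, and the per-model breakdown is what makes the estimate possible at all.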

The Alternative

Without this, you build it yourself:

  1. Instrument every LLM call with token counting
  2. Send custom metrics to Prometheus/Datadog/CloudWatch
  3. Build dashboards per agent (Grafana? Retool? Custom?)
  4. Write alerting rules with the right thresholds
  5. Build the "pause agent" mechanism yourself
  6. Map OpenAI costs to individual agents in your billing system
  7. Maintain all of this as models and pricing change

That is a real project. Weeks of work. And it is not your product.

Or: report cost_usd in the heartbeat your agent already sends. Set a policy. Done.

Try It

Working example with cost reporting, budget limits, and multi-model tracking:

github.com/AxmeAI/ai-agent-cost-monitoring

Built with AXME - agent coordination with durable lifecycle and cost controls. Alpha - feedback welcome.
