AwxGlobal

Posted on • Originally published at awx-shredder.fly.dev

Per-Agent Cost Tracking: Why Your LLM Analytics Are Probably Wrong

Your CFO asks why the OpenAI bill jumped 340% last month. You open your dashboard and see... one giant line item. Maybe you've got it broken down by API key, but three teams share the same key. You know the costs are coming from your agent fleet, but which agents? Which conversations? Which models are burning through your budget at 3 AM?

This isn't a hypothetical. I've watched teams spend $40K/month on GPT-4 calls before realizing that a single misconfigured agent was responsible for 60% of it. The agent was retrying failed calls with exponential backoff but no maximum—running for days on edge cases.

The Three Dimensions That Actually Matter

Cost visibility for AI APIs needs three dimensions simultaneously: who (agent/user), what (model/operation), and when (time-series data). Miss any one and you're flying blind.

Per-agent tracking tells you which parts of your system are expensive. Your customer support bot might cost $0.03 per conversation while your code review agent costs $2.40. Without this, you're optimizing in the dark.

Per-model tracking reveals substitution opportunities. If 80% of your GPT-4 calls could use GPT-4o mini, that's a 97% cost reduction on that subset.

Per-hour tracking catches runaway processes and usage patterns. That spike at 2 AM? A batch job that should've been rate-limited. The gradual climb every afternoon? User traffic you can predict and budget for.
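All three rollups amount to a single pass over per-call cost records. Here's a minimal sketch — the record shape and the `summarize` helper are illustrative, not from any particular library:

```python
from collections import defaultdict

def summarize(costs):
    """Roll per-call records up along the three dimensions: who, what, when."""
    by_agent = defaultdict(float)
    by_model = defaultdict(float)
    by_hour = defaultdict(float)
    for row in costs:
        by_agent[row['agent_id']] += row['cost']
        by_model[row['model']] += row['cost']
        by_hour[row['hour']] += row['cost']
    return by_agent, by_model, by_hour

rows = [
    {'agent_id': 'support-bot', 'model': 'gpt-3.5-turbo', 'cost': 0.03, 'hour': '2024-06-01T14:00'},
    {'agent_id': 'code-review', 'model': 'gpt-4-turbo', 'cost': 2.40, 'hour': '2024-06-01T14:00'},
    {'agent_id': 'support-bot', 'model': 'gpt-3.5-turbo', 'cost': 0.02, 'hour': '2024-06-01T15:00'},
]
by_agent, by_model, by_hour = summarize(rows)
```

A "top spenders" dashboard panel is then just a sort over one of these maps.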

Building Cost Tracking Into Your Middleware

The right place to track costs is in your API middleware layer, not scattered across services. Here's a Python example using a decorator pattern that captures all three dimensions:

```python
import logging
from functools import wraps
from datetime import datetime, timezone
from openai import OpenAI

class CostTracker:
    def __init__(self):
        self.costs = []
        # Pricing per 1M tokens (update these regularly)
        self.pricing = {
            'gpt-4-turbo': {'input': 10.0, 'output': 30.0},
            'gpt-4': {'input': 30.0, 'output': 60.0},
            'gpt-3.5-turbo': {'input': 0.5, 'output': 1.5},
        }

    def calculate_cost(self, model, input_tokens, output_tokens):
        if model not in self.pricing:
            logging.warning(f"Unknown model {model}, cost not tracked")
            return 0.0

        input_cost = (input_tokens / 1_000_000) * self.pricing[model]['input']
        output_cost = (output_tokens / 1_000_000) * self.pricing[model]['output']
        return input_cost + output_cost

    def track_call(self, agent_id, model, input_tokens, output_tokens, timestamp):
        cost = self.calculate_cost(model, input_tokens, output_tokens)
        self.costs.append({
            'agent_id': agent_id,
            'model': model,
            'input_tokens': input_tokens,
            'output_tokens': output_tokens,
            'cost': cost,
            'timestamp': timestamp,
            # Truncate to the hour for time-series rollups
            'hour': timestamp.replace(minute=0, second=0, microsecond=0)
        })
        return cost

# One shared tracker, so every decorated function aggregates into the same store
tracker = CostTracker()

def track_llm_call(agent_id):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start_time = datetime.now(timezone.utc)
            result = func(*args, **kwargs)

            # Extract token usage from the API response
            usage = result.usage
            model = result.model

            cost = tracker.track_call(
                agent_id=agent_id,
                model=model,
                input_tokens=usage.prompt_tokens,
                output_tokens=usage.completion_tokens,
                timestamp=start_time
            )

            logging.info(f"Agent {agent_id} | {model} | ${cost:.4f}")
            return result
        return wrapper
    return decorator

# Usage
@track_llm_call(agent_id='customer-support-v2')
def call_gpt(messages):
    client = OpenAI()
    return client.chat.completions.create(
        model='gpt-4-turbo',
        messages=messages
    )
```

This gives you structured logs you can pipe to your analytics stack. But there's a gap: this tracks costs after they happen. You need real-time budgets to prevent runaway spending.

Real-Time Budget Enforcement

Logging is reactive. You need proactive budget enforcement—hard limits that block requests before they hit your bill. This requires a stateful proxy that:

  1. Intercepts all LLM API calls
  2. Tracks cumulative spend per agent in real-time
  3. Blocks requests that would exceed budget
  4. Alerts before you hit limits
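Steps 2 and 3 can be sketched in-process. This is a toy version (the `BudgetGuard` name and `authorize` method are my own, not any product's API); a real proxy would keep spend in shared storage such as Redis so limits hold across workers:

```python
import threading
from datetime import date

class BudgetGuard:
    """Per-agent daily budget check. In-memory sketch only."""

    def __init__(self, daily_budgets):
        self.daily_budgets = daily_budgets  # {agent_id: dollars per day}
        self.spend = {}                     # {(agent_id, date): dollars spent}
        self.lock = threading.Lock()

    def authorize(self, agent_id, estimated_cost):
        """Reserve budget for a call, or block it (step 3)."""
        key = (agent_id, date.today())
        with self.lock:
            spent = self.spend.get(key, 0.0)
            if spent + estimated_cost > self.daily_budgets.get(agent_id, 0.0):
                return False  # would exceed the daily limit: block
            self.spend[key] = spent + estimated_cost
            return True

guard = BudgetGuard({'customer-support-v2': 5.00})
first = guard.authorize('customer-support-v2', 3.00)   # fits under $5
second = guard.authorize('customer-support-v2', 3.00)  # $6 total: blocked
```

Note the design choice: unknown agents default to a zero budget, so an unregistered agent is blocked rather than unmetered.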

For production use, AWX Shredder solves this specific problem as a drop-in OpenAI-compatible proxy. Set OPENAI_BASE_URL=https://awx-shredder.fly.dev/proxy/v1, configure per-agent daily budgets, and it hard-blocks calls the moment an agent exceeds its limit. You get real-time spend tracking, email alerts at 50%/80%/100%, and a dashboard—without changing your application code beyond the base URL.

What to Alert On

Once you have visibility, alert on these signals:

  • Hourly spend rate exceeding daily budget pro-rata: If you've spent 30% of your daily budget in the first 2 hours (8.3% of the day), something's wrong.
  • Per-agent spend deviating >3σ from baseline: Catches both runaway agents and sudden drops (which might indicate broken functionality).
  • Model selection anomalies: If an agent that normally uses GPT-3.5 suddenly switches to GPT-4, investigate.
  • Token efficiency degradation: Track cost-per-successful-operation. If your code review agent's cost-per-review doubles, your prompts might be degrading.
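The first two signals reduce to a few lines each — a sketch, with the helper names `prorata_alert` and `sigma_alert` being mine:

```python
from statistics import mean, stdev

def prorata_alert(spend_so_far, daily_budget, hours_elapsed, factor=2.0):
    """Flag spend running ahead of the day's budget pro-rata.

    factor=2.0 means: alert once you're burning at twice the
    sustainable rate (tune to taste).
    """
    expected = daily_budget * (hours_elapsed / 24)
    return spend_so_far > factor * expected

def sigma_alert(daily_history, today, k=3.0):
    """Flag today's spend deviating more than k sigma from the baseline."""
    if len(daily_history) < 2:
        return False  # not enough history to form a baseline
    mu, sigma = mean(daily_history), stdev(daily_history)
    return abs(today - mu) > k * sigma
```

The example above trips the first check: 30% of the budget in 2 hours is 3.6x the sustainable rate.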

The Data Schema That Scales

Store cost data in a time-series database with this schema:

```
timestamp | agent_id | model | input_tokens | output_tokens | cost | metadata
```

The metadata JSON field is crucial—store request IDs, user IDs, feature flags, anything you might need to slice by later. You can't predict what questions your CFO will ask.

Partition by day and agent_id for query performance. Your most common query will be "show me the last 7 days for agent X."
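Here's a sketch of the schema and that common query, using SQLite as a stand-in (SQLite has no declarative partitioning, so an `(agent_id, timestamp)` index approximates the partition key you'd declare in Postgres or TimescaleDB):

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE llm_costs (
        timestamp     TEXT    NOT NULL,
        agent_id      TEXT    NOT NULL,
        model         TEXT    NOT NULL,
        input_tokens  INTEGER NOT NULL,
        output_tokens INTEGER NOT NULL,
        cost          REAL    NOT NULL,
        metadata      TEXT              -- JSON: request_id, user_id, flags
    )
""")
# Approximates the day + agent_id partitioning of a real time-series store
conn.execute("CREATE INDEX idx_agent_time ON llm_costs (agent_id, timestamp)")

now = datetime.now(timezone.utc).isoformat()
conn.execute(
    "INSERT INTO llm_costs VALUES (?, ?, ?, ?, ?, ?, ?)",
    (now, "customer-support-v2", "gpt-4-turbo", 1200, 300, 0.021,
     '{"request_id": "r-1"}'),
)

# The most common query: last 7 days for one agent
week_cost = conn.execute(
    "SELECT COALESCE(SUM(cost), 0) FROM llm_costs "
    "WHERE agent_id = ? AND timestamp >= datetime('now', '-7 days')",
    ("customer-support-v2",),
).fetchone()[0]
```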

Start Today

If you're not tracking per-agent costs right now, add the decorator pattern above to your highest-spend endpoint this afternoon. Point it at CloudWatch, Datadog, or even a Postgres table. You'll have visibility within hours, not sprints.

Then ask: which agent cost the most yesterday? You probably don't know the answer. By tomorrow, you should.
