Originally published at awx-shredder.fly.dev/blog
Per-agent daily spend limits: the architecture every AI team needs
Your Slack bot just burned through $847 in four hours because a junior dev accidentally pushed a loop that called gpt-4-turbo on every message edit event. Your customer support agent hit an infinite reasoning loop and racked up $2,300 in o1-preview costs before anyone noticed. These aren't hypothetical scenarios—they're the kind of incidents that happen weekly across AI engineering teams.
The problem isn't that developers are careless. It's that LLM APIs have fundamentally different cost characteristics than traditional APIs. A single malformed request can cost $50. A logic bug can drain thousands before your monitoring alerts even fire. And when you're running multiple agents—research bots, customer support, data analysis tools—the blast radius of a single misconfigured agent can take down your entire API budget.
Why application-level budget checks fail
Most teams start with application-level budget enforcement. You add a counter in your database, increment it on each API call, and check before making requests:
```python
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def call_llm(agent_id: str, messages: list):
    # Check the budget *before* the request...
    current_spend = await db.get_daily_spend(agent_id)
    if current_spend >= DAILY_LIMIT:
        raise BudgetExceededError()

    response = await client.chat.completions.create(
        model="gpt-4-turbo",
        messages=messages,
    )

    # ...but calculate and record the cost *after* the money is already spent
    cost = calculate_cost(response.usage)
    await db.increment_spend(agent_id, cost)
    return response
```
This looks reasonable until you hit production. The cost calculation happens after the API call completes. If your database write fails, you've lost spend tracking. If the process crashes between the API call and the database update, that cost vanishes. Race conditions mean multiple requests can check the budget simultaneously, all see they're under the limit, and fire off requests that collectively exceed it.
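To make the race concrete: suppose DAILY_LIMIT is $10.00 and the agent has already spent $9.50. Ten concurrent requests all read $9.50 before any of them records a cost, so all ten pass the check and fire real API calls. A contrived demonstration, reusing call_llm from above:

```python
import asyncio

async def main():
    messages = [{"role": "user", "content": "Hello"}]
    # Every coroutine awaits db.get_daily_spend() before any of them has
    # written an increment, so every one sees spend under the limit.
    await asyncio.gather(*(call_llm("support-bot", messages) for _ in range(10)))

asyncio.run(main())
```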
More critically: this pattern requires every callsite in your codebase to route through your budget enforcement logic. Third-party libraries that call OpenAI directly bypass it entirely. That LangChain agent you integrated? It's not checking budgets. The new engineer who doesn't know about your internal wrapper? They import openai directly and circumvent everything.
The proxy architecture
The robust solution is budget enforcement at the network layer. Every LLM API call flows through a proxy that:
- Authenticates the agent making the request
- Checks current spend against the daily limit before forwarding to the LLM provider
- Blocks the request immediately if the limit is exceeded
- Records actual costs from the LLM response
- Aggregates spend across all instances of your application
This architecture makes budget enforcement impossible to bypass. Applications can't accidentally route around it because the proxy is configured at the network level via `OPENAI_BASE_URL`. Multiple application instances automatically share the same spend tracking because it's centralized in the proxy.
Here's what the client-side configuration looks like:
```typescript
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: process.env.OPENAI_BASE_URL, // points to the proxy
  defaultHeaders: {
    'X-Agent-ID': 'customer-support-bot'
  }
});

// This call is budget-enforced automatically
const response = await client.chat.completions.create({
  model: 'gpt-4-turbo',
  messages: [{ role: 'user', content: 'Hello' }]
});
```
The proxy intercepts the request, checks whether `customer-support-bot` has budget remaining today, and either forwards it to OpenAI or returns a 429 error. Your application code doesn't need to think about budgets; they're enforced at the infrastructure level.
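From the application's point of view, a blocked request looks like an ordinary rate-limit error. In the Python SDK, for instance, any 429 raises openai.RateLimitError, so a hard budget stop can be caught and handled like normal throttling. A sketch, assuming the proxy returns a standard 429:

```python
import openai

# base_url is read from OPENAI_BASE_URL, which points at the proxy
client = openai.OpenAI(default_headers={"X-Agent-ID": "customer-support-bot"})

try:
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": "Hello"}],
    )
except openai.RateLimitError:
    # The proxy refused the request before any tokens were bought
    response = None  # fall back: queue for tomorrow, use a cheaper model, etc.
```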
Building vs. buying
Implementing a production-grade proxy requires solving several non-trivial problems:
- Streaming support: LLM streaming responses require careful proxy handling to calculate costs from partial responses
- Token counting accuracy: Different models have different pricing for input/output tokens, and your cost calculations need to match OpenAI's billing exactly
- Atomic spend updates: You need transactional guarantees that spend increments don't get lost
- Multi-region deployment: Low latency requires running the proxy close to your application
- Alert fatigue: Teams need warnings before hitting limits, not just hard blocks
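That last point deserves a sketch: alerts are only useful if each threshold fires once per day, not on every request over the line. A small pure function comparing spend before and after each update is enough (threshold values here are illustrative):

```python
ALERT_THRESHOLDS = (0.5, 0.8, 1.0)  # 50% / 80% / 100% of the daily limit

def crossed_thresholds(before: float, after: float, limit: float) -> list[float]:
    """Return the thresholds a spend update crossed, so each alert fires once.

    Example: with a $100 limit, a jump from $49 to $81 returns [0.5, 0.8].
    """
    return [t for t in ALERT_THRESHOLDS if before < t * limit <= after]
```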
For teams that need this now, AWX Shredder is a production-ready proxy that handles all of this. Change `OPENAI_BASE_URL` to `https://awx-shredder.fly.dev/proxy/v1`, set per-agent daily budgets, and get email alerts at 50%/80%/100% thresholds. It's OpenAI-compatible, so existing code works unchanged.
For teams building internally, the core architecture is straightforward:
- Run a lightweight HTTP proxy (Node.js with `http-proxy-middleware` or Python with `aiohttp`)
- Use Redis for atomic spend tracking with daily key expiration (see the sketch after this list)
- Parse token usage from OpenAI responses and multiply by model-specific pricing
- Return 429 errors when budgets are exceeded
- Implement request signing or API keys to authenticate agents
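Here's a minimal sketch of the Redis and pricing pieces, assuming the redis-py asyncio client; agent names, limits, prices, and key names are illustrative:

```python
import datetime

import redis.asyncio as redis

r = redis.Redis()

# Illustrative config; real tables would live in config and track price changes
DAILY_LIMITS = {"customer-support-bot": 50.00}                # USD per agent per day
PRICING = {"gpt-4-turbo": {"input": 10.00, "output": 30.00}}  # USD per 1M tokens

def spend_key(agent_id: str) -> str:
    # One counter per agent per UTC day, e.g. "spend:customer-support-bot:2025-01-30"
    today = datetime.datetime.now(datetime.timezone.utc).date().isoformat()
    return f"spend:{agent_id}:{today}"

async def check_budget(agent_id: str) -> bool:
    current = await r.get(spend_key(agent_id))
    # Unknown agents get a zero budget and are denied by default
    return float(current or 0.0) < DAILY_LIMITS.get(agent_id, 0.0)

async def record_cost(agent_id: str, model: str, usage) -> float:
    prices = PRICING[model]
    cost = (usage.prompt_tokens * prices["input"]
            + usage.completion_tokens * prices["output"]) / 1_000_000
    key = spend_key(agent_id)
    async with r.pipeline(transaction=True) as pipe:
        # INCRBYFLOAT is atomic, so concurrent proxy workers never lose updates;
        # the 48h TTL lets yesterday's counters expire on their own
        pipe.incrbyfloat(key, cost)
        pipe.expire(key, 60 * 60 * 48)
        await pipe.execute()
    return cost
```

The proxy calls check_budget before forwarding (returning a 429 if it fails) and record_cost once the response arrives. Note that check and increment are still two separate steps; a strict implementation would fold them into a single Lua script so a burst of concurrent requests can't slip past the limit together.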
The tricky parts are handling streaming correctly (you need to buffer the response to extract token counts while still streaming to the client) and keeping your pricing table up to date as OpenAI changes model costs.
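Buffering is one approach; recent versions of the API offer another. Setting stream_options to include usage makes OpenAI append a final chunk carrying exact token counts, so the proxy can relay every chunk untouched and bill from the last one. A sketch reusing record_cost from above (send_chunk stands in for whatever writes chunks back to the client; requires a recent openai SDK):

```python
from openai import AsyncOpenAI

upstream = AsyncOpenAI()  # the proxy's own credentials for api.openai.com

async def relay_stream(agent_id: str, body: dict, send_chunk) -> None:
    # body is the client's parsed request (model, messages, ...) minus stream flags
    stream = await upstream.chat.completions.create(
        **body,
        stream=True,
        stream_options={"include_usage": True},  # final chunk carries token usage
    )
    usage = None
    async for chunk in stream:
        if chunk.usage is not None:  # only the terminal chunk has usage set
            usage = chunk.usage
        await send_chunk(chunk)      # forward to the client as we go
    if usage is not None:
        await record_cost(agent_id, body["model"], usage)
```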
The enforcement guarantee
The key insight is that budget enforcement must happen before cost is incurred, not after. Application-level tracking is audit logging. Proxy-level blocking is actual enforcement.
When your proxy returns 429, that request never reaches OpenAI. No tokens are consumed. No cost is charged. The agent is hard-stopped until the daily limit resets. This guarantee—that exceeding a budget is architecturally impossible—is what lets you safely increase agent autonomy without fear of runaway costs.
What to do today
If you're running multiple AI agents in production, implement per-agent spend limits this week. The next production incident will happen—the question is whether it costs $50 or $5,000. Pick a proxy architecture (build or buy), assign realistic daily budgets to each agent (10-20% above their typical daily spend), and configure alerts before you hit limits. Your infrastructure should make expensive mistakes impossible, not just unlikely.