Originally published at awx-shredder.fly.dev/blog
Per-agent daily spend limits: the architecture every AI team needs
Your Slack bot just burned through $847 in four hours because a junior dev accidentally pushed a loop that called gpt-4-turbo on every message edit event. Your customer support agent hit an infinite reasoning loop and racked up $2,300 in o1-preview costs before anyone noticed. These aren't hypothetical scenarios—they're the kind of incidents that happen weekly across AI engineering teams.
The problem isn't that developers are careless. It's that LLM APIs have fundamentally different cost characteristics than traditional APIs. A single malformed request can cost $50. A logic bug can drain thousands before your monitoring alerts even fire. And when you're running multiple agents—research bots, customer support, data analysis tools—the blast radius of a single misconfigured agent can take down your entire API budget.
Why application-level budget checks fail
Most teams start with application-level budget enforcement. You add a counter in your database, increment it on each API call, and check before making requests:
```python
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def call_llm(agent_id: str, messages: list):
    # Check the budget *before* the request...
    current_spend = await db.get_daily_spend(agent_id)
    if current_spend >= DAILY_LIMIT:
        raise BudgetExceededError()

    response = await client.chat.completions.create(
        model="gpt-4-turbo",
        messages=messages,
    )

    # ...but calculate and record the cost *after* the money is already spent
    cost = calculate_cost(response.usage)
    await db.increment_spend(agent_id, cost)
    return response
```
This looks reasonable until you hit production. The cost calculation happens after the API call completes. If your database write fails, you've lost spend tracking. If the process crashes between the API call and the database update, that cost vanishes. Race conditions mean multiple requests can check the budget simultaneously, all see they're under the limit, and fire off requests that collectively exceed it.
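To make the race concrete: suppose DAILY_LIMIT is $10.00 and the agent has already spent $9.50. Ten concurrent requests all read $9.50 before any of them records a cost, so all ten pass the check and fire real API calls. A contrived demonstration, reusing call_llm from above:

```python
import asyncio

async def main():
    messages = [{"role": "user", "content": "Hello"}]
    # Every coroutine awaits db.get_daily_spend() before any of them has
    # written an increment, so every one sees spend under the limit.
    await asyncio.gather(*(call_llm("support-bot", messages) for _ in range(10)))

asyncio.run(main())
```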
More critically: this pattern requires every callsite in your codebase to route through your budget enforcement logic. Third-party libraries that call OpenAI directly bypass it entirely. That LangChain agent you integrated? It's not checking budgets. The new engineer who doesn't know about your internal wrapper? They import openai directly and circumvent everything.
The proxy architecture
The robust solution is budget enforcement at the network layer. Every LLM API call flows through a proxy that:
- Authenticates the agent making the request
- Checks current spend against the daily limit before forwarding to the LLM provider
- Blocks the request immediately if the limit is exceeded
- Records actual costs from the LLM response
- Aggregates spend across all instances of your application
This architecture makes budget enforcement impossible to bypass. Applications can't accidentally route around it because the proxy is configured at the network level via `OPENAI_BASE_URL`. Multiple application instances automatically share the same spend tracking because it's centralized in the proxy.
Here's what the client-side configuration looks like:
```typescript
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: process.env.OPENAI_BASE_URL, // points to the proxy
  defaultHeaders: {
    'X-Agent-ID': 'customer-support-bot'
  }
});

// This call is budget-enforced automatically
const response = await client.chat.completions.create({
  model: 'gpt-4-turbo',
  messages: [{ role: 'user', content: 'Hello' }]
});
```
The proxy intercepts the request, checks whether `customer-support-bot` has budget remaining today, and either forwards it to OpenAI or returns a 429 error. Your application code doesn't need to think about budgets; they're enforced at the infrastructure level.
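From the application's point of view, a blocked request looks like an ordinary rate-limit error. In the Python SDK, for instance, any 429 raises openai.RateLimitError, so a hard budget stop can be caught and handled like normal throttling. A sketch, assuming the proxy returns a standard 429:

```python
import openai

# base_url is read from OPENAI_BASE_URL, which points at the proxy
client = openai.OpenAI(default_headers={"X-Agent-ID": "customer-support-bot"})

try:
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": "Hello"}],
    )
except openai.RateLimitError:
    # The proxy refused the request before any tokens were bought
    response = None  # fall back: queue for tomorrow, use a cheaper model, etc.
```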
Building vs. buying
Implementing a production-grade proxy requires solving several non-trivial problems:
- Streaming support: LLM streaming responses require careful proxy handling to calculate costs from partial responses
- Token counting accuracy: Different models have different pricing for input/output tokens, and your cost calculations need to match OpenAI's billing exactly
- Atomic spend updates: You need transactional guarantees that spend increments don't get lost
- Multi-region deployment: Low latency requires running the proxy close to your application
- Alert fatigue: Teams need warnings before hitting limits, not just hard blocks
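That last point deserves a sketch: alerts are only useful if each threshold fires once per day, not on every request over the line. A small pure function comparing spend before and after each update is enough (threshold values here are illustrative):

```python
ALERT_THRESHOLDS = (0.5, 0.8, 1.0)  # 50% / 80% / 100% of the daily limit

def crossed_thresholds(before: float, after: float, limit: float) -> list[float]:
    """Return the thresholds a spend update crossed, so each alert fires once.

    Example: with a $100 limit, a jump from $49 to $81 returns [0.5, 0.8].
    """
    return [t for t in ALERT_THRESHOLDS if before < t * limit <= after]
```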
For teams that need this now, AWX Shredder is a production-ready proxy that handles all of this. Change `OPENAI_BASE_URL` to `https://awx-shredder.fly.dev/proxy/v1`, set per-agent daily budgets, and get email alerts at 50%/80%/100% thresholds. It's OpenAI-compatible, so existing code works unchanged.
For teams building internally, the core architecture is straightforward:
- Run a lightweight HTTP proxy (Node.js with `http-proxy-middleware` or Python with `aiohttp`)
- Use Redis for atomic spend tracking with daily key expiration (see the sketch after this list)
- Parse token usage from OpenAI responses and multiply by model-specific pricing
- Return 429 errors when budgets are exceeded
- Implement request signing or API keys to authenticate agents
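Here's a minimal sketch of the Redis and pricing pieces, assuming the redis-py asyncio client; agent names, limits, prices, and key names are illustrative:

```python
import datetime

import redis.asyncio as redis

r = redis.Redis()

# Illustrative config; real tables would live in config and track price changes
DAILY_LIMITS = {"customer-support-bot": 50.00}                # USD per agent per day
PRICING = {"gpt-4-turbo": {"input": 10.00, "output": 30.00}}  # USD per 1M tokens

def spend_key(agent_id: str) -> str:
    # One counter per agent per UTC day, e.g. "spend:customer-support-bot:2025-01-30"
    today = datetime.datetime.now(datetime.timezone.utc).date().isoformat()
    return f"spend:{agent_id}:{today}"

async def check_budget(agent_id: str) -> bool:
    current = await r.get(spend_key(agent_id))
    # Unknown agents get a zero budget and are denied by default
    return float(current or 0.0) < DAILY_LIMITS.get(agent_id, 0.0)

async def record_cost(agent_id: str, model: str, usage) -> float:
    prices = PRICING[model]
    cost = (usage.prompt_tokens * prices["input"]
            + usage.completion_tokens * prices["output"]) / 1_000_000
    key = spend_key(agent_id)
    async with r.pipeline(transaction=True) as pipe:
        # INCRBYFLOAT is atomic, so concurrent proxy workers never lose updates;
        # the 48h TTL lets yesterday's counters expire on their own
        pipe.incrbyfloat(key, cost)
        pipe.expire(key, 60 * 60 * 48)
        await pipe.execute()
    return cost
```

The proxy calls check_budget before forwarding (returning a 429 if it fails) and record_cost once the response arrives. Note that check and increment are still two separate steps; a strict implementation would fold them into a single Lua script so a burst of concurrent requests can't slip past the limit together.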
The tricky parts are handling streaming correctly (you need to buffer the response to extract token counts while still streaming to the client) and keeping your pricing table up to date as OpenAI changes model costs.
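Buffering is one approach; recent versions of the API offer another. Setting stream_options to include usage makes OpenAI append a final chunk carrying exact token counts, so the proxy can relay every chunk untouched and bill from the last one. A sketch reusing record_cost from above (send_chunk stands in for whatever writes chunks back to the client; requires a recent openai SDK):

```python
from openai import AsyncOpenAI

upstream = AsyncOpenAI()  # the proxy's own credentials for api.openai.com

async def relay_stream(agent_id: str, body: dict, send_chunk) -> None:
    # body is the client's parsed request (model, messages, ...) minus stream flags
    stream = await upstream.chat.completions.create(
        **body,
        stream=True,
        stream_options={"include_usage": True},  # final chunk carries token usage
    )
    usage = None
    async for chunk in stream:
        if chunk.usage is not None:  # only the terminal chunk has usage set
            usage = chunk.usage
        await send_chunk(chunk)      # forward to the client as we go
    if usage is not None:
        await record_cost(agent_id, body["model"], usage)
```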
The enforcement guarantee
The key insight is that budget enforcement must happen before cost is incurred, not after. Application-level tracking is audit logging. Proxy-level blocking is actual enforcement.
When your proxy returns 429, that request never reaches OpenAI. No tokens are consumed. No cost is charged. The agent is hard-stopped until the daily limit resets. This guarantee—that exceeding a budget is architecturally impossible—is what lets you safely increase agent autonomy without fear of runaway costs.
What to do today
If you're running multiple AI agents in production, implement per-agent spend limits this week. The next production incident will happen—the question is whether it costs $50 or $5,000. Pick a proxy architecture (build or buy), assign realistic daily budgets to each agent (10-20% above their typical daily spend), and configure alerts before you hit limits. Your infrastructure should make expensive mistakes impossible, not just unlikely.