Alerting on LLM Cost Thresholds: When to Warn vs When to Hard-Block

Last month, our AI-powered support agent racked up $4,800 in OpenAI charges over a weekend. A misconfigured retry loop hit GPT-4 with full conversation history on every attempt. The API never said no—it just kept billing us.

If you're running LLM agents in production, this nightmare scenario is closer than you think. The question isn't whether to set up cost alerts, but how to structure them so you catch problems early without killing legitimate usage.

The Three-Tier Alert Strategy

Most developers instinctively reach for a single budget threshold: "Alert me when we hit $X." But production systems need a graduated response that balances awareness, urgency, and damage control.

Here's what works:

50% threshold: Passive monitoring

Log the event and send a low-priority notification. This is your early warning system. Normal usage patterns should occasionally hit this mark—it means your budget is sized correctly. No action required, just awareness.

80% threshold: Active investigation

Page the on-call engineer or send a high-priority alert. Something unusual is happening. Maybe it's legitimate (an unexpected traffic spike, complex queries), maybe it's a bug. You need human eyes on this before it becomes a crisis (a delivery sketch follows these tiers).

100% threshold: Hard block

Stop all API calls immediately. Yes, this will cause user-facing failures. But uncontrolled cost spirals are worse than temporary downtime. You can always manually override if the spending is legitimate.
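
To make the 80% tier concrete, here's a minimal delivery sketch that posts to a Slack incoming webhook. The webhook URL is a placeholder, and Slack is just one option; swap in PagerDuty or email to match your escalation policy:

import requests

# Placeholder: your Slack incoming-webhook URL
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def alert_high_usage(agent_id, total_cost, usage_pct):
    # 80% tier: high-priority, human-visible alert
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f"Agent {agent_id} at {usage_pct:.0f}% of daily LLM "
                      f"budget (${total_cost:.2f} spent) -- investigate."},
        timeout=5,
    )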

Implementation: Building Your Own Cost Gate

The do-it-yourself approach is to check the budget before each API call and record the actual spend afterward:

import logging
from datetime import datetime, timezone

import redis
from openai import OpenAI

logger = logging.getLogger("llm_budget")

class BudgetExceededError(Exception):
    """Raised when an agent hits 100% of its daily budget."""

class CostGate:
    def __init__(self, redis_client, daily_budget_usd):
        self.redis = redis_client
        self.budget = daily_budget_usd

    def get_today_key(self, agent_id):
        date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
        return f"llm_cost:{agent_id}:{date}"

    def record_cost(self, agent_id, tokens_used, cost_per_1k):
        # Simplification: real pricing differs for input vs. output tokens
        key = self.get_today_key(agent_id)
        cost = (tokens_used / 1000) * cost_per_1k

        # Atomic increment, so concurrent workers can't race each other
        total_cost = self.redis.incrbyfloat(key, cost)
        self.redis.expire(key, 86400 * 2)  # Keep 2 days

        usage_pct = (total_cost / self.budget) * 100

        if usage_pct >= 100:
            raise BudgetExceededError(f"Agent {agent_id} exceeded daily budget")
        elif usage_pct >= 80:
            self.alert_high_usage(agent_id, total_cost, usage_pct)
        elif usage_pct >= 50:
            self.log_milestone(agent_id, total_cost, usage_pct)

        return total_cost

    def check_budget_before_call(self, agent_id, estimated_tokens):
        # A stricter gate would also reserve estimated_tokens' worth of cost here
        key = self.get_today_key(agent_id)
        current_cost = float(self.redis.get(key) or 0)

        if current_cost >= self.budget:
            raise BudgetExceededError(
                f"Agent {agent_id} already at ${current_cost:.2f} of ${self.budget} budget"
            )

    def alert_high_usage(self, agent_id, total_cost, usage_pct):
        # 80% tier: hand off to your pager/Slack integration (see the sketch above)
        logger.warning("Agent %s at %.0f%% of budget ($%.2f)", agent_id, usage_pct, total_cost)

    def log_milestone(self, agent_id, total_cost, usage_pct):
        # 50% tier: low-priority awareness log
        logger.info("Agent %s at %.0f%% of budget ($%.2f)", agent_id, usage_pct, total_cost)

# Usage
client = OpenAI()
gate = CostGate(redis.Redis(), daily_budget_usd=100)

def answer(messages):
    try:
        gate.check_budget_before_call(agent_id="support-bot", estimated_tokens=2000)

        response = client.chat.completions.create(
            model="gpt-4",
            messages=messages,
        )

        gate.record_cost(
            agent_id="support-bot",
            tokens_used=response.usage.total_tokens,
            cost_per_1k=0.03,
        )
        return response
    except BudgetExceededError:
        # Log and fail gracefully
        logger.exception("Budget exceeded for support-bot")
        return {"error": "Service temporarily unavailable"}

This works, but you're reinventing infrastructure. You need:

  • Accurate token counting across multiple models (see the sketch after this list)
  • Rate limiting logic
  • Alert delivery (email, Slack, PagerDuty)
  • Dashboard for visibility
  • Handling of streaming responses
  • Coordination across distributed services
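
Token counting alone is where most hand-rolled gates drift. Here's a minimal sketch using tiktoken with an illustrative price table; the rates below are placeholders (they change often), and the count skips the few tokens of per-message formatting overhead the chat API adds:

import tiktoken

# Illustrative input prices per 1K tokens (USD) -- placeholders, not current rates
PRICE_PER_1K = {
    "gpt-4": 0.03,
    "gpt-3.5-turbo": 0.0005,
}

def estimate_prompt_cost(model, messages):
    enc = tiktoken.encoding_for_model(model)
    # Content tokens only; chat formatting adds a few tokens per message
    tokens = sum(len(enc.encode(m["content"])) for m in messages)
    return tokens, (tokens / 1000) * PRICE_PER_1K[model]

tokens, cost = estimate_prompt_cost("gpt-4", [{"role": "user", "content": "Refund policy?"}])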

When to Use a Proxy Layer

For production systems, wrapping your API client isn't enough. You need enforcement at the network level. If a rogue process bypasses your wrapper, you're back to unlimited billing.

This is where proxy solutions become essential. Tools like AWX Shredder (awx-shredder.fly.dev) sit between your application and OpenAI, enforcing budgets at the protocol level. Change OPENAI_BASE_URL and suddenly every API call—even from legacy code or third-party libraries—goes through the cost gate. It handles the 50%/80%/100% alert progression out of the box.
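
With the official openai Python client (v1+), the reroute is one environment variable or constructor argument. The /v1 path below is an assumption; use whatever base path your proxy documents:

import os
from openai import OpenAI

# Option 1: environment variable, picked up when the client is constructed
os.environ["OPENAI_BASE_URL"] = "https://awx-shredder.fly.dev/v1"  # assumed path
client = OpenAI()

# Option 2: explicit constructor argument
client = OpenAI(base_url="https://awx-shredder.fly.dev/v1")

# Every call -- including from legacy code -- now flows through the proxy
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "ping"}],
)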

The key architectural benefit: your application code stays simple. No cost-tracking logic scattered across microservices. Budget enforcement becomes infrastructure, not application logic.

Sizing Your Thresholds

Daily budgets should be based on P95 usage, not average. If your agent typically costs $50/day but occasionally spikes to $120, a $100 daily budget will cause false positives.

Start with this formula:

daily_budget = (P95_daily_usage * 1.5) + expected_growth_buffer
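
If you have a few weeks of history, the standard library is enough to compute this. A minimal sketch, with illustrative numbers and an assumed growth buffer:

import statistics

# Illustrative history: observed daily spend in USD for one agent
daily_costs = [38.2, 41.5, 44.0, 39.8, 52.1, 47.3, 61.0, 40.2, 43.7, 55.4,
               42.9, 48.8, 39.1, 46.5, 44.2, 58.3, 41.0, 45.6, 50.2, 43.3]

# quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile
p95 = statistics.quantiles(daily_costs, n=20)[18]

growth_buffer = 10.0  # assumption: headroom for expected growth
daily_budget = (p95 * 1.5) + growth_buffer
print(f"P95 = ${p95:.2f} -> daily budget ${daily_budget:.2f}")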

For new agents without usage history, be conservative. A $20/day limit won't break the bank if something goes wrong, and you can increase it once you understand actual patterns.

The Hard-Block Decision

Some developers resist hard-blocking at 100%, fearing user impact. But consider the alternative: a coding error that burns through thousands of dollars before anyone notices.

Hard-blocking is the right default. For agents that genuinely need to exceed daily budgets, implement an explicit override mechanism:

def create_completion(messages, allow_budget_override=False, authorized_by=None, reason=None):
    if allow_budget_override:
        # Separate, monitored path: log the paper trail before bypassing the gate
        logger.warning("Budget override: authorized_by=%s reason=%s", authorized_by, reason)
        return client.chat.completions.create(model="gpt-4", messages=messages)
    else:
        # Route through the cost gate
        return gated_completion(messages)

Overrides should require a paper trail—log who authorized it and why.

What to Do Right Now

  1. Calculate your current daily LLM spend per agent
  2. Set up basic Redis-backed tracking with the code above (or deploy a proxy layer)
  3. Configure 50%/80%/100% alerts with your actual thresholds
  4. Test the hard-block by manually triggering it in staging
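
For step 4, here's a minimal test sketch, assuming the CostGate and BudgetExceededError from the implementation above, with fakeredis standing in for a real Redis:

import fakeredis
import pytest

def test_hard_block_triggers():
    r = fakeredis.FakeRedis()
    gate = CostGate(r, daily_budget_usd=100)

    # Seed today's key as if the agent already burned the full budget
    r.set(gate.get_today_key("support-bot"), 100.0)

    with pytest.raises(BudgetExceededError):
        gate.check_budget_before_call("support-bot", estimated_tokens=1)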

The first time that 100% threshold stops a runaway agent from emptying your API budget, you'll wonder how you ever ran LLMs without it.
