DEV Community

AwxGlobal

Posted on • Originally published at awx-shredder.fly.dev

Preventing CrewAI Budget Overruns: Hard Limits Per Agent Role

A multi-agent CrewAI workflow spun up in production last month and burned through $340 in API costs before anyone noticed. The culprit? A research agent stuck in a loop, making hundreds of GPT-4 calls to refine a single report. The agent eventually completed its task, but the invoice was brutal.

CrewAI's built-in max_rpm and max_iter parameters help, but they're blunt instruments. RPM limits don't account for token usage variance—a single long-context call can cost 50x more than a short one. Iteration limits stop runaway loops but won't catch an agent that makes expensive calls within reasonable iteration counts.
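The variance is easy to see with back-of-the-envelope math. The $0.03/1K input rate below is an illustrative GPT-4-class figure, not current pricing:

```python
# Illustrative GPT-4-class input rate (assumption, not current pricing)
PRICE_PER_1K_INPUT = 0.03

def input_cost(tokens: int) -> float:
    """Dollar cost of a request's input tokens at the illustrative rate."""
    return tokens / 1000 * PRICE_PER_1K_INPUT

short_call = input_cost(2_000)    # quick classification prompt
long_call = input_cost(100_000)   # full-document research context

print(f"short: ${short_call:.2f}  long: ${long_call:.2f}  "
      f"ratio: {long_call / short_call:.0f}x")
```

Both calls count as one request against an RPM limit, but they differ by 50x in cost.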

What you actually need is hard budget enforcement at the agent level, ideally denominated in dollars rather than tokens or requests.

Why Agent-Level Budgets Matter

In a typical CrewAI setup, different agents have wildly different cost profiles:

  • Research agents make many calls with large contexts (expensive)
  • Routing agents make quick classification calls (cheap)
  • Writing agents generate long-form content (moderate, predictable)

A flat budget across all agents means your routing agent's allocation goes unused while your research agent blows past acceptable spend. You need granular control that maps to how each agent actually behaves in production.
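One way to make that granularity concrete is to derive each role's budget from its own usage profile. The call counts and per-call costs below are hypothetical:

```python
# Hypothetical per-role usage profiles: (expected calls/day, avg cost per call in USD)
profiles = {
    "researcher": (80, 0.05),   # many large-context calls
    "router": (500, 0.001),     # cheap classification calls
    "writer": (20, 0.08),       # fewer long-form generations
}

# Budget each role for its own expected spend plus 50% headroom,
# rather than one flat number shared across the crew
budgets = {
    role: round(calls * cost_per_call * 1.5, 2)
    for role, (calls, cost_per_call) in profiles.items()
}
print(budgets)
```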

The Naive Approach: Token Counting

Your first instinct might be to wrap the LLM client and count tokens:

from crewai import Agent, LLM
import tiktoken

class BudgetEnforcedLLM(LLM):
    """Subclasses CrewAI's LLM so agents can use it directly via `llm=`."""

    def __init__(self, model_name, budget_dollars):
        super().__init__(model=model_name)
        self.model_name = model_name
        self.budget_dollars = budget_dollars
        self.spent_dollars = 0.0
        self.encoding = tiktoken.encoding_for_model(model_name)

        # Simplified pricing (actual pricing is more complex and changes often)
        self.price_per_1k_input = 0.03 if "gpt-4" in model_name else 0.0015
        self.price_per_1k_output = 0.06 if "gpt-4" in model_name else 0.002

    def call(self, messages, **kwargs):
        if self.spent_dollars >= self.budget_dollars:
            raise RuntimeError(
                f"Budget exceeded: ${self.spent_dollars:.4f} / ${self.budget_dollars}"
            )

        # Estimate input tokens from the serialized messages
        input_tokens = sum(len(self.encoding.encode(str(m))) for m in messages)

        # CrewAI's LLM.call returns the completion as a string
        response = super().call(messages, **kwargs)

        # Count actual output tokens
        output_tokens = len(self.encoding.encode(response))

        cost = (input_tokens / 1000 * self.price_per_1k_input
                + output_tokens / 1000 * self.price_per_1k_output)
        self.spent_dollars += cost

        return response

# Usage with CrewAI
research_llm = BudgetEnforcedLLM("gpt-4-turbo-preview", budget_dollars=5.0)
researcher = Agent(
    role="Research Analyst",
    goal="Gather comprehensive market data",
    llm=research_llm,
    verbose=True
)

This works in theory, but falls apart in practice:

  1. Pricing complexity: OpenAI's pricing varies by model version, has different rates for cached tokens, and changes frequently
  2. Token counting edge cases: Function calls, vision inputs, and prompt caching make accurate token estimation nearly impossible
  3. No persistence: Restarting your process resets budgets, making daily limits impossible
  4. No observability: You're flying blind until something breaks

Production-Grade Budget Enforcement

The real solution is intercepting calls at the API level, not in your application code. This is where a proxy layer makes sense.

AWX Shredder is built specifically for this use case—it sits between your CrewAI agents and OpenAI's API, enforcing hard budget limits per agent role without requiring code changes beyond swapping the base URL:

import os
from crewai import Agent, Crew, Task, LLM

# Configure different budget headers for each agent role
research_llm = LLM(
    model="gpt-4-turbo-preview",
    base_url="https://awx-shredder.fly.dev/proxy/v1",
    api_key=os.getenv("OPENAI_API_KEY"),
    default_headers={"X-AWX-Agent-ID": "researcher", "X-AWX-Daily-Budget": "5.00"}
)

writer_llm = LLM(
    model="gpt-4-turbo-preview",
    base_url="https://awx-shredder.fly.dev/proxy/v1",
    api_key=os.getenv("OPENAI_API_KEY"),
    default_headers={"X-AWX-Agent-ID": "writer", "X-AWX-Daily-Budget": "3.00"}
)

researcher = Agent(
    role="Research Analyst",
    goal="Gather comprehensive market data",
    llm=research_llm,
    verbose=True
)

writer = Agent(
    role="Content Writer",
    goal="Create engaging reports from research",
    llm=writer_llm,
    verbose=True
)

The proxy tracks actual API costs in real-time and returns HTTP 429 (rate limit) responses the moment an agent exceeds its daily budget. CrewAI handles this gracefully—the agent fails fast rather than racking up surprise costs.

Rolling Your Own Proxy

If you need full control, building a simple budget-enforcing proxy isn't difficult. The minimal viable version needs:

  • A Redis instance for persisting spend by agent ID and date
  • A FastAPI or Express server that proxies to api.openai.com
  • Middleware that checks budget before forwarding requests
  • A parser for OpenAI's response to extract actual costs from the usage field

The tricky parts are handling streaming responses (costs aren't known until the stream completes) and dealing with OpenAI's eventual consistency in reporting costs. You'll also need to build alerting, dashboards, and handle key rotation.
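The budget check and usage-field parsing at the heart of that middleware can be sketched in a few lines. This is a sketch under stated assumptions: an in-memory dict stands in for Redis, the pricing table is illustrative, and streaming is ignored:

```python
from datetime import date

# In-memory ledger standing in for Redis; keys are (agent_id, ISO date)
_spend: dict[tuple[str, str], float] = {}

# Illustrative per-1K-token (input, output) rates; real pricing varies by model
PRICING = {"gpt-4-turbo-preview": (0.01, 0.03)}

def check_budget(agent_id: str, daily_budget: float) -> bool:
    """Middleware step: allow the request only while today's spend is under budget."""
    key = (agent_id, date.today().isoformat())
    return _spend.get(key, 0.0) < daily_budget

def record_usage(agent_id: str, model: str, usage: dict) -> float:
    """Parse the `usage` field of a completed response and record the actual cost."""
    in_rate, out_rate = PRICING[model]
    cost = (usage["prompt_tokens"] / 1000 * in_rate
            + usage["completion_tokens"] / 1000 * out_rate)
    key = (agent_id, date.today().isoformat())
    _spend[key] = _spend.get(key, 0.0) + cost
    return cost
```

In the real proxy, `check_budget` runs before forwarding the request (returning 429 on failure) and `record_usage` runs after the upstream response arrives.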

For most teams, the build-vs-buy calculation favors a managed solution unless budget enforcement is core to your product's IP.

Key Takeaways

  • Set agent-level budgets based on each role's expected behavior, not flat limits across your crew
  • Enforce budgets at the API proxy level, not in application code
  • Use hard blocks (HTTP 429) rather than soft warnings—agents can't learn from budget alerts
  • Start conservative: set daily budgets at 2x your observed costs, then tighten after a week of production data

Today's action: audit your last month of OpenAI bills, break down costs by the logical "agent roles" you're running, and set per-role daily budgets at 150% of the highest daily spend you saw. Hard limits prevent one-off disasters while giving your agents room to operate.
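That 150% rule is a one-liner once you have per-role daily spend; the figures below are made up for illustration:

```python
# Hypothetical per-role daily spend pulled from last month's OpenAI bills
daily_spend = {
    "researcher": [2.10, 3.40, 1.80, 2.95],
    "writer": [0.90, 1.20, 1.05, 0.88],
}

# Set each role's daily budget to 150% of its worst observed day
budgets = {role: round(max(days) * 1.5, 2) for role, days in daily_spend.items()}
print(budgets)
```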
