You deploy a four-agent pipeline that should cost about $0.80 per run. By end of day it has burned through $47 on a single stuck researcher loop. Sound familiar?
If you're running AI agents in production, cost blowups are not a question of if but when. 57% of organizations already have agents in production, yet 90% of agent projects fail within 30 days — and runaway LLM costs are the number one pain point.
The core problem: agents make autonomous decisions about how many LLM calls to issue. A retry loop, an overly verbose chain-of-thought, or a stuck tool call can silently 10x your bill before you notice.
The Current State of Agent Cost Control
Most teams handle this with one of three approaches, all of which fall short:
Manual monitoring. You watch dashboards and kill processes when costs spike. This works until you're asleep, in a meeting, or running 20 agents in parallel.
Provider-level spending caps. OpenAI and Anthropic offer monthly limits, but they're account-wide. You can't set a $5 budget for a specific research pipeline while allowing your coding agent $50.
Gateway proxies (Helicone, Portkey). These require routing all traffic through an external service. They add latency, a point of failure, and vendor lock-in. And they still don't give you per-agent circuit breakers.
What's missing is a framework-native solution: something that hooks directly into CrewAI, AutoGen, or LangGraph at the process level, enforces hard limits before each LLM call, and trips a circuit breaker when things go wrong — without requiring any external infrastructure.
Introducing agent-cost-guardrails
agent-cost-guardrails is an open-source Python library that does exactly this. Pure Python, zero infrastructure, framework-native hooks.
```shell
pip install agent-cost-guardrails
```
Here's what it gives you:
- Hard budget limits — raises BudgetExceededError when spend exceeds your cap
- Per-call token limits — prevents any single LLM call from consuming too many tokens
- Rate limiting — tokens-per-minute sliding window to control burst spend
- Circuit breaker — trips after N consecutive violations, requires manual reset
- Alert callbacks — fire at configurable thresholds (50%, 80%, 100%)
- Cost breakdown — track spend by model and by agent
- Bundled pricing — 30+ models from OpenAI, Anthropic, Google, Mistral, DeepSeek, and Meta
Quick Start: The Context Manager
The simplest way to use it:
```python
from agent_cost_guardrails import BudgetGuard

with BudgetGuard(max_usd=5.00) as guard:
    guard.pre_call_check(estimated_tokens=2000)

    # ... your LLM call here ...

    cost = guard.post_call_record(
        model="gpt-4o",
        input_tokens=1500,
        output_tokens=800
    )
    print(f"Call cost: ${cost:.4f}")

print(guard.cost_report())
```
pre_call_check() validates the budget, rate limit, and circuit breaker before the call happens. post_call_record() tracks the actual spend. If the budget is exceeded, BudgetExceededError stops execution immediately.
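The underlying pattern is simple: project the cost before the call, record actual spend after it, and raise once the cap would be crossed. Here is a stripped-down, self-contained sketch of that idea — not the library's actual implementation; the class name and the flat per-token price are illustrative.

```python
class BudgetExceededError(RuntimeError):
    pass

class MiniGuard:
    """Toy check-then-record guard. Pricing here is made up."""
    PRICE_PER_TOKEN = 0.00001  # illustrative flat rate, USD

    def __init__(self, max_usd: float):
        self.max_usd = max_usd
        self.spent_usd = 0.0

    def pre_call_check(self, estimated_tokens: int) -> None:
        # Block the call if the *projected* total would exceed the cap.
        projected = self.spent_usd + estimated_tokens * self.PRICE_PER_TOKEN
        if projected > self.max_usd:
            raise BudgetExceededError(
                f"projected ${projected:.4f} exceeds cap ${self.max_usd:.2f}"
            )

    def post_call_record(self, tokens_used: int) -> float:
        # Record what the call actually cost.
        cost = tokens_used * self.PRICE_PER_TOKEN
        self.spent_usd += cost
        return cost

guard = MiniGuard(max_usd=0.05)
guard.pre_call_check(estimated_tokens=2000)      # fine: projected $0.02
guard.post_call_record(2000)
try:
    guard.pre_call_check(estimated_tokens=4000)  # projected $0.06 > $0.05
except BudgetExceededError as e:
    print("blocked:", e)
```

The key design choice is failing *before* the call: a post-hoc check would still let the overspending request through.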
CrewAI Integration
CrewAI is the most popular multi-agent framework, and it has the worst cost visibility out of the box. The logging inside Tasks is broken, and there's no built-in token cap.
```python
from crewai import Agent, Task, Crew
from agent_cost_guardrails.integrations import CrewAIGuardrails

def cost_alert(threshold, current, budget):
    print(f"WARNING: {threshold*100:.0f}% of ${budget:.2f} budget used")

guards = CrewAIGuardrails(
    max_usd=5.00,
    max_tokens_per_call=4096,
    on_alert=cost_alert
)
guards.install()

researcher = Agent(
    role="Market Researcher",
    goal="Find competitor pricing data",
    llm="gpt-4o"
)
task = Task(
    description="Research competitor pricing for SaaS analytics tools",
    agent=researcher
)

crew = Crew(agents=[researcher], tasks=[task])
crew.kickoff()

report = guards.cost_report()
print(f"Total: ${report['total_cost_usd']:.4f}")
print(f"By agent: {report['cost_by_agent']}")

guards.uninstall()
```
guards.install() registers @before_llm_call and @after_llm_call hooks globally. Every LLM call CrewAI makes — across all agents and tasks — gets checked and tracked automatically.
AutoGen / AG2 Integration
AutoGen's register_hook() system gives us a clean interception point:
```python
from autogen import AssistantAgent, UserProxyAgent
from agent_cost_guardrails.integrations import AutoGenGuardrails

guards = AutoGenGuardrails(max_usd=10.00)

assistant = AssistantAgent("analyst", llm_config={"model": "gpt-4o"})
proxy = UserProxyAgent("user", human_input_mode="NEVER")

guards.wrap_agent(assistant)
guards.wrap_agent(proxy)

proxy.initiate_chat(assistant, message="Analyze Q1 revenue trends")
print(guards.cost_report())
```
The library uses AG2's safeguard_llm_inputs / safeguard_llm_outputs hooks with automatic fallback to legacy hook names for older AutoGen versions.
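Whatever the hook names, wrapping an agent boils down to replacing its LLM entry point with a guarded version that runs checks before and records usage after. A generic, self-contained sketch of that wrapping pattern (not AG2's actual hook API; `guard_method` and `FakeAgent` are invented for illustration):

```python
from functools import wraps

def guard_method(obj, method_name, before, after):
    """Replace obj.method_name with a version that runs checks around it."""
    original = getattr(obj, method_name)

    @wraps(original)
    def guarded(*args, **kwargs):
        before()           # e.g. budget / rate-limit / breaker checks
        result = original(*args, **kwargs)
        after(result)      # e.g. record actual token usage
        return result

    setattr(obj, method_name, guarded)
    return original  # keep a handle so the wrap can be undone later

class FakeAgent:
    def generate_reply(self, message):
        return f"reply to: {message}"

calls = []
agent = FakeAgent()
guard_method(agent, "generate_reply",
             before=lambda: calls.append("pre"),
             after=lambda r: calls.append("post"))
print(agent.generate_reply("hi"))  # checks fire around the real call
print(calls)                       # ['pre', 'post']
```

Returning the original method is what makes a clean uninstall possible.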
LangGraph / LangChain Integration
LangGraph uses LangChain's callback system, so the integration plugs in via BaseCallbackHandler:
```python
from langgraph.graph import StateGraph
from agent_cost_guardrails.integrations import LangGraphGuardrails

guards = LangGraphGuardrails(max_usd=2.00)

graph = build_your_graph()  # your StateGraph
result = graph.invoke(
    initial_state,
    config={"callbacks": [guards.callback_handler]}
)

report = guards.cost_report()
print(f"Remaining budget: ${report['remaining_usd']:.2f}")
```
The callback handler intercepts on_llm_start, on_chat_model_start, and on_llm_end events. It extracts actual token usage from the response when available, and falls back to tiktoken estimation when not.
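That fallback logic is worth spelling out: trust the provider's usage metadata when it exists, estimate otherwise. A dependency-free sketch of the idea, using a crude chars/4 heuristic in place of tiktoken so the example runs anywhere (the function name and response shape here are assumptions, not the library's API):

```python
def tokens_from_response(response: dict, prompt_text: str) -> tuple[int, bool]:
    """Return (token_count, exact): exact usage if reported, else an estimate."""
    usage = response.get("usage") or {}
    if "total_tokens" in usage:
        return usage["total_tokens"], True
    # Fallback heuristic: roughly 4 characters per token for English text.
    # tiktoken would give a model-accurate count at this step.
    return max(1, len(prompt_text) // 4), False

print(tokens_from_response({"usage": {"total_tokens": 2300}}, "ignored"))
print(tokens_from_response({}, "a" * 100))  # (25, False)
```

Tracking whether a count is exact or estimated matters for budget enforcement: estimates should err on the conservative side.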
Circuit Breaker: Your Safety Net
The circuit breaker is what separates this from basic cost tracking. When an agent enters a failure loop — retrying the same failed tool call, generating invalid outputs, or hitting rate limits — the circuit breaker trips after N consecutive violations and stops all LLM calls until you explicitly reset it.
```python
from agent_cost_guardrails import BudgetGuard

guard = BudgetGuard(
    max_usd=10.00,
    max_tokens_per_call=8192,
    circuit_breaker_max_violations=3
)

# After 3 consecutive per-call violations:
# CircuitBreakerTrippedError is raised on the next pre_call_check().
# All agents stop. No more silent cost accumulation.
```
This is the difference between a $5 mistake and a $500 one.
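The semantics are those of a classic circuit breaker: count consecutive failures, trip open at a threshold, stay open until an explicit reset. A minimal, self-contained sketch of that state machine (illustrative only, not the library's implementation):

```python
class CircuitBreakerTrippedError(RuntimeError):
    pass

class CircuitBreaker:
    """Trips open after `max_violations` consecutive failures; manual reset."""

    def __init__(self, max_violations: int = 3):
        self.max_violations = max_violations
        self.consecutive = 0
        self.tripped = False

    def check(self) -> None:
        # Called before every LLM call; refuses everything while open.
        if self.tripped:
            raise CircuitBreakerTrippedError("breaker open; call reset() to resume")

    def record(self, violated: bool) -> None:
        # A success resets the streak; a violation extends it.
        self.consecutive = self.consecutive + 1 if violated else 0
        if self.consecutive >= self.max_violations:
            self.tripped = True

    def reset(self) -> None:
        self.consecutive = 0
        self.tripped = False

breaker = CircuitBreaker(max_violations=3)
for _ in range(3):
    breaker.check()
    breaker.record(violated=True)  # e.g. per-call token limit exceeded
try:
    breaker.check()                # fourth attempt is refused
except CircuitBreakerTrippedError:
    print("tripped: all calls blocked until reset()")
breaker.reset()
breaker.check()  # allowed again only after an explicit reset
```

Requiring a manual reset is deliberate: an agent stuck in a loop will happily ride out any automatic cool-down.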
Custom Model Pricing
The library ships with pricing for 30+ models, but you can override or extend it:
```python
from agent_cost_guardrails import set_custom_pricing

set_custom_pricing({
    "my-fine-tuned-gpt4": {
        "input_per_mtok": 6.00,
        "output_per_mtok": 18.00,
    }
})
```
Pricing is maintained per million tokens (input and output separately) and supports prefix matching — so gpt-4o-2024-05-13 automatically resolves to the gpt-4o price.
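Prefix matching plus per-million-token rates is straightforward to sketch. The snippet below shows one plausible way to do it (longest-prefix wins, so a more specific entry beats a shorter one); the price figures and function names are illustrative, not the library's table:

```python
PRICING = {
    # USD per million tokens, input and output separately (illustrative numbers)
    "gpt-4o":      {"input_per_mtok": 2.50, "output_per_mtok": 10.00},
    "gpt-4o-mini": {"input_per_mtok": 0.15, "output_per_mtok": 0.60},
}

def resolve_pricing(model: str) -> dict:
    if model in PRICING:
        return PRICING[model]
    # Longest-prefix match: "gpt-4o-2024-05-13" resolves to "gpt-4o",
    # while "gpt-4o-mini-2024-07-18" prefers "gpt-4o-mini" over "gpt-4o".
    candidates = [k for k in PRICING if model.startswith(k)]
    if not candidates:
        raise KeyError(f"no pricing for {model!r}")
    return PRICING[max(candidates, key=len)]

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = resolve_pricing(model)
    return (input_tokens * p["input_per_mtok"]
            + output_tokens * p["output_per_mtok"]) / 1_000_000

print(f"{call_cost('gpt-4o-2024-05-13', 1500, 800):.6f}")  # 0.011750
```

Taking the longest matching prefix is the important detail; a first-match lookup would silently price the mini model at full-size rates.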
Before and After
Before agent-cost-guardrails:
- 4-agent pipeline expected cost: ~$0.80/run
- Actual cost with stuck loops: $3–$12/run
- Worst case (overnight): $47 in a single stuck session
- Discovery: next morning, checking the billing dashboard
After:
- Hard cap: $2.00/run, enforced at the framework level
- Circuit breaker trips after 3 violations — no silent loops
- Alert at 80% budget → you get notified before hitting the cap
- Per-agent breakdown → identify which agent is the problem
Getting Started
```shell
pip install agent-cost-guardrails                # core
pip install "agent-cost-guardrails[crewai]"      # + CrewAI hooks
pip install "agent-cost-guardrails[autogen]"     # + AutoGen hooks
pip install "agent-cost-guardrails[langgraph]"   # + LangGraph callbacks
pip install "agent-cost-guardrails[all]"         # everything
```
The library is MIT-licensed. Source, docs, and examples on GitHub:
- PyPI: agent-cost-guardrails
- GitHub: sapph1re/agent-cost-guardrails
If you're running agents in production and haven't had a cost blowup yet, you will. The question is whether you'll catch it at $2 or at $200.