This is a submission for the Hermes Agent Challenge
What I built
It was a Tuesday. I gave Hermes Agent a research task before bed: "analyse the top open-source agent frameworks and write a comparison report." Reasonable task. Maybe 10 minutes of work. I'd check the output in the morning.
I woke up to a $47 bill and a 34-page report that no one asked for.
Hermes had hit a tricky subtask around 2am, retried with different approaches, gone deeper on each one, and kept going, because that's what it's supposed to do. Autonomous agent. Autonomy is the feature. The problem is that autonomy doesn't come with a receipt until after you've already paid.
I spent that morning looking for a way to give Hermes a hard spending limit. Not a dashboard alert at $40 that I'd miss while sleeping. A hard stop that fires before the API call, not after. I didn't find one, so I built it.
baar-core is a budget-aware proxy that sits between Hermes and the real LLM providers. Every call Hermes makes goes through a kill-switch first. When a call would push spend past the cap, it gets 402 Payment Required. The provider is never contacted. Cost: $0.00.
from baar.integrations.hermes import BaarHermesSession

with BaarHermesSession(budget=1.00) as session:
    reply = session.run_task("Research the top 5 open-source agent frameworks")
    print(reply)
    print(f"Spent ${session.spent:.4f} of $1.00")
    # It cannot spend $1.01. Not $1.005. $1.00 is the ceiling.
Demo
Code
GitHub: github.com/orvi2014/Baar-Core
pip install "baar-core[vercel]" hermes-agent
Tech stack
| Component | Role |
|---|---|
| Python 3.10+ | Core library |
| Hermes Agent | The agentic runtime being budget-capped |
| LiteLLM | Unified provider interface + live pricing data |
| FastAPI + uvicorn | Local OpenAI-compatible proxy server |
| SQLite (WAL mode) | Persistent spend store, concurrent-safe |
| pytest | 606 tests, all passing |
How I used Hermes Agent
Hermes doesn't stop. It plans, calls tools, reflects, and retries until the task is done or you kill the process. That's the whole point of it, and also what caused the $47 bill.
I couldn't change how Hermes works internally, but it lets you point its provider config at any OpenAI-compatible endpoint. So I built one: a local proxy that speaks the OpenAI API and intercepts every LLM call before it leaves the machine.
BaarHermesSession(budget=1.00)
├── BaarHermesProxy.start()   ← uvicorn on 127.0.0.1:8080, daemon thread
└── hermes subprocess         ← HERMES_HOME → temp config pointing to proxy

Each Hermes LLM turn:
POST /v1/chat/completions → baar proxy (local, no network)
└── BAARRouter
    ├── pre-flight budget check → over limit? 402. Zero API calls made.
    ├── complexity routing → simple task → cheap model, hard task → big model
    └── real provider call via LiteLLM
Hermes thinks it's talking to a provider. Every tool-call, retry, and reflection step goes through the check first.
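Here is a rough sketch of what a pre-flight decision like that can look like. It assumes a worst-case estimate built from the prompt size plus the request's `max_tokens`; the prices, the `Decision` type, and the function names are made up for illustration and are not baar-core's internals.

```python
from dataclasses import dataclass

# Hypothetical per-token prices in USD. Real values would come from live pricing data.
PRICE_PER_INPUT_TOKEN = 0.15 / 1_000_000
PRICE_PER_OUTPUT_TOKEN = 0.60 / 1_000_000


@dataclass
class Decision:
    status: int      # HTTP status the proxy would answer with
    forward: bool    # True only if the provider should be contacted
    estimate: float  # worst-case cost of this call in USD


def estimate_worst_case(prompt_tokens: int, max_tokens: int) -> float:
    """Upper bound: the whole prompt is billed and the model uses all of max_tokens."""
    return prompt_tokens * PRICE_PER_INPUT_TOKEN + max_tokens * PRICE_PER_OUTPUT_TOKEN


def preflight(spent: float, cap: float, prompt_tokens: int, max_tokens: int) -> Decision:
    """Decide, before any network traffic, whether this call may proceed."""
    estimate = estimate_worst_case(prompt_tokens, max_tokens)
    if spent + estimate > cap:
        # 402 Payment Required: the provider is never contacted.
        return Decision(status=402, forward=False, estimate=estimate)
    return Decision(status=200, forward=True, estimate=estimate)
```

The important property is that the 402 branch returns without ever building a provider request, so a rejected turn costs nothing.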
Getting the timing right took a few attempts. Most cost tracking records spend when the response arrives — by then you've already paid. baar estimates the cost of each call, atomically reserves that amount, makes the call, then reconciles the real cost. Two concurrent Hermes turns can't both pass the check and jointly overshoot the cap, because the reservation step is atomic.
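A minimal sketch of the reserve-then-reconcile idea, using SQLite since that's what the spend store runs on. The table layout and function names are assumptions for illustration, not baar-core's actual schema. The trick is that the check and the reservation happen in a single `UPDATE`, so two callers can't both read "there's room" and then both write.

```python
import sqlite3


def open_ledger(path: str = ":memory:", cap_micro_usd: int = 1_000_000) -> sqlite3.Connection:
    """Create a one-row ledger. Amounts are integer micro-dollars to avoid float drift."""
    conn = sqlite3.connect(path, isolation_level=None)  # autocommit: each statement commits on its own
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ledger ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), cap INTEGER, spent INTEGER, reserved INTEGER)"
    )
    conn.execute("INSERT OR IGNORE INTO ledger VALUES (1, ?, 0, 0)", (cap_micro_usd,))
    return conn


def reserve(conn: sqlite3.Connection, estimate: int) -> bool:
    """Atomically claim `estimate` if it still fits under the cap. False means: answer 402."""
    cur = conn.execute(
        "UPDATE ledger SET reserved = reserved + ? "
        "WHERE id = 1 AND spent + reserved + ? <= cap",
        (estimate, estimate),
    )
    return cur.rowcount == 1  # zero rows updated means the reservation did not fit


def reconcile(conn: sqlite3.Connection, estimate: int, actual: int) -> None:
    """After the provider responds, release the reservation and record the real cost."""
    conn.execute(
        "UPDATE ledger SET reserved = reserved - ?, spent = spent + ? WHERE id = 1",
        (estimate, actual),
    )
```

This only holds the cap if the estimate is a true upper bound on the real cost. If a call can come in above its estimate, the reconcile step can push `spent` past the cap.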
It also routes to cheaper models as budget runs down
The routing layer scores each request for complexity and picks the model accordingly. Low-complexity turns go to the cheap model, high-complexity turns go to the big one. As the budget runs low, the threshold shifts:
Budget used 0–30%: complexity > 0.50 → big model
Budget used 60–80%: complexity > 0.75 → big model
Budget used 95%+: almost everything → small model
A $1.00 session doesn't just cut off at $1.00. It gets cheaper per turn as the budget depletes. The agent keeps working, it just costs less toward the end.
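As a sketch, a shifting threshold like this can be written as a step function over budget utilization. The 0.50 and 0.75 thresholds come from the bands above. Everything else here is my assumption for the example: the 0.99 value standing in for "almost everything", and the choice to let each band's threshold carry forward until the next band starts.

```python
from bisect import bisect_right

# (utilization lower bound, complexity threshold above which the big model is used)
# 0.99 for the last band is a stand-in for "almost everything goes to the small model".
BANDS = [(0.00, 0.50), (0.60, 0.75), (0.95, 0.99)]
_LOWER_BOUNDS = [lower for lower, _ in BANDS]


def big_model_threshold(utilization: float) -> float:
    """Return the complexity threshold for the band that `utilization` falls into."""
    index = bisect_right(_LOWER_BOUNDS, utilization) - 1
    return BANDS[max(index, 0)][1]


def pick_model(complexity: float, utilization: float) -> str:
    """Route to the big model only when complexity clears the current threshold."""
    return "big" if complexity > big_model_threshold(utilization) else "small"
```

With these numbers, a turn scored at 0.6 gets the big model at 10% utilization but the small one at 70%.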
Alerts, because a silently dead session is also bad
Waking up to a session stuck at $0.999 since 2am, waiting on a cap that already fired, is almost as annoying as the $47 bill. So I added thresholds:
from baar import BAARRouter, BudgetWindow, Alert

def send_slack(message):
    # Placeholder: swap in your own Slack notifier.
    print(message)

def warn_at_80(info):
    print(f"⚠️ {info['utilization']*100:.0f}% of daily budget used, "
          f"${info['remaining']:.4f} remaining")

router = BAARRouter(
    budget=5.00,
    window=BudgetWindow.DAILY,  # resets at midnight UTC, no cron needed
    alerts=[
        Alert(threshold=0.8, callback=warn_at_80),
        Alert(threshold=0.95, callback=lambda _: send_slack("Hermes at 95%, check it")),
    ],
)
BudgetWindow.DAILY resets at midnight UTC. Each day gets its own bucket. Historical spend is preserved so you can audit any past session, and the alert re-arms automatically when the new window opens.
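One way to get daily buckets and automatic re-arming without a cron job is to key everything on the UTC date. The sketch below is my illustration of that idea; `window_key` and `DailyAlert` are hypothetical names, not baar-core classes.

```python
from datetime import datetime, timezone


def window_key(now: datetime) -> str:
    """Bucket name for a timestamp: the UTC calendar date, so it rolls over at midnight UTC."""
    return now.astimezone(timezone.utc).strftime("%Y-%m-%d")


class DailyAlert:
    """Fires its callback at most once per daily window."""

    def __init__(self, threshold, callback):
        self.threshold = threshold
        self.callback = callback
        self._fired_in = None  # key of the window in which this alert last fired

    def check(self, utilization: float, now: datetime) -> bool:
        key = window_key(now)
        if utilization >= self.threshold and self._fired_in != key:
            self._fired_in = key
            self.callback({"utilization": utilization, "window": key})
            return True
        return False
```

Because the bucket is derived from the timestamp rather than reset by a job, yesterday's spend can stay in storage under yesterday's key, and the alert re-arms as soon as the key changes. Pass timezone-aware datetimes; `astimezone` treats naive ones as local time.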
A policy engine, for when a single number isn't enough
If you're running Hermes on behalf of multiple users, you need rules. A free tier user hitting gpt-4o at 60% budget utilization is a problem waiting to happen. An enterprise user getting downgraded to gpt-4o-mini is a different kind of problem.
from baar import BAARRouter
from baar.core.policy import Policy, Rule

policy = Policy(rules=[
    # Free tier users: force cheap model past 50% spend
    Rule(when={"plan": "free", "utilization": ">= 0.5"}, then="force_small"),
    # Never use big model past 70% budget for anyone
    Rule(when={"utilization": ">= 0.7"}, then="force_small"),
    # Enterprise users get the big model (below 70%, since the rule above matches first)
    Rule(when={"plan": "enterprise"}, then="force_big"),
])

router = BAARRouter(budget=5.00, policy=policy)
Rules are first-match-wins. You thread user metadata per call from your application layer. System facts like real utilization always override caller context, so users can't spoof their own budget status.
When a block rule fires, baar raises PolicyViolation, distinct from BudgetExhausted. Both carry a facts dict with exactly which rule matched and why.
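To make the first-match-wins and anti-spoofing behaviour concrete, here is a small stand-alone evaluator. It is a sketch under my own assumptions (rules as plain tuples, conditions as either literals or strings like `">= 0.5"`), not how baar-core implements `Policy`.

```python
import operator

_OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}


def _matches(condition, value) -> bool:
    """A condition is a literal compared for equality, or a string like '>= 0.5'."""
    if isinstance(condition, str):
        for symbol in (">=", "<=", ">", "<"):  # check two-character operators first
            if condition.startswith(symbol):
                limit = float(condition[len(symbol):])
                return value is not None and _OPS[symbol](value, limit)
    return condition == value


def decide(rules, caller_context: dict, system_facts: dict):
    """Return the action of the first rule whose conditions all hold, else None."""
    facts = {**caller_context, **system_facts}  # system facts win on key collisions
    for when, then in rules:
        if all(_matches(cond, facts.get(key)) for key, cond in when.items()):
            return then
    return None


RULES = [
    ({"plan": "free", "utilization": ">= 0.5"}, "force_small"),
    ({"utilization": ">= 0.7"}, "force_small"),
    ({"plan": "enterprise"}, "force_big"),
]
```

Merging system facts last is what stops a caller from passing `utilization: 0.0` to dodge a downgrade. It also shows why rule order matters: with this ordering, an enterprise user at 90% utilization hits the second rule before the third.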
The audit log
Every Hermes turn is logged:
for step in session.log.steps:
    print(
        f"Step {step.step_num:2d} | {step.decision.model:<20} | "
        f"${step.cost:.6f} | {step.latency_ms:6.0f}ms | "
        f"{step.decision.reason}"
    )
Step  1 | gpt-4o-mini          | $0.000023 |    412ms | complexity=0.31 → small
Step  2 | gpt-4o               | $0.000891 |   1823ms | complexity=0.78 → big
Step  3 | gpt-4o-mini          | $0.000019 |    388ms | complexity=0.28 → small
Step  4 | gpt-4o-mini          | $0.000021 |    401ms | [POLICY FORCE_SMALL] complexity=0.71
...
Total: $0.003847 of $1.00 (0.38% used)
forced_by_budget on each step tells you whether the model downgrade was a budget constraint or a policy decision.
One more thing: a supply chain issue we caught mid-build
While shipping v0.7.0 we found CVE-2026-33634, a supply chain compromise in litellm==1.82.7 and 1.82.8. Since baar-core depends on LiteLLM, anyone installing baar-core without a guard could have pip resolve to one of the compromised versions.
Two defences: an install-time constraint (!=1.82.7,!=1.82.8) so pip never resolves to those versions, and a runtime check that raises at BAARRouter construction if the bad version is already installed. If you have it, baar won't start.
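The runtime half of that can be sketched with nothing but the standard library. This is an illustration of the approach, with function names I picked for the example, not the code that ships in baar-core.

```python
from importlib import metadata

# Versions named in the advisory above.
COMPROMISED_LITELLM = frozenset({"1.82.7", "1.82.8"})


def check_version(installed: str) -> None:
    """Raise if the given litellm version string is one of the known-bad releases."""
    if installed in COMPROMISED_LITELLM:
        raise RuntimeError(
            f"litellm=={installed} is a compromised release; "
            "upgrade or downgrade before starting."
        )


def assert_litellm_is_safe() -> None:
    """Look up the installed litellm version and refuse to continue if it is bad."""
    try:
        installed = metadata.version("litellm")
    except metadata.PackageNotFoundError:
        return  # not installed, so there is nothing to check
    check_version(installed)
```

Calling `assert_litellm_is_safe()` at the top of a constructor gives the "won't start" behaviour. The install-time half is just the exclusion in the dependency specifier: `litellm!=1.82.7,!=1.82.8`.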
The $47 bill was the useful part of that Tuesday. Turns out "iterate until done" is not a plan when you're paying per iteration and you're asleep.
GitHub: github.com/orvi2014/Baar-Core
pip install "baar-core[vercel]" hermes-agent
