Orvi Das

Posted on May 26

Hermes Agent ran overnight and I woke up to a $47 bill — so I built a kill-switch

#devchallenge #hermesagentchallenge #agents #python

Hermes Agent Challenge Submission: Build With Hermes Agent

This is a submission for the Hermes Agent Challenge

What I built

It was a Tuesday. I gave Hermes Agent a research task before bed: "analyse the top open-source agent frameworks and write a comparison report." Reasonable task. Maybe 10 minutes of work. I'd check the output in the morning.

I woke up to a $47 bill and a 34-page report that no one asked for.

Hermes had hit a tricky subtask around 2am, retried with different approaches, gone deeper on each one, and kept going, because that's what it's supposed to do. Autonomous agent. Autonomy is the feature. The problem is that autonomy doesn't come with a receipt until after you've already paid.

I spent that morning looking for a way to give Hermes a hard spending limit. Not a dashboard alert at $40 that I'd miss while sleeping. A hard stop that fires before the API call, not after. I didn't find one, so I built it.

baar-core is a budget-aware proxy that sits between Hermes and the real LLM providers. Every call Hermes makes goes through a kill-switch first. When a call would push spend past the cap, it gets 402 Payment Required. The provider is never contacted. Cost: $0.00.

from baar.integrations.hermes import BaarHermesSession

with BaarHermesSession(budget=1.00) as session:
    reply = session.run_task("Research the top 5 open-source agent frameworks")
    print(reply)
    print(f"Spent ${session.spent:.4f} of $1.00")

# It cannot spend $1.01. Not $1.005. $1.00 is the ceiling.

Demo

Code

GitHub: github.com/orvi2014/Baar-Core

pip install baar-core[vercel] hermes-agent

Tech stack

Component	Role
Python 3.10+	Core library
Hermes Agent	The agentic runtime being budget-capped
LiteLLM	Unified provider interface + live pricing data
FastAPI + uvicorn	Local OpenAI-compatible proxy server
SQLite (WAL mode)	Persistent spend store, concurrent-safe
pytest	606 tests, all passing

How I used Hermes Agent

Hermes doesn't stop. It plans, tool-calls, reflects, retries until the task is done or you kill the process. That's the whole point of it, and also what caused the $47 bill.

I couldn't change how Hermes works internally, but it lets you point its provider config at any OpenAI-compatible endpoint. So I built one: a local proxy that speaks the OpenAI API and intercepts every LLM call before it leaves the machine.

BaarHermesSession(budget=1.00)
  ├── BaarHermesProxy.start()    ← uvicorn on 127.0.0.1:8080, daemon thread
  └── hermes subprocess          ← HERMES_HOME → temp config pointing to proxy

Each Hermes LLM turn:
  POST /v1/chat/completions → baar proxy (local, no network)
    └── BAARRouter
          ├── pre-flight budget check    → over limit? 402. Zero API calls made.
          ├── complexity routing         → simple task → cheap model, hard task → big model
          └── real provider call via LiteLLM

Hermes thinks it's talking to a provider. Every tool-call, retry, and reflection step goes through the check first.

Getting the timing right took a few attempts. Most cost tracking records spend when the response arrives — by then you've already paid. baar estimates the cost of each call, atomically reserves that amount, makes the call, then reconciles the real cost. Two concurrent Hermes turns can't both pass the check and jointly overshoot the cap, because the reservation step is atomic.

It also routes to cheaper models as budget runs down

The routing layer scores each request for complexity and picks the model accordingly. Low-complexity turns go to the cheap model, high-complexity turns go to the big one. As the budget runs low, the threshold shifts:

Budget 0–30%:   complexity > 0.50 → big model
Budget 60–80%:  complexity > 0.75 → big model
Budget 95%+:    almost everything → small model

A $1.00 session doesn't just cut off at $1.00. It gets cheaper per turn as the budget depletes. The agent keeps working, it just costs less toward the end.

Alerts, because a silently dead session is also bad

Waking up to a session stuck at $0.999 since 2am, waiting on a cap that already fired, is almost as annoying as the $47 bill. So I added thresholds:

from baar import BAARRouter, BudgetWindow, Alert

def warn_at_80(info):
    print(f"⚠️  {info['utilization']*100:.0f}% of daily budget used — "
          f"${info['remaining']:.4f} remaining")

router = BAARRouter(
    budget=5.00,
    window=BudgetWindow.DAILY,   # resets at midnight UTC, no cron needed
    alerts=[
        Alert(threshold=0.8, callback=warn_at_80),
        Alert(threshold=0.95, callback=lambda _: send_slack("Hermes at 95% — check it")),
    ],
)

BudgetWindow.DAILY resets at midnight UTC. Each day gets its own bucket. Historical spend is preserved so you can audit any past session, and the alert re-arms automatically when the new window opens.

A policy engine, for when a single number isn't enough

If you're running Hermes on behalf of multiple users, you need rules. A free tier user hitting gpt-4o at 60% budget utilization is a problem waiting to happen. An enterprise user getting downgraded to gpt-4o-mini is a different kind of problem.

from baar.core.policy import Policy, Rule

policy = Policy(rules=[
    # Free tier users: force cheap model past 50% spend
    Rule(when={"plan": "free", "utilization": ">= 0.5"}, then="force_small"),

    # Never use big model past 70% budget for anyone
    Rule(when={"utilization": ">= 0.7"}, then="force_small"),

    # Enterprise users always get the big model
    Rule(when={"plan": "enterprise"}, then="force_big"),
])

router = BAARRouter(budget=5.00, policy=policy)

Rules are first-match-wins. You thread user metadata per call from your application layer. System facts like real utilization always override caller context, so users can't spoof their own budget status.

When a block rule fires, baar raises PolicyViolation, distinct from BudgetExhausted. Both carry a facts dict with exactly which rule matched and why.

The audit log

Every Hermes turn is logged:

for step in session.log.steps:
    print(
        f"Step {step.step_num:2d} | {step.decision.model:<20} | "
        f"${step.cost:.6f} | {step.latency_ms:6.0f}ms | "
        f"{step.decision.reason}"
    )

Step  1 | gpt-4o-mini          | $0.000023 |   412ms | complexity=0.31 → small
Step  2 | gpt-4o               | $0.000891 |  1823ms | complexity=0.78 → big
Step  3 | gpt-4o-mini          | $0.000019 |   388ms | complexity=0.28 → small
Step  4 | gpt-4o-mini          | $0.000021 |   401ms | [POLICY FORCE_SMALL] complexity=0.71
...
Total: $0.003847 of $1.00 (0.38% used)

forced_by_budget on each step tells you whether the model downgrade was a budget constraint or a policy decision.

One more thing: a supply chain issue we caught mid-build

While shipping v0.7.0 we found CVE-2026-33634, a supply chain compromise in litellm==1.82.7 and 1.82.8. Since baar-core depends on LiteLLM, any user installing without this fix would pull in the compromised version.

Two defences: an install-time constraint (!=1.82.7,!=1.82.8) so pip never resolves to those versions, and a runtime check that raises at BAARRouter construction if the bad version is already installed. If you have it, baar won't start.

The $47 bill was the useful part of that Tuesday. Turns out "iterate until done" is not a plan when you're paying per iteration and you're asleep.

GitHub: github.com/orvi2014/Baar-Core — pip install baar-core[vercel] hermes-agent

Top comments (5)

Mykola Kondratiuk • Jun 1

kill switch is a band-aid - stop conditions per subtask defined before the run is the real fix.

Harjot Singh • May 31

A 34-page report nobody asked for is the perfect artifact of the real failure: the agent did exactly what it was built to do (hit a hard subtask, retry deeper, keep going) and the missing piece was a bound, not better behavior. There was nothing telling it 10-minutes-of-work means stop around 10 minutes of work. The kill-switch is the right reflex, and I'd argue it's not optional, it's the single most important thing to build before you ever let an agent run unattended. The dimensions worth bounding: a hard token/cost ceiling per run (the one that would have capped your $47), a wall-clock or step limit, and a no-progress detector so 2am-retry-spirals trip a brake instead of compounding. The deeper principle: an agent will always eventually find a loop or a rabbit hole, so cost discipline has to be a structural limit it cannot exceed, not a hope that the task was small. Unattended autonomy without a ceiling is just a bill waiting to happen. That hard-budget-as-a-guardrail is core to how I build runs in Moonshift. Did your kill-switch end up triggering on cost, on step count, or on a no-progress signal, and which caught the most real runaways?

Theo Valmis • Jun 2

Kill-switches catch the symptom. The deeper issue is usually that the agent was looping on a task it couldn't complete because the constraints weren't pinned down. A budget cap saves the wallet, but tighter task scoping prevents the loop from starting at all.

xulingfeng • May 27

The budget-aware proxy approach is neat — hard stop before the API call is definitely better than an alert you might wake up to at 3am 😅

We had similar paranoia, but went the other direction: pinning the default model to DeepSeek V4 Flash ($0.14/M tokens). Our overnight Hermes runs cost maybe $0.30, and we only escalate to expensive models when we explicitly switch to Pro mode for deep reasoning.

Question though: does the 402 proxy handle streaming calls correctly? We found that with streaming, the token count isn't known upfront, so any kill-switch has to either buffer or have a generous margin. How did you handle that?

Some comments may only be visible to logged-in visitors. Sign in to view all comments.