Orvi Das

Hermes Agent ran overnight and I woke up to a $47 bill — so I built a kill-switch

Hermes Agent Challenge Submission: Build With Hermes Agent

This is a submission for the Hermes Agent Challenge


What I built

It was a Tuesday. I gave Hermes Agent a research task before bed: "analyse the top open-source agent frameworks and write a comparison report." Reasonable task. Maybe 10 minutes of work. I'd check the output in the morning.

I woke up to a $47 bill and a 34-page report that no one asked for.

Hermes had hit a tricky subtask around 2am, retried with different approaches, gone deeper on each one, and kept going, because that's what it's supposed to do. Autonomous agent. Autonomy is the feature. The problem is that autonomy doesn't come with a receipt until after you've already paid.

I spent that morning looking for a way to give Hermes a hard spending limit. Not a dashboard alert at $40 that I'd miss while sleeping. A hard stop that fires before the API call, not after. I didn't find one, so I built it.

baar-core is a budget-aware proxy that sits between Hermes and the real LLM providers. Every call Hermes makes goes through a kill-switch first. When a call would push spend past the cap, the proxy answers with 402 Payment Required instead of forwarding it. The provider is never contacted, so the blocked call costs $0.00.

from baar.integrations.hermes import BaarHermesSession

with BaarHermesSession(budget=1.00) as session:
    reply = session.run_task("Research the top 5 open-source agent frameworks")
    print(reply)
    print(f"Spent ${session.spent:.4f} of $1.00")

# It cannot spend $1.01. Not $1.005. $1.00 is the ceiling.

Demo

Baar Demo


Code

GitHub: github.com/orvi2014/Baar-Core

pip install "baar-core[vercel]" hermes-agent

Tech stack

| Component | Role |
|---|---|
| Python 3.10+ | Core library |
| Hermes Agent | The agentic runtime being budget-capped |
| LiteLLM | Unified provider interface + live pricing data |
| FastAPI + uvicorn | Local OpenAI-compatible proxy server |
| SQLite (WAL mode) | Persistent spend store, concurrent-safe |
| pytest | 606 tests, all passing |

How I used Hermes Agent

Hermes doesn't stop. It plans, tool-calls, reflects, and retries until the task is done or you kill the process. That's the whole point of it, and also what caused the $47 bill.

I couldn't change how Hermes works internally, but it lets you point its provider config at any OpenAI-compatible endpoint. So I built one: a local proxy that speaks the OpenAI API and intercepts every LLM call before it leaves the machine.

BaarHermesSession(budget=1.00)
  ├── BaarHermesProxy.start()    ← uvicorn on 127.0.0.1:8080, daemon thread
  └── hermes subprocess          ← HERMES_HOME → temp config pointing to proxy

Each Hermes LLM turn:
  POST /v1/chat/completions → baar proxy (local, no network)
    └── BAARRouter
          ├── pre-flight budget check    → over limit? 402. Zero API calls made.
          ├── complexity routing         → simple task → cheap model, hard task → big model
          └── real provider call via LiteLLM

Hermes thinks it's talking to a provider. Every tool-call, retry, and reflection step goes through the check first.
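To make the pre-flight step concrete, here is a minimal sketch of the decision the proxy has to make before forwarding a request. It is not baar's actual code: the model names, the per-token prices, and the `preflight` function are all hypothetical, and the worst-case estimate (full output allowance) is an assumption.

```python
from dataclasses import dataclass

# Hypothetical per-1K-token prices in USD, for illustration only.
PRICE_PER_1K = {"small-model": 0.0002, "big-model": 0.005}


@dataclass
class Budget:
    cap: float
    spent: float = 0.0

    @property
    def remaining(self) -> float:
        return self.cap - self.spent


def estimate_cost(model: str, prompt_tokens: int, max_output_tokens: int) -> float:
    """Worst-case cost: assume the model uses its full output allowance."""
    return (prompt_tokens + max_output_tokens) / 1000 * PRICE_PER_1K[model]


def preflight(budget: Budget, model: str, prompt_tokens: int, max_output_tokens: int):
    """Return (status, body). A 402 means the provider is never contacted."""
    estimate = estimate_cost(model, prompt_tokens, max_output_tokens)
    if estimate > budget.remaining:
        return 402, {"error": {"type": "budget_exhausted",
                               "estimate": estimate,
                               "remaining": budget.remaining}}
    return 200, {"estimate": estimate}


budget = Budget(cap=1.00, spent=0.999)
print(preflight(budget, "big-model", 1000, 1000)[0])    # blocked
print(preflight(budget, "small-model", 1000, 1000)[0])  # allowed
```

The point of estimating high is that a blocked call is free, while an underestimated call is already paid for by the time you notice.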

Getting the timing right took a few attempts. Most cost trackers record spend when the response arrives, and by then you've already paid. baar instead estimates the cost of each call, atomically reserves that amount, makes the call, then reconciles the reservation against the real cost. Two concurrent Hermes turns can't both pass the check and jointly overshoot the cap, because the reservation step is atomic.
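The reserve-then-reconcile idea can be sketched in a few lines. This is an in-memory illustration with a `threading.Lock`, not baar's SQLite-backed store; `SpendLedger` and its method names are hypothetical, and it assumes each estimate is an upper bound on the real cost.

```python
import threading


class SpendLedger:
    """In-memory stand-in for a spend store (hypothetical, for illustration)."""

    def __init__(self, cap: float):
        self.cap = cap
        self.spent = 0.0      # reconciled, real cost
        self.reserved = 0.0   # estimates for calls still in flight
        self._lock = threading.Lock()

    def reserve(self, estimate: float) -> bool:
        """Check and reserve in one critical section, so two callers
        cannot both see the same headroom."""
        with self._lock:
            if self.spent + self.reserved + estimate > self.cap:
                return False
            self.reserved += estimate
            return True

    def reconcile(self, estimate: float, actual: float) -> None:
        """Release the reservation and record what the call really cost."""
        with self._lock:
            self.reserved -= estimate
            self.spent += actual


ledger = SpendLedger(cap=1.0)
granted = []
workers = [threading.Thread(target=lambda: granted.append(ledger.reserve(0.25)))
           for _ in range(8)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(sum(granted), "of 8 concurrent calls were allowed")
```

Eight threads each ask for $0.25 against a $1.00 cap, and only four get through, whatever order they run in. A check that reads the balance and writes it in two separate steps would let extra calls slip past.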

It also routes to cheaper models as budget runs down

The routing layer scores each request for complexity and picks the model accordingly. Low-complexity turns go to the cheap model, high-complexity turns go to the big one. As the budget runs low, the threshold shifts:

Budget used 0–30%:   complexity > 0.50 → big model
Budget used 60–80%:  complexity > 0.75 → big model
Budget used 95%+:    almost everything → small model

A $1.00 session doesn't just cut off at $1.00. It gets cheaper per turn as the budget depletes. The agent keeps working; it just costs less toward the end.
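One way to picture the shifting threshold is a step function over budget utilization. The sketch below is an assumption-heavy illustration, not baar's routing code: the 0.50 and 0.75 values come from the bands listed above, while the values for the 30–60% and 80–95% bands and the 0.99 used for "almost everything" are invented to fill the gaps.

```python
def big_model_threshold(utilization: float) -> float:
    """Complexity score a request must exceed to be sent to the big model."""
    if utilization <= 0.30:
        return 0.50   # from the post
    if utilization <= 0.60:
        return 0.60   # assumption: this band is not specified in the post
    if utilization <= 0.80:
        return 0.75   # from the post
    if utilization < 0.95:
        return 0.90   # assumption: this band is not specified in the post
    return 0.99       # assumption standing in for "almost everything -> small"


def pick_model(complexity: float, utilization: float) -> str:
    return "big" if complexity > big_model_threshold(utilization) else "small"


for used in (0.10, 0.70, 0.97):
    print(f"{used:.0%} used, complexity 0.78 -> {pick_model(0.78, used)}")
```

The same request (complexity 0.78) gets the big model early in the session and the small one once the budget is nearly gone.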

Alerts, because a silently dead session is also bad

Waking up to a session stuck at $0.999 since 2am, waiting on a cap that already fired, is almost as annoying as the $47 bill. So I added thresholds:

from baar import BAARRouter, BudgetWindow, Alert

def send_slack(message):
    # Placeholder so the example runs; swap in your own Slack webhook call.
    print(f"[slack] {message}")

def warn_at_80(info):
    print(f"⚠️  {info['utilization']*100:.0f}% of daily budget used — "
          f"${info['remaining']:.4f} remaining")

router = BAARRouter(
    budget=5.00,
    window=BudgetWindow.DAILY,   # resets at midnight UTC, no cron needed
    alerts=[
        Alert(threshold=0.8, callback=warn_at_80),
        Alert(threshold=0.95, callback=lambda _: send_slack("Hermes at 95% — check it")),
    ],
)

BudgetWindow.DAILY resets at midnight UTC. Each day gets its own bucket. Historical spend is preserved so you can audit any past session, and alerts re-arm automatically when the new window opens.
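The "no cron needed" part follows from keying spend by UTC date instead of resetting a counter. Here is a rough sketch of that idea and of an alert that fires once per bucket; `bucket_key` and `DailyAlert` are hypothetical names, not part of baar's API.

```python
from datetime import datetime, timezone


def bucket_key(now: datetime) -> str:
    """One bucket per UTC calendar day. Expects a timezone-aware datetime."""
    return now.astimezone(timezone.utc).strftime("%Y-%m-%d")


class DailyAlert:
    def __init__(self, threshold: float):
        self.threshold = threshold
        self._fired_in = None   # bucket key in which the alert last fired

    def should_fire(self, utilization: float, now: datetime) -> bool:
        key = bucket_key(now)
        if utilization < self.threshold or self._fired_in == key:
            return False
        self._fired_in = key    # a new day brings a new key, so it re-arms
        return True


alert = DailyAlert(threshold=0.8)
late = datetime(2025, 1, 31, 23, 59, tzinfo=timezone.utc)
next_day = datetime(2025, 2, 1, 0, 0, tzinfo=timezone.utc)
print(alert.should_fire(0.85, late))      # True: first crossing today
print(alert.should_fire(0.90, late))      # False: already fired in this bucket
print(alert.should_fire(0.85, next_day))  # True: new bucket, alert re-armed
```

Because yesterday's bucket is never overwritten, its spend stays available for auditing.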

A policy engine, for when a single number isn't enough

If you're running Hermes on behalf of multiple users, you need rules. A free-tier user hitting gpt-4o at 60% budget utilization is a problem waiting to happen. An enterprise user getting downgraded to gpt-4o-mini is a different kind of problem.

from baar.core.policy import Policy, Rule

policy = Policy(rules=[
    # Free tier users: force cheap model past 50% spend
    Rule(when={"plan": "free", "utilization": ">= 0.5"}, then="force_small"),

    # Never use big model past 70% budget for anyone
    Rule(when={"utilization": ">= 0.7"}, then="force_small"),

    # Enterprise users get the big model, unless the 70% rule above matched first
    Rule(when={"plan": "enterprise"}, then="force_big"),
])

router = BAARRouter(budget=5.00, policy=policy)

Rules are first-match-wins. You thread user metadata per call from your application layer. System facts like real utilization always override caller context, so users can't spoof their own budget status.

When a block rule fires, baar raises PolicyViolation, distinct from BudgetExhausted. Both carry a facts dict with exactly which rule matched and why.
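First-match-wins evaluation with system facts overriding caller context can be sketched as follows. This is a simplified stand-in, not baar's `Policy` class: the tuple-based rules, the `matches` and `decide` helpers, and the `route_by_complexity` default are all hypothetical.

```python
import operator

OPS = {">=": operator.ge, "<": operator.lt}

RULES = [
    ({"plan": "free", "utilization": ">= 0.5"}, "force_small"),
    ({"utilization": ">= 0.7"}, "force_small"),
    ({"plan": "enterprise"}, "force_big"),
]


def matches(condition: dict, facts: dict) -> bool:
    for key, expected in condition.items():
        actual = facts.get(key)
        parts = expected.split() if isinstance(expected, str) else []
        if len(parts) == 2 and parts[0] in OPS:
            # Comparison such as ">= 0.5"
            if actual is None or not OPS[parts[0]](actual, float(parts[1])):
                return False
        elif actual != expected:
            return False
    return True


def decide(rules, caller_context, system_facts, default="route_by_complexity"):
    # System facts are merged last, so a caller cannot spoof "utilization".
    facts = {**caller_context, **system_facts}
    for condition, action in rules:
        if matches(condition, facts):
            return action   # first match wins; later rules are never consulted
    return default


print(decide(RULES, {"plan": "enterprise"}, {"utilization": 0.20}))
print(decide(RULES, {"plan": "enterprise"}, {"utilization": 0.75}))
print(decide(RULES, {"plan": "free", "utilization": 0.0}, {"utilization": 0.60}))
```

Rule order matters here: in this sketch an enterprise user at 75% utilization hits the 70% rule before the enterprise rule is reached. If enterprise should always win, that rule has to come first.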

The audit log

Every Hermes turn is logged:

for step in session.log.steps:
    print(
        f"Step {step.step_num:2d} | {step.decision.model:<20} | "
        f"${step.cost:.6f} | {step.latency_ms:6.0f}ms | "
        f"{step.decision.reason}"
    )
Sample output:
Step  1 | gpt-4o-mini          | $0.000023 |   412ms | complexity=0.31 → small
Step  2 | gpt-4o               | $0.000891 |  1823ms | complexity=0.78 → big
Step  3 | gpt-4o-mini          | $0.000019 |   388ms | complexity=0.28 → small
Step  4 | gpt-4o-mini          | $0.000021 |   401ms | [POLICY FORCE_SMALL] complexity=0.71
...
Total: $0.003847 of $1.00 (0.38% used)

The forced_by_budget flag on each step tells you whether a model downgrade came from a budget constraint or from a policy decision.

One more thing: a supply chain issue we caught mid-build

While shipping v0.7.0 we found CVE-2026-33634, a supply chain compromise in litellm==1.82.7 and 1.82.8. Since baar-core depends on LiteLLM, any user installing without this fix would pull in the compromised version.

Two defences: an install-time constraint (!=1.82.7,!=1.82.8) so pip never resolves to those versions, and a runtime check that raises at BAARRouter construction if the bad version is already installed. If you have it, baar won't start.
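A runtime guard of this kind needs only the standard library. The sketch below shows the general shape, assuming a hypothetical `assert_safe_litellm` helper and a plain `RuntimeError`; baar's real check and exception type may differ. The install-time half is just the specifier `litellm!=1.82.7,!=1.82.8` in the package's dependency list.

```python
from importlib.metadata import PackageNotFoundError, version

COMPROMISED = {"1.82.7", "1.82.8"}   # the two versions named in the post


def is_compromised(installed: str) -> bool:
    return installed.strip() in COMPROMISED


def assert_safe_litellm() -> None:
    """Raise if a compromised litellm build is installed; no-op if it is absent."""
    try:
        installed = version("litellm")
    except PackageNotFoundError:
        return
    if is_compromised(installed):
        raise RuntimeError(
            f"litellm=={installed} is a known-compromised release; "
            "upgrade before starting"
        )


print(is_compromised("1.82.7"), is_compromised("1.82.9"))
```

Calling a check like this from a constructor covers the case the pip constraint can't: an environment where the bad version was installed before the constraint existed.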


The $47 bill was the useful part of that Tuesday. Turns out "iterate until done" is not a plan when you're paying per iteration and you're asleep.

GitHub: github.com/orvi2014/Baar-Core
pip install "baar-core[vercel]" hermes-agent
