<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Orvi Das</title>
    <description>The latest articles on DEV Community by Orvi Das (@robat_das_3c6e956212f6408).</description>
    <link>https://dev.to/robat_das_3c6e956212f6408</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3935636%2F521b2b21-6827-4e04-9db2-bdbe1f261ed0.jpg</url>
      <title>DEV Community: Orvi Das</title>
      <link>https://dev.to/robat_das_3c6e956212f6408</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/robat_das_3c6e956212f6408"/>
    <language>en</language>
    <item>
      <title>Hermes Agent ran overnight and I woke up to a $47 bill — so I built a kill-switch</title>
      <dc:creator>Orvi Das</dc:creator>
      <pubDate>Tue, 26 May 2026 21:05:23 +0000</pubDate>
      <link>https://dev.to/robat_das_3c6e956212f6408/my-ai-agent-ran-overnight-and-i-woke-up-to-a-47-bill-so-i-built-a-kill-switch-3c9</link>
      <guid>https://dev.to/robat_das_3c6e956212f6408/my-ai-agent-ran-overnight-and-i-woke-up-to-a-47-bill-so-i-built-a-kill-switch-3c9</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/hermes-agent-2026-05-15"&gt;Hermes Agent Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I built
&lt;/h2&gt;

&lt;p&gt;It was a Tuesday. I gave Hermes Agent a research task before bed: "analyse the top open-source agent frameworks and write a comparison report." Reasonable task. Maybe 10 minutes of work. I'd check the output in the morning.&lt;/p&gt;

&lt;p&gt;I woke up to a $47 bill and a 34-page report that no one asked for.&lt;/p&gt;

&lt;p&gt;Hermes had hit a tricky subtask around 2am, retried with different approaches, gone deeper on each one, and kept going, because that's what it's supposed to do. Autonomous agent. Autonomy is the feature. The problem is that autonomy doesn't come with a receipt until after you've already paid.&lt;/p&gt;

&lt;p&gt;I spent that morning looking for a way to give Hermes a hard spending limit. Not a dashboard alert at $40 that I'd miss while sleeping. A hard stop that fires &lt;em&gt;before&lt;/em&gt; the API call, not after. I didn't find one, so I built it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;baar-core&lt;/strong&gt; is a budget-aware proxy that sits between Hermes and the real LLM providers. Every call Hermes makes goes through a kill-switch first. When a call would push spend past the cap, it gets &lt;code&gt;402 Payment Required&lt;/code&gt;. The provider is never contacted. Cost: $0.00.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;baar.integrations.hermes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaarHermesSession&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;BaarHermesSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.00&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;reply&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research the top 5 open-source agent frameworks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Spent $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spent&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; of $1.00&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# It cannot spend $1.01. Not $1.005. $1.00 is the ceiling.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnx61mjm09eqe46hchfiw.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnx61mjm09eqe46hchfiw.gif" alt="Baar Demo" width="800" height="498"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/orvi2014/Baar-Core" rel="noopener noreferrer"&gt;github.com/orvi2014/Baar-Core&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;baar-core[vercel] hermes-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Tech stack
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Python 3.10+&lt;/td&gt;
&lt;td&gt;Core library&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hermes Agent&lt;/td&gt;
&lt;td&gt;The agentic runtime being budget-capped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiteLLM&lt;/td&gt;
&lt;td&gt;Unified provider interface + live pricing data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FastAPI + uvicorn&lt;/td&gt;
&lt;td&gt;Local OpenAI-compatible proxy server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQLite (WAL mode)&lt;/td&gt;
&lt;td&gt;Persistent spend store, concurrent-safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pytest&lt;/td&gt;
&lt;td&gt;606 tests, all passing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  How I used Hermes Agent
&lt;/h2&gt;

&lt;p&gt;Hermes doesn't stop. It plans, calls tools, reflects, and retries until the task is done or you kill the process. That's the whole point of it, and also what caused the $47 bill.&lt;/p&gt;

&lt;p&gt;I couldn't change how Hermes works internally, but it lets you point its provider config at any OpenAI-compatible endpoint. So I built one: a local proxy that speaks the OpenAI API and intercepts every LLM call before it leaves the machine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BaarHermesSession(budget=1.00)
  ├── BaarHermesProxy.start()    ← uvicorn on 127.0.0.1:8080, daemon thread
  └── hermes subprocess          ← HERMES_HOME → temp config pointing to proxy

Each Hermes LLM turn:
  POST /v1/chat/completions → baar proxy (local, no network)
    └── BAARRouter
          ├── pre-flight budget check    → over limit? 402. Zero API calls made.
          ├── complexity routing         → simple task → cheap model, hard task → big model
          └── real provider call via LiteLLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hermes thinks it's talking to a provider. Every tool-call, retry, and reflection step goes through the check first.&lt;/p&gt;

&lt;p&gt;Getting the timing right took a few attempts. Most cost tracking records spend when the response arrives — by then you've already paid. baar estimates the cost of each call, atomically reserves that amount, makes the call, then reconciles the real cost. Two concurrent Hermes turns can't both pass the check and jointly overshoot the cap, because the reservation step is atomic.&lt;/p&gt;
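&lt;p&gt;A minimal sketch of that reserve-then-reconcile flow, assuming an in-process lock instead of the SQLite store. &lt;code&gt;SpendLedger&lt;/code&gt;, &lt;code&gt;OverBudget&lt;/code&gt;, and the integer micro-dollar units are hypothetical names chosen for illustration, not baar-core's actual API.&lt;/p&gt;

```python
import threading


class OverBudget(Exception):
    """Raised before any provider call when the cap would be exceeded."""


class SpendLedger:
    """Hypothetical in-memory ledger; amounts are integer micro-dollars."""

    def __init__(self, cap_micros: int) -> None:
        self.cap = cap_micros
        self.spent = 0      # reconciled real cost
        self.reserved = 0   # estimates held for in-flight calls
        self._lock = threading.Lock()

    def reserve(self, estimate: int) -> None:
        # The check and the hold share one lock, so two concurrent turns
        # cannot both pass the check and jointly overshoot the cap.
        with self._lock:
            if self.spent + self.reserved + estimate > self.cap:
                raise OverBudget("call would exceed the cap")
            self.reserved += estimate

    def reconcile(self, estimate: int, actual: int) -> None:
        # Release the hold and record what the provider really charged.
        with self._lock:
            self.reserved -= estimate
            self.spent += actual
```

&lt;p&gt;In this sketch the cap only holds if each estimate is an upper bound on the real cost. A call whose actual cost exceeds its estimate can still overshoot at reconcile time.&lt;/p&gt;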

&lt;h3&gt;
  
  
  It also routes to cheaper models as budget runs down
&lt;/h3&gt;

&lt;p&gt;The routing layer scores each request for complexity and picks the model accordingly. Low-complexity turns go to the cheap model, high-complexity turns go to the big one. As the budget runs low, the threshold shifts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Budget used 0–30%:   complexity &amp;gt; 0.50 → big model
Budget used 60–80%:  complexity &amp;gt; 0.75 → big model
Budget used 95%+:    almost everything → small model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A $1.00 session doesn't just cut off at $1.00. It gets cheaper per turn as the budget depletes. The agent keeps working; it just costs less toward the end.&lt;/p&gt;
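&lt;p&gt;The shifting threshold could be sketched as a step function over budget utilization. The 0.50 and 0.75 values come from the bands listed earlier. Where the steps fall between those bands, the 0.95 value standing in for "almost everything", and the function names are all assumptions made for this example.&lt;/p&gt;

```python
from bisect import bisect_right

# Utilization at which each step starts. Assumed: the post does not say
# where the steps fall between its published bands.
BREAKPOINTS = [0.60, 0.95]
# Complexity a turn must exceed to get the big model. 0.50 and 0.75 are
# from the post; 0.95 is an assumed stand-in for "almost everything".
THRESHOLDS = [0.50, 0.75, 0.95]


def big_model_threshold(utilization: float) -> float:
    """Return the complexity threshold in force at this utilization."""
    return THRESHOLDS[bisect_right(BREAKPOINTS, utilization)]


def pick_model(complexity: float, utilization: float) -> str:
    """Return "big" only when complexity exceeds the current threshold."""
    if complexity > big_model_threshold(utilization):
        return "big"
    return "small"


print(pick_model(0.6, 0.10))  # under 60% used, threshold is 0.50
print(pick_model(0.6, 0.70))  # 60% or more used, threshold is 0.75
```

&lt;p&gt;Keeping the breakpoints and thresholds in two parallel lists means a band can be added or moved without touching the routing logic.&lt;/p&gt;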

&lt;h3&gt;
  
  
  Alerts, because a silently dead session is also bad
&lt;/h3&gt;

&lt;p&gt;Waking up to a session stuck at $0.999 since 2am, waiting on a cap that already fired, is almost as annoying as the $47 bill. So I added thresholds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;baar&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BAARRouter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BudgetWindow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Alert&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;warn_at_80&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⚠️  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utilization&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;% of daily budget used — &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;remaining&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; remaining&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BAARRouter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;5.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BudgetWindow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DAILY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# resets at midnight UTC, no cron needed
&lt;/span&gt;    &lt;span class="n"&gt;alerts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;Alert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;warn_at_80&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;Alert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;send_slack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hermes at 95% — check it&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;BudgetWindow.DAILY&lt;/code&gt; resets at midnight UTC. Each day gets its own bucket. Historical spend is preserved so you can audit any past session, and each alert re-arms automatically when the new window opens.&lt;/p&gt;
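&lt;p&gt;One way the fire-once, re-arm-daily behaviour might work is to remember the UTC date of the last firing. This is a sketch under that assumption. &lt;code&gt;ThresholdAlert&lt;/code&gt; and its &lt;code&gt;check&lt;/code&gt; method are hypothetical and are not the &lt;code&gt;Alert&lt;/code&gt; class baar-core exports.&lt;/p&gt;

```python
from datetime import datetime, timezone


class ThresholdAlert:
    """Hypothetical alert that fires at most once per UTC day."""

    def __init__(self, threshold: float, callback) -> None:
        self.threshold = threshold
        self.callback = callback
        self._fired_on = None  # UTC date of the last firing, if any

    def check(self, utilization: float, now: datetime) -> bool:
        """Fire the callback if the threshold is crossed and not yet fired today.

        `now` must be timezone-aware so the UTC date is unambiguous.
        """
        today = now.astimezone(timezone.utc).date()
        if utilization >= self.threshold and self._fired_on != today:
            self._fired_on = today
            self.callback({"utilization": utilization})
            return True
        return False
```

&lt;p&gt;Because the stored date changes at midnight UTC, no explicit reset step is needed: the first check on a new day that crosses the threshold fires again.&lt;/p&gt;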

&lt;h3&gt;
  
  
  A policy engine, for when a single number isn't enough
&lt;/h3&gt;

&lt;p&gt;If you're running Hermes on behalf of multiple users, you need rules. A free tier user hitting gpt-4o at 60% budget utilization is a problem waiting to happen. An enterprise user getting downgraded to gpt-4o-mini is a different kind of problem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;baar.core.policy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Policy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Rule&lt;/span&gt;

&lt;span class="n"&gt;policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="c1"&gt;# Free tier users: force cheap model past 50% spend
&lt;/span&gt;    &lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;free&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utilization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;= 0.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;force_small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;

    &lt;span class="c1"&gt;# Never use big model past 70% budget for anyone
&lt;/span&gt;    &lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utilization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;= 0.7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;force_small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;

    &lt;span class="c1"&gt;# Enterprise users always get the big model
&lt;/span&gt;    &lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enterprise&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;force_big&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BAARRouter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;5.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



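&lt;p&gt;Rules are first-match-wins, so order matters: as listed above, an enterprise user at 70% utilization or more matches the second rule and gets the small model before the enterprise rule is ever checked. Move the enterprise rule to the top if it should take precedence. You thread user metadata per call from your application layer. System facts like real utilization always override caller context, so users can't spoof their own budget status.&lt;/p&gt;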
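&lt;p&gt;To make the first-match-wins and override behaviour concrete, here is a small stand-alone evaluator. It is a sketch, not baar-core's policy engine: rules are plain &lt;code&gt;(when, then)&lt;/code&gt; tuples, and &lt;code&gt;matches&lt;/code&gt; and &lt;code&gt;decide&lt;/code&gt; are hypothetical helpers that only understand the comparison form used in the rules above.&lt;/p&gt;

```python
# The three rules from the post, rewritten as (when, then) tuples.
RULES = [
    ({"plan": "free", "utilization": ">= 0.5"}, "force_small"),
    ({"utilization": ">= 0.7"}, "force_small"),
    ({"plan": "enterprise"}, "force_big"),
]


def matches(when: dict, facts: dict) -> bool:
    """Return True when every condition in `when` holds for `facts`."""
    for key, expected in when.items():
        value = facts.get(key)
        if isinstance(expected, str) and expected.startswith(">="):
            # Only the ">=" comparison from the post's examples is handled.
            bound = float(expected[2:])
            if value is None or not (value >= bound):
                return False
        elif value != expected:
            return False
    return True


def decide(rules: list, caller_context: dict, system_facts: dict):
    """Return the action of the first matching rule, or None."""
    # System facts are merged last, so they overwrite caller-supplied keys.
    facts = {**caller_context, **system_facts}
    for when, then in rules:
        if matches(when, facts):
            return then
    return None
```

&lt;p&gt;With these rules, &lt;code&gt;decide(RULES, {"plan": "enterprise", "utilization": 0.0}, {"utilization": 0.8})&lt;/code&gt; ignores the caller's claimed utilization and stops at the second rule.&lt;/p&gt;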

&lt;p&gt;When a &lt;code&gt;block&lt;/code&gt; rule fires, baar raises &lt;code&gt;PolicyViolation&lt;/code&gt;, distinct from &lt;code&gt;BudgetExhausted&lt;/code&gt;. Both carry a &lt;code&gt;facts&lt;/code&gt; dict with exactly which rule matched and why.&lt;/p&gt;

&lt;h3&gt;
  
  
  The audit log
&lt;/h3&gt;

&lt;p&gt;Every Hermes turn is logged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Step &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;step_num&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;6.0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;ms | &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step  1 | gpt-4o-mini          | $0.000023 |   412ms | complexity=0.31 → small
Step  2 | gpt-4o               | $0.000891 |  1823ms | complexity=0.78 → big
Step  3 | gpt-4o-mini          | $0.000019 |   388ms | complexity=0.28 → small
Step  4 | gpt-4o-mini          | $0.000021 |   401ms | [POLICY FORCE_SMALL] complexity=0.71
...
Total: $0.003847 of $1.00 (0.38% used)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;forced_by_budget&lt;/code&gt; on each step tells you whether the model downgrade was a budget constraint or a policy decision.&lt;/p&gt;

&lt;h3&gt;
  
  
  One more thing: a supply chain issue we caught mid-build
&lt;/h3&gt;

&lt;p&gt;While shipping v0.7.0 we found CVE-2026-33634, a supply chain compromise in &lt;code&gt;litellm==1.82.7&lt;/code&gt; and &lt;code&gt;1.82.8&lt;/code&gt;. Since baar-core depends on LiteLLM, an install without a guard could resolve to one of the compromised versions.&lt;/p&gt;

&lt;p&gt;Two defences: an install-time constraint (&lt;code&gt;!=1.82.7,!=1.82.8&lt;/code&gt;) so pip never resolves to those versions, and a runtime check that raises at &lt;code&gt;BAARRouter&lt;/code&gt; construction if the bad version is already installed. If you have it, baar won't start.&lt;/p&gt;
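&lt;p&gt;The runtime half of that defence can be approximated with the standard library's &lt;code&gt;importlib.metadata&lt;/code&gt;. This is a sketch of the idea only. &lt;code&gt;check_version&lt;/code&gt; and &lt;code&gt;guard_litellm&lt;/code&gt; are hypothetical names, and the real check in baar-core may differ.&lt;/p&gt;

```python
from importlib import metadata

# The two releases named in the post.
COMPROMISED = {"1.82.7", "1.82.8"}


def check_version(installed: str) -> None:
    """Raise if `installed` is one of the compromised releases."""
    if installed in COMPROMISED:
        raise RuntimeError(
            f"litellm {installed} is a known-compromised release; "
            "install a different version before continuing"
        )


def guard_litellm() -> None:
    """Look up the installed litellm version, if any, and check it."""
    try:
        installed = metadata.version("litellm")
    except metadata.PackageNotFoundError:
        return  # nothing installed, so nothing to block
    check_version(installed)
```

&lt;p&gt;Calling &lt;code&gt;guard_litellm()&lt;/code&gt; at construction time covers environments where the bad version was installed before the install-time constraint existed.&lt;/p&gt;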




&lt;p&gt;The $47 bill was the useful part of that Tuesday. Turns out "iterate until done" is not a plan when you're paying per iteration and you're asleep.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/orvi2014/Baar-Core" rel="noopener noreferrer"&gt;github.com/orvi2014/Baar-Core&lt;/a&gt; — &lt;code&gt;pip install baar-core[vercel] hermes-agent&lt;/code&gt;&lt;/p&gt;

</description>
      <category>hermesagentchallenge</category>
      <category>devchallenge</category>
      <category>agents</category>
      <category>python</category>
    </item>
    <item>
      <title>I Use AI to Build. I Don't Let It Think for Me.</title>
      <dc:creator>Orvi Das</dc:creator>
      <pubDate>Sun, 17 May 2026 00:45:01 +0000</pubDate>
      <link>https://dev.to/robat_das_3c6e956212f6408/i-use-ai-to-build-i-dont-let-it-think-for-me-2mag</link>
      <guid>https://dev.to/robat_das_3c6e956212f6408/i-use-ai-to-build-i-dont-let-it-think-for-me-2mag</guid>
      <description>&lt;p&gt;I have been building software with AI tools for about two years now. I ship with Claude Code every day. And I still do not vibe code. Here is why that distinction matters more than people think.&lt;/p&gt;




&lt;p&gt;The first time I watched someone vibe code, I felt the same thing I feel watching someone drive with their knees. Technically possible. Impressive for about thirty seconds. And then just a matter of time.&lt;/p&gt;

&lt;p&gt;I use AI every single day. I am not writing this from some purist position where I compile my own tools and distrust anything generated by a machine. I use Claude to write boilerplate, draft logic, suggest patterns I would have spent an hour looking up. It has made me faster in the ways that were boring to be slow in.&lt;/p&gt;

&lt;p&gt;But I do not let it think for me.&lt;/p&gt;

&lt;p&gt;There is a difference and it matters more than almost anything else I have learned in the last two years of building with these tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Vibe Coding Actually Is
&lt;/h2&gt;

&lt;p&gt;Vibe coding is not just "using AI to help write code." That is a category error that people make to either defend or attack it. Using AI to help write code is just programming now. That is what the tools are for.&lt;/p&gt;

&lt;p&gt;Vibe coding is something specific: it is prompting without understanding, accepting without reading, and shipping without testing. It is the workflow where the developer's job becomes describing what they want and clicking approve.&lt;/p&gt;

&lt;p&gt;The output looks like software. It passes the smell test. It runs. And then, three weeks later, something quietly breaks in production and you spend an afternoon staring at code you do not actually understand, written by a model that does not remember writing it.&lt;/p&gt;

&lt;p&gt;I have seen this happen to smart people. I have caught myself starting to do it on late nights when I was tired and the model sounded confident. It is seductive because the short loop feels productive. You say a thing, the code appears, it works. The feedback is immediate and positive.&lt;/p&gt;

&lt;p&gt;The cost is invisible until it is not.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Line I Draw
&lt;/h2&gt;

&lt;p&gt;My workflow goes in one direction: I understand first, then I use AI to execute faster.&lt;/p&gt;

&lt;p&gt;That means I write the unit test before I ask the model for the implementation. Not because I am rigorous by nature — I am not — but because writing the test forces me to know what I actually want. It forces me to think about edge cases before I have code that creates attachment to a specific approach. It forces me to have a definition of done that exists outside my head.&lt;/p&gt;

&lt;p&gt;When I hand that context to the model, the output is different. Not because the model is smarter (it is the same model), but because I am asking a specific, bounded question instead of a vague, open-ended one. The difference between "&lt;strong&gt;write me a function that handles payments&lt;/strong&gt;" and "&lt;strong&gt;write me a function that takes a Stripe webhook payload, validates the signature, extracts the event type, and returns a typed result with this shape&lt;/strong&gt;" is the difference between code that kind of works and code that actually does the thing.&lt;/p&gt;

&lt;p&gt;Then I read what comes back. All of it. Even when it is long. Especially when it is long.&lt;/p&gt;

&lt;p&gt;This sounds obvious. It is practiced far less often than that suggests.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AI Is Actually Good At
&lt;/h2&gt;

&lt;p&gt;The honest list of where AI makes me dramatically faster:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Boilerplate that I know the shape of.&lt;/strong&gt; If I know I need a repository pattern with these five methods, I can describe it precisely and get it in thirty seconds instead of fifteen minutes. I understand what it should look like before I ask. The AI just types faster than I do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Surfacing options I had not thought of.&lt;/strong&gt; I will describe a problem and ask what patterns exist for solving it, not for the model to pick one, but so I have a more complete menu. Then I decide. The model does not know my codebase, my constraints, or my risk tolerance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Catching things I missed.&lt;/strong&gt; After I write something, I ask the model to review it — specifically to look for edge cases, error paths I did not handle, security issues I glossed over. It finds real things. Not always, but often enough that I have made it a habit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Writing tests for logic I just wrote.&lt;/strong&gt; Once the implementation is done and I understand it, I will have the model write additional test cases. It is good at thinking of inputs I did not try.&lt;/p&gt;

&lt;p&gt;What it is not good at: deciding what to build, deciding how to architect something that will need to scale, or writing code I am not equipped to review. When I catch myself in that last situation, I stop and learn the thing first.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Senior Developer Problem
&lt;/h2&gt;

&lt;p&gt;There is a version of this conversation that gets framed as: AI will replace junior developers but senior developers are safe because they can guide it.&lt;/p&gt;

&lt;p&gt;I do not think this is quite right, and I think believing it creates a complacency that is more dangerous than the replacement question.&lt;/p&gt;

&lt;p&gt;The thing that makes a senior developer valuable is not primarily the ability to generate correct code. It is the ability to know which code should not exist, which abstractions will turn into debt, which requirements are wrong before you build them. That judgment comes from having been wrong enough times to develop taste.&lt;/p&gt;

&lt;p&gt;AI does not have taste. It has pattern completion. It will write you a technically correct solution to the wrong problem with the same confidence it writes a technically correct solution to the right one. It cannot tell the difference.&lt;/p&gt;

&lt;p&gt;If you are not developing the judgment — because you are outsourcing the thinking to the model — you are not building toward senior. You are extending the period where you do not yet know what you do not know.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Is Not About Being Anti-AI
&lt;/h2&gt;

&lt;p&gt;I am not arguing for slowing down or using fewer tools. I am arguing for staying in the driver's seat of your own work.&lt;/p&gt;

&lt;p&gt;The people I have watched get the most out of these tools are the ones who get more done, not the ones who get more generated. The difference is that they know what done means before they start, and they verify they reached it before they ship.&lt;/p&gt;

&lt;p&gt;I write unit tests first. Then integration tests against real systems, not mocks — I learned the hard way that mocked tests can pass while the actual integration is broken. Then end-to-end tests with Playwright for the paths users actually take. And I read everything the model gives me before I commit it.&lt;/p&gt;

&lt;p&gt;That workflow is slower than vibe coding for the first hour. It is faster than vibe coding over any meaningful timescale.&lt;/p&gt;

&lt;p&gt;The AI handles the typing. I stay responsible for the thinking.&lt;/p&gt;

&lt;p&gt;That is the only arrangement I trust.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>automation</category>
    </item>
  </channel>
</rss>
