<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: amabito</title>
    <description>The latest articles on DEV Community by amabito (@amabito).</description>
    <link>https://dev.to/amabito</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3786711%2Ffc780c0a-3824-4907-960c-13444c988bef.png</url>
      <title>DEV Community: amabito</title>
      <link>https://dev.to/amabito</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/amabito"/>
    <language>en</language>
    <item>
      <title>The $0.64 bug: how nested retries silently multiply your LLM costs</title>
      <dc:creator>amabito</dc:creator>
      <pubDate>Mon, 09 Mar 2026 09:08:12 +0000</pubDate>
      <link>https://dev.to/amabito/the-064-bug-how-nested-retries-silently-multiply-your-llm-costs-3g0p</link>
      <guid>https://dev.to/amabito/the-064-bug-how-nested-retries-silently-multiply-your-llm-costs-3g0p</guid>
      <description>&lt;p&gt;One user click. One document. My LangChain agent made 64 API calls to GPT-4o before it finally returned a result.&lt;/p&gt;

&lt;p&gt;At typical GPT-4o pricing, that turns a one-cent task into a sixty-four-cent task. With longer prompts, worse.&lt;/p&gt;

&lt;p&gt;The agent wasn't broken. The bug was in how the retries &lt;em&gt;multiply&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem: retries stack, and nobody tracks the total
&lt;/h2&gt;

&lt;p&gt;This pattern shows up in most LLM agent stacks I've looked at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your application code        retries 3 times on failure
  calls a LangChain chain    retries 3 times on failure
    which calls a tool        retries 3 times on failure
      which calls the LLM API
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer is reasonable on its own. 3 retries is a perfectly normal default.&lt;/p&gt;

&lt;p&gt;When the LLM returns a transient error, the innermost layer retries 3 times and gives up. The middle layer sees that failure and retries -- and each of its retries re-runs the whole inner sequence. The outer layer does the same to the middle one.&lt;/p&gt;

&lt;p&gt;Worst case: &lt;strong&gt;4 x 4 x 4 = 64 API calls from a single user action.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;(Each layer makes 1 initial attempt + 3 retries = 4 attempts. Three layers: &lt;code&gt;4^3 = 64&lt;/code&gt;.)&lt;/p&gt;

&lt;p&gt;Nobody in the stack tracks the &lt;em&gt;total&lt;/em&gt; retry count. Each layer only knows about its own attempts. I built &lt;a href="https://github.com/amabito/veronica-core" rel="noopener noreferrer"&gt;veronica-core&lt;/a&gt; to fix this -- a run-level budget that sits across all layers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The math
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Retry layers&lt;/th&gt;
&lt;th&gt;Retries per layer&lt;/th&gt;
&lt;th&gt;Worst-case calls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;216&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is exponential, not linear. Adding one more retry layer doesn't add 3 calls -- it multiplies the total by 4.&lt;/p&gt;
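&lt;p&gt;The table is a single formula. A quick sanity check, using the same convention as above (1 initial attempt + N retries per layer):&lt;/p&gt;

```python
def worst_case_calls(layers, retries_per_layer):
    # Each layer makes 1 initial attempt + N retries = N + 1 attempts.
    # Layers multiply: (N + 1) ** layers API calls in the worst case.
    return (retries_per_layer + 1) ** layers

print(worst_case_calls(2, 3))  # 16
print(worst_case_calls(3, 3))  # 64
print(worst_case_calls(4, 3))  # 256
print(worst_case_calls(3, 5))  # 216
```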

&lt;h2&gt;
  
  
  What this costs
&lt;/h2&gt;

&lt;p&gt;GPT-4o at $2.50/1M input + $10.00/1M output tokens. A typical 2K-token agent step with a 500-token response costs about $0.01.&lt;/p&gt;
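&lt;p&gt;That per-call figure is straightforward token math:&lt;/p&gt;

```python
INPUT_PRICE_PER_M = 2.50    # GPT-4o input, USD per 1M tokens
OUTPUT_PRICE_PER_M = 10.00  # GPT-4o output, USD per 1M tokens

def call_cost(input_tokens, output_tokens):
    # Cost of a single API call at the prices above.
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000

cost = call_cost(2_000, 500)   # a typical agent step
print(round(cost, 4))          # 0.01
print(round(cost * 64, 2))     # 0.64 -- the 3-layer worst case
```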

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Calls&lt;/th&gt;
&lt;th&gt;Cost per request&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No retries needed&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;$0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3-layer retry, worst case&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;$0.64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4-layer retry, worst case&lt;/td&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;$2.56&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1000 users/day, 3-layer worst case&lt;/td&gt;
&lt;td&gt;64,000&lt;/td&gt;
&lt;td&gt;$640/day&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most of the time you won't hit worst case. But you will hit partial amplification regularly -- 8-12 calls where 1-2 would suffice. That's a steady 4-6x cost multiplier that shows up as "the API is expensive" rather than "our retry logic is broken."&lt;/p&gt;

&lt;h2&gt;
  
  
  Why &lt;code&gt;max_iterations&lt;/code&gt; doesn't fix this
&lt;/h2&gt;

&lt;p&gt;Most agent frameworks have some form of step or iteration limit. LangChain has &lt;code&gt;max_iterations&lt;/code&gt;, others have conversation turn caps or loop counters. These limit how many &lt;em&gt;steps&lt;/em&gt; your agent takes, not how many API calls happen underneath.&lt;/p&gt;

&lt;p&gt;If an agent has &lt;code&gt;max_iterations=10&lt;/code&gt; and each iteration retries 3 times internally, you can still get 40 API calls. The step counter doesn't see the retries.&lt;/p&gt;

&lt;p&gt;These are step limits, not cost limits. None of them track how much money the run has spent.&lt;/p&gt;
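&lt;p&gt;A toy illustration of the blind spot -- not LangChain itself, just the shape of the problem. The step counter caps iterations while an inner retry loop quietly multiplies calls underneath it:&lt;/p&gt;

```python
api_calls = 0

def flaky_llm_call():
    # Always fails, to show the worst case.
    global api_calls
    api_calls += 1
    raise RuntimeError("API timeout")

def step_with_retries(max_retries=3):
    # 1 initial attempt + max_retries retries per agent step.
    for attempt in range(max_retries + 1):
        try:
            return flaky_llm_call()
        except RuntimeError:
            pass  # give up silently and let the agent take its next step

max_iterations = 10          # the only limit the framework sees
for _ in range(max_iterations):
    step_with_retries()

print(api_calls)  # 40 -- the step counter saw 10, the API saw 40
```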

&lt;h2&gt;
  
  
  Before: no containment
&lt;/h2&gt;

&lt;p&gt;This example is intentionally simplified. In real LangChain stacks, the same multiplication usually comes from a mix of provider retries, &lt;code&gt;tenacity&lt;/code&gt; decorators, tool wrappers, and chain-level retries -- which makes it hard to spot by reading any single file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Calls the LLM API. Fails sometimes.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;inner_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chain_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;inner_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;agent_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;chain_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# One user click. Up to 64 API calls. No total limit.
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;agent_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize this document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every layer is doing the right thing locally. But nobody tracks the total. No budget, no circuit breaker. If the API goes down for 30 seconds, this burns through 64 calls before giving up.&lt;/p&gt;
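&lt;p&gt;You can watch the multiplication happen by instrumenting a stripped-down version of the layers above with a call counter and an always-failing stub:&lt;/p&gt;

```python
calls = 0

def failing_llm():
    global calls
    calls += 1
    raise RuntimeError("API timeout")

def with_retries(fn, max_retries=3):
    # 1 initial attempt + max_retries retries, like each layer above.
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RuntimeError:
            if attempt == max_retries:
                raise

def tool_layer():
    return with_retries(failing_llm)

def chain_layer():
    return with_retries(tool_layer)

try:
    with_retries(chain_layer)   # application layer
except RuntimeError:
    pass

print(calls)  # 64
```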

&lt;h2&gt;
  
  
  After: chain-level containment
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;veronica_core.containment&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ExecutionContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ExecutionConfig&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;veronica_core.shield.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Decision&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ExecutionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;max_cost_usd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# Hard dollar ceiling for this run
&lt;/span&gt;    &lt;span class="n"&gt;max_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;# Max successful operations
&lt;/span&gt;    &lt;span class="n"&gt;max_retries_total&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# Total retries across ALL layers
&lt;/span&gt;    &lt;span class="n"&gt;timeout_ms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# 30-second wall clock limit
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;ExecutionContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Returns Decision.HALT if any limit is breached,
&lt;/span&gt;        &lt;span class="c1"&gt;# or the return value of call_llm() on success.
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wrap_llm_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;Decision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HALT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Limit breached. The LLM call was not dispatched,
&lt;/span&gt;            &lt;span class="c1"&gt;# so this blocked attempt adds no API cost.
&lt;/span&gt;            &lt;span class="n"&gt;snapshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_snapshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;degraded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;abort_reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cost_usd_accumulated&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What this does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;max_cost_usd=0.10&lt;/code&gt; -- the entire agent run cannot spend more than 10 cents, regardless of how many layers retry.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_retries_total=5&lt;/code&gt; -- total retries across all layers combined. Not per layer. Chain-wide.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_steps=20&lt;/code&gt; -- total successful API calls. Prevents infinite tool loops.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;timeout_ms=30_000&lt;/code&gt; -- wall-clock hard stop after 30 seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When any limit is hit, &lt;code&gt;wrap_llm_call()&lt;/code&gt; returns &lt;code&gt;Decision.HALT&lt;/code&gt; &lt;strong&gt;without dispatching the LLM call&lt;/strong&gt;. The blocked attempt itself adds no API cost.&lt;/p&gt;
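&lt;p&gt;If you want the core idea without the library, a run-level retry budget is conceptually just one counter shared by every layer. A minimal sketch (my simplification, not veronica-core's implementation):&lt;/p&gt;

```python
class RetryBudget:
    """One retry counter for the whole run, checked by every layer."""
    def __init__(self, max_retries_total=5):
        self.max_retries_total = max_retries_total
        self.retries_used = 0

    def allow_retry(self):
        if self.retries_used == self.max_retries_total:
            return False  # budget exhausted: stop retrying at every layer
        self.retries_used += 1
        return True

calls = 0

def failing_llm():
    global calls
    calls += 1
    raise RuntimeError("API timeout")

def layer(fn, budget):
    # Retries only while the *shared* budget allows it.
    while True:
        try:
            return fn()
        except RuntimeError:
            if not budget.allow_retry():
                raise

budget = RetryBudget(max_retries_total=5)

def tool():
    return layer(failing_llm, budget)

def chain():
    return layer(tool, budget)

try:
    layer(chain, budget)
except RuntimeError:
    pass

print(calls)  # 6 -- 1 initial attempt + 5 budgeted retries, chain-wide
```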

&lt;p&gt;On a stubbed call path (&lt;code&gt;benchmarks/bench_baseline_comparison.py&lt;/code&gt; in the repo), the full policy check averages around 11 microseconds. Typical LLM calls take 500-5000ms, so the containment overhead is negligible in practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before/after
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Worst-case calls (3-layer retry)&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;6 (1 + 5 retries)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost ceiling&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;$0.10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total retry tracking&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wall-clock timeout&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Behavior when API is down&lt;/td&gt;
&lt;td&gt;Burns 64 calls, then fails&lt;/td&gt;
&lt;td&gt;Burns 5 retries, then stops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code changes to agent logic&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The agent logic doesn't change. &lt;code&gt;ExecutionContext&lt;/code&gt; wraps the calls from the outside. Your retries still work -- they just can't exceed the chain budget.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this does not do
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;veronica-core&lt;/code&gt; is a cost and execution control library. It is not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;An output validator.&lt;/strong&gt; It doesn't check what the LLM says. Use Guardrails AI or NeMo Guardrails for that.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A content filter.&lt;/strong&gt; It doesn't block harmful outputs. That's a different problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A prompt engineering tool.&lt;/strong&gt; It doesn't modify your prompts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A framework.&lt;/strong&gt; It wraps your existing LLM calls. It doesn't replace your agent framework or custom loop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A latency optimizer.&lt;/strong&gt; It doesn't make calls faster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A fix for bad prompts.&lt;/strong&gt; If your agent loops because the prompt is wrong, that's a prompt problem. This just caps the damage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It controls &lt;em&gt;how many times&lt;/em&gt; your agent calls the API and &lt;em&gt;how much money&lt;/em&gt; it spends. That's it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;veronica-core
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python 3.10+. No required dependencies beyond the standard library.&lt;/p&gt;

&lt;p&gt;Optional extras:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;veronica-core[redis]   &lt;span class="c"&gt;# Distributed budget tracking (multi-process)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Source: &lt;a href="https://github.com/amabito/veronica-core" rel="noopener noreferrer"&gt;github.com/amabito/veronica-core&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's also a &lt;code&gt;BudgetEnforcer&lt;/code&gt; for standalone budget tracking, a &lt;code&gt;CircuitBreaker&lt;/code&gt; for failure isolation, and ASGI/WSGI middleware if you want per-request containment in a web app. The retry amplification example above is probably the simplest place to start.&lt;/p&gt;

&lt;h2&gt;
  
  
  You can reproduce the benchmark
&lt;/h2&gt;

&lt;p&gt;The benchmark script is in the repo. It uses stub LLM implementations -- no API keys, no network calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/amabito/veronica-core
&lt;span class="nb"&gt;cd &lt;/span&gt;veronica-core
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
python benchmarks/bench_retry_amplification.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The article uses the 64-call example because it matches the common "3 retries per layer" mental model: each layer makes 1 initial attempt + 3 retries = 4 attempts, so &lt;code&gt;4^3 = 64&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The benchmark in the repo uses a simpler always-failing stub with a &lt;code&gt;3 x 3 x 3&lt;/code&gt; retry loop, which produces 27 baseline calls. Same bug, different retry convention. The benchmark shows those 27 calls reduced to 3 contained calls with &lt;code&gt;max_retries_total=5&lt;/code&gt;.&lt;/p&gt;




&lt;p&gt;Retry amplification is not a new idea. What's missing in most LLM stacks is a hard budget that applies to the entire run, not just one call at a time.&lt;/p&gt;

&lt;p&gt;If you want to see the failure mode without spending real money, run &lt;code&gt;python benchmarks/bench_retry_amplification.py&lt;/code&gt; in the repo. No API key, no network calls, and it makes the bug obvious in a few seconds.&lt;/p&gt;

</description>
      <category>python</category>
      <category>llm</category>
      <category>langchain</category>
      <category>ai</category>
    </item>
    <item>
      <title>The $12K Weekend: What Nobody Tells You About LLM Agents in Production</title>
      <dc:creator>amabito</dc:creator>
      <pubDate>Mon, 23 Feb 2026 13:40:10 +0000</pubDate>
      <link>https://dev.to/amabito/the-12k-weekend-what-nobody-tells-you-about-llm-agents-in-production-2li3</link>
      <guid>https://dev.to/amabito/the-12k-weekend-what-nobody-tells-you-about-llm-agents-in-production-2li3</guid>
      <description>&lt;p&gt;An autonomous agent ran over a weekend. By Monday it had made 47,000 API calls.&lt;/p&gt;

&lt;p&gt;No one set a budget ceiling. No one enforced a retry limit. The agent hit a transient API error, retried, hit another, retried again — and kept going for 60 hours because nothing told it to stop.&lt;/p&gt;

&lt;p&gt;I spent the first hour convinced it was a billing bug.&lt;/p&gt;

&lt;p&gt;This isn't a one-off. Simon Willison has documented the pattern. The r/MachineLearning thread from January had 800 upvotes. The numbers vary — $3K, $8K, $12K — but the shape is always the same: retry loop, no ceiling, nobody home.&lt;/p&gt;




&lt;h2&gt;
  
  
  We tried the obvious things first
&lt;/h2&gt;

&lt;p&gt;The instinct is better observability. Set up cost alerts. Wire up a dashboard. These are good things — I'm not arguing against them.&lt;/p&gt;

&lt;p&gt;But an alert fires &lt;em&gt;after&lt;/em&gt; the call happens. The call has already consumed tokens. The money is already spent. You're getting a notification about something that's over.&lt;/p&gt;

&lt;p&gt;Then we tried retry libraries: &lt;code&gt;tenacity&lt;/code&gt;, &lt;code&gt;backoff&lt;/code&gt;. These handle transient failures fine — if a call fails, wait and retry. The problem is they have no concept of a dollar ceiling. And if your process crashes mid-run and auto-recovers, the retry counter resets to zero. You're back to the beginning.&lt;/p&gt;

&lt;p&gt;We spent two weeks on circuit breakers, which felt clever for a while. Trip the breaker, stop the runaway, done. Except: the breaker lives in process memory. Process dies, auto-recovery kicks in, breaker is gone. We'd solved the problem for the happy path and nothing else.&lt;/p&gt;

&lt;p&gt;Provider spend limits have a different issue — they're per-account, not per call chain. They don't propagate across sub-agents. Agent A has a $1.00 limit and spawns Agent B, which independently racks up $8. The provider limit never triggers because $9 total is nowhere near your account ceiling. Agent A never knew.&lt;/p&gt;

&lt;p&gt;The gap isn't subtle once you see it: nothing enforces bounded execution &lt;em&gt;before the call happens&lt;/em&gt;, in a way that survives process restarts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this is harder than it sounds
&lt;/h2&gt;

&lt;p&gt;LLM agents are probabilistic, cost-generating components inside systems expected to behave reliably. That's a hard contradiction and it doesn't resolve through better prompts or more careful orchestration — those operate at the wrong layer.&lt;/p&gt;

&lt;p&gt;The analogy I kept coming back to (and resisting, because it sounds grandiose) is operating systems. An OS doesn't know what your application is &lt;em&gt;doing&lt;/em&gt;. It just enforces the resource contract: this process gets X memory, Y CPU time, and when it's done, it's done. If the process tries to take more, the OS says no.&lt;/p&gt;

&lt;p&gt;LLM systems don't have that. Every agent is running on the honor system.&lt;/p&gt;

&lt;p&gt;What you actually need is something that enforces, at call time: this chain can spend at most $X, run for at most N steps, and if I crash and restart, those limits are still in effect. If I spawn a sub-agent, its costs count against my limit — not just its own.&lt;/p&gt;
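&lt;p&gt;In code, that contract looks something like this — a sketch of the idea, not the actual library API, with illustrative names throughout. The key property is that a child's spending debits every ancestor, so the parent's ceiling holds across sub-agents:&lt;/p&gt;

```python
class RunBudget:
    """Hierarchical spend ceiling: a child's spending also debits its parent."""
    def __init__(self, max_cost_usd, parent=None):
        self.max_cost_usd = max_cost_usd
        self.spent = 0.0
        self.parent = parent

    def try_spend(self, cost):
        # Check the whole ancestor chain first: a child can be under its
        # own ceiling while some parent is already exhausted.
        node = self
        while node is not None:
            if node.spent + cost > node.max_cost_usd:
                return False  # an ancestor would be over budget: refuse
            node = node.parent
        # All ceilings hold: record the spend at every level.
        node = self
        while node is not None:
            node.spent += cost
            node = node.parent
        return True

parent = RunBudget(max_cost_usd=1.00)
child = RunBudget(max_cost_usd=0.50, parent=parent)

print(child.try_spend(0.40))   # True  -- child at $0.40, parent at $0.40
print(parent.try_spend(0.55))  # True  -- parent at $0.95
print(child.try_spend(0.10))   # False -- child is fine, parent would pass $1.00
```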




&lt;h2&gt;
  
  
  What we built
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;veronica_core.containment&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ExecutionConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ExecutionContext&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;ExecutionContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ExecutionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;max_cost_usd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_retries_total&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;timeout_ms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wrap_llm_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_agent_step&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Returns Decision.HALT if any limit exceeded
&lt;/span&gt;    &lt;span class="c1"&gt;# fn is never called when halted — no network request, no spend
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;wrap_llm_call&lt;/code&gt; checks the budget &lt;em&gt;before&lt;/em&gt; calling &lt;code&gt;fn&lt;/code&gt;. If the ceiling is hit, it returns &lt;code&gt;Decision.HALT&lt;/code&gt; and never makes the network request. Nothing gets spent.&lt;/p&gt;
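&lt;p&gt;The shape of that check is easy to sketch in isolation. This isn't the library's internals, just the check-before-call pattern it implements; the names and cost parameters here are illustrative:&lt;/p&gt;

```python
from enum import Enum

class Decision(Enum):
    PROCEED = "proceed"
    HALT = "halt"

def wrap_call(fn, spent_usd, max_cost_usd, estimated_cost_usd):
    # The ceiling is tested before fn runs. A refused call makes
    # no network request and spends nothing.
    if spent_usd + estimated_cost_usd > max_cost_usd:
        return Decision.HALT, None
    return Decision.PROCEED, fn()

# $0.95 already spent against a $1.00 ceiling: a $0.10 call is refused
decision, result = wrap_call(lambda: "response", 0.95, 1.00, 0.10)
```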

&lt;p&gt;The multi-agent case is where this gets genuinely useful:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;ExecutionContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ExecutionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_cost_usd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.00&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;spawn_child&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_cost_usd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wrap_llm_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sub_agent_step&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# child spend counts against parent's $1.00
&lt;/span&gt;        &lt;span class="c1"&gt;# parent halts if cumulative cost exceeds $1.00
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Agent B has its own $0.50 sub-limit. But whatever B spends also comes off A's $1.00. A halts before the chain blows past $1.00 through a path nobody was watching.&lt;/p&gt;
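&lt;p&gt;The accounting behind that is a tree walk: a charge bubbles up through every ancestor, and any node over its ceiling halts. A toy model of the semantics (illustrative names, not the library's implementation):&lt;/p&gt;

```python
class BudgetNode:
    def __init__(self, max_cost_usd, parent=None):
        self.max_cost_usd = max_cost_usd
        self.parent = parent
        self.spent = 0.0

    def charge(self, cost):
        # Walk from this node up to the root; every ancestor sees the spend.
        node = self
        while node is not None:
            node.spent += cost
            if node.spent > node.max_cost_usd:
                return "HALT"
            node = node.parent
        return "PROCEED"

parent = BudgetNode(1.00)
child = BudgetNode(0.50, parent=parent)
child.charge(0.40)   # child at 0.40/0.50, parent at 0.40/1.00
child.charge(0.20)   # child would hit 0.60, over its 0.50 sub-limit: halts
```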

&lt;p&gt;&lt;strong&gt;The halt state problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We solved the circuit breaker issue (halt state disappearing on restart) with atomic disk writes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;veronica_core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VeronicaIntegration&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;veronica_core.state&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VeronicaState&lt;/span&gt;

&lt;span class="n"&gt;veronica&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VeronicaIntegration&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;veronica&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;VeronicaState&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SAFE_MODE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;operator halt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Writes atomically to disk (tmp → rename)
# Survives kill -9. Auto-recovery does NOT clear it.
# Requires explicit .state.transition(VeronicaState.IDLE, ...) to resume
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Write to tmp, rename to target — this survives &lt;code&gt;kill -9&lt;/code&gt; because the rename is atomic at the filesystem level. Auto-recovery doesn't clear SAFE_MODE. You put it in SAFE_MODE because something went wrong; you should have to explicitly decide to resume.&lt;/p&gt;
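&lt;p&gt;The pattern itself is plain Python, nothing library-specific. &lt;code&gt;os.replace&lt;/code&gt; is atomic on POSIX filesystems, so a reader sees either the old file or the new one, never a torn write:&lt;/p&gt;

```python
import json
import os
import tempfile

def atomic_write_json(path, payload):
    # Write to a temp file in the same directory (rename is only
    # atomic within one filesystem), fsync, then swap into place.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())  # data durable before the rename
        os.replace(tmp, path)     # atomic: old contents or new, never half
    except BaseException:
        os.unlink(tmp)
        raise
```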

&lt;p&gt;&lt;strong&gt;When you'd rather degrade than stop&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hard halts aren't always right. Sometimes you want the system to keep running at reduced capacity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 80% budget: downgrade to a cheaper model
# 85%: trim context
# 90%: rate limit between calls
# 100%: halt
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thresholds and model mappings are configurable.&lt;/p&gt;
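&lt;p&gt;The ladder reduces to a ratio check on each call. The cutoffs and action names below are illustrative, not the library's defaults:&lt;/p&gt;

```python
def degrade_action(spent_usd, max_cost_usd):
    # Check the most aggressive threshold first; below 80% nothing changes.
    ratio = spent_usd / max_cost_usd
    if ratio >= 1.00:
        return "halt"
    if ratio >= 0.90:
        return "rate_limit"
    if ratio >= 0.85:
        return "trim_context"
    if ratio >= 0.80:
        return "downgrade_model"   # e.g. switch to a cheaper model
    return "proceed"
```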

&lt;p&gt;&lt;strong&gt;Across processes&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ExecutionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_cost_usd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;10.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redis_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis://localhost:6379&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Workers share one budget ceiling via Redis INCRBYFLOAT
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
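&lt;p&gt;Redis's &lt;code&gt;INCRBYFLOAT&lt;/code&gt; adds to a key and returns the new total in one atomic step, which is what makes a shared ceiling race-free: each worker reserves its estimated cost, reads the post-increment total, and rolls back with a negative increment if it overshot. The same reserve-then-check semantics, with an in-memory stand-in for the Redis key:&lt;/p&gt;

```python
class SharedBudget:
    # In production self._spent would be a Redis key updated with
    # INCRBYFLOAT; a plain float stands in here to show the semantics.
    def __init__(self, max_cost_usd):
        self.max_cost_usd = max_cost_usd
        self._spent = 0.0

    def try_reserve(self, estimated_cost_usd):
        self._spent += estimated_cost_usd        # INCRBYFLOAT: add, read new total
        if self._spent > self.max_cost_usd:
            self._spent -= estimated_cost_usd    # negative INCRBYFLOAT: roll back
            return False
        return True

budget = SharedBudget(max_cost_usd=1.00)
budget.try_reserve(0.60)   # True: $0.60 of $1.00 reserved
budget.try_reserve(0.60)   # False: would overshoot, reservation rolled back
```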






&lt;h2&gt;
  
  
  Numbers
&lt;/h2&gt;

&lt;p&gt;This runs inside an autonomous trading system. 30 days continuous, 1,000+ ops/sec, 2.6M+ operations. During that time, 12 crashes — SIGTERM, SIGINT, one OOM kill. 100% recovery, no data loss.&lt;/p&gt;

&lt;p&gt;The destruction tests are reproducible. You don't have to take our word for it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/amabito/veronica-core
python scripts/proof_runner.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The suite covers SAFE_MODE persistence through &lt;code&gt;kill -9&lt;/code&gt;, budget ceiling enforcement, and child cost propagation. The tests pass or they don't.&lt;/p&gt;




&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;veronica-core
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Zero external dependencies for core. &lt;code&gt;opentelemetry-sdk&lt;/code&gt; optional for OTel export, &lt;code&gt;redis&lt;/code&gt; optional for cross-process budget. Works with LangChain, AutoGen, CrewAI, or whatever you're building. MIT.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/amabito/veronica-core" rel="noopener noreferrer"&gt;amabito/veronica-core&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;If you've hit something like this in production, I'm curious what the failure mode looked like and what you ended up doing about it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
