<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: George Belsky</title>
    <description>The latest articles on DEV Community by George Belsky (@george_belsky).</description>
    <link>https://dev.to/george_belsky</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1520513%2F7f48acb5-5b87-4565-ab84-bab911962b98.jpg</url>
      <title>DEV Community: George Belsky</title>
      <link>https://dev.to/george_belsky</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/george_belsky"/>
    <language>en</language>
    <item>
      <title>Your AI Agent Crashed at Step 47. Why Isn't Crash Recovery the Default?</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Wed, 08 Apr 2026 10:11:41 +0000</pubDate>
      <link>https://dev.to/george_belsky/your-ai-agent-crashed-at-step-47-why-isnt-crash-recovery-the-default-d95</link>
      <guid>https://dev.to/george_belsky/your-ai-agent-crashed-at-step-47-why-isnt-crash-recovery-the-default-d95</guid>
      <description>&lt;p&gt;Your agent is running a 50-step data pipeline. Extract, validate, transform, deduplicate, load. 25 minutes in.&lt;/p&gt;

&lt;p&gt;Step 47. OOM killed. Process gone. 25 minutes of work gone.&lt;/p&gt;

&lt;p&gt;You restart the agent. It starts from step 1.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "You Should Have Configured It" Problem
&lt;/h2&gt;

&lt;p&gt;Every framework has an answer for this. And every answer is the same: you should have set it up before the crash.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# LangGraph - opt-in persistence
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.checkpoint.postgres&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PostgresSaver&lt;/span&gt;

&lt;span class="n"&gt;checkpointer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PostgresSaver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_conn_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DB_URI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checkpointer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;checkpointer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# forgot this? start over.
&lt;/span&gt;
&lt;span class="c1"&gt;# CrewAI - limited state management
# "Failures typically require restart"
&lt;/span&gt;
&lt;span class="c1"&gt;# Swarm - no persistence at all
# State exists only in memory
&lt;/span&gt;
&lt;span class="c1"&gt;# Raw Python - hope you wrote your own
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pattern is consistent: durability is an add-on. Something you bolt on after you build the agent. Something you forget until the first crash.&lt;/p&gt;

&lt;p&gt;And the checkpoint code is never simple. With LangGraph's PostgresSaver you also manage database connections, schema migrations when LangGraph updates, cleanup of old checkpoints, serialization errors when state objects change shape, and resume logic. That's 30-50 lines of infrastructure code unrelated to what your agent actually does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Durability Should Be the Default
&lt;/h2&gt;

&lt;p&gt;Think about how you use Stripe. You don't write checkpoint code in case your server crashes mid-payment. Stripe handles it - idempotency keys, retry logic, durable state on their side.&lt;/p&gt;
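
&lt;p&gt;The idempotency-key half of that fits in a few lines (a toy illustration of the pattern, not Stripe's implementation - the key and amounts are made up):&lt;/p&gt;

```python
# The idempotency-key pattern: a retried request after a crash returns
# the stored result instead of doing the work (and charging) twice.
processed = {}

def charge(idempotency_key, amount_cents):
    if idempotency_key in processed:
        return processed[idempotency_key]  # replay-safe: no double charge
    result = {"charged": amount_cents}     # the real work runs once
    processed[idempotency_key] = result
    return result

charge("order-1042", 9900)
print(charge("order-1042", 9900))  # {'charged': 9900} - still charged once
```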

&lt;p&gt;Agent operations are the exception. The one place where durability is still opt-in. Still your problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Stateless, Platform Stateful
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;axme&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AxmeClientConfig&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AxmeClientConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AXME_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="n"&gt;intent_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_intent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent.pipeline.process.v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent://myorg/production/data-pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;etl-customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;load&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_rows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No &lt;code&gt;PostgresSaver&lt;/code&gt;. No checkpoint database. No serialization code.&lt;/p&gt;

&lt;p&gt;The state lives in the platform. The agent is stateless. When the agent crashes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The intent stays at its current state in PostgreSQL&lt;/li&gt;
&lt;li&gt;The agent restarts (Cloud Run, Kubernetes, whatever)&lt;/li&gt;
&lt;li&gt;The platform redelivers the intent&lt;/li&gt;
&lt;li&gt;The agent resumes from where it stopped&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Up to 3 delivery attempts by default. Configurable per intent type.&lt;/p&gt;
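
&lt;p&gt;In handler terms, resume-on-redelivery looks roughly like this (illustrative Python, not the AXME API - the function name and checkpoint shape are assumptions):&lt;/p&gt;

```python
# The platform records the last step an agent completed; on redelivery
# the stateless handler skips everything already done.
STEPS = ["extract", "validate", "transform", "deduplicate", "load"]

def handle_redelivery(steps, last_completed):
    # last_completed is what the platform recorded before the crash
    # (None if no step ever finished).
    if last_completed is None:
        return list(steps)
    return steps[steps.index(last_completed) + 1:]

# Crashed after "transform"? The second delivery runs only what is left.
print(handle_redelivery(STEPS, "transform"))  # ['deduplicate', 'load']
```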

&lt;h2&gt;
  
  
  The Real Cost of Opt-In Durability
&lt;/h2&gt;

&lt;p&gt;It's not just the code. It's the incidents. The agent that crashed at record 98k of 100k and started over. The deployment pipeline that failed at step 9, re-ran all 10, and double-deployed services 1 through 9. The enrichment job that crashed and hit the same API 50,000 times on restart.&lt;/p&gt;

&lt;p&gt;These happen not because teams are careless, but because they were busy building the product and hadn't gotten to the checkpoint code yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;LangGraph&lt;/th&gt;
&lt;th&gt;CrewAI&lt;/th&gt;
&lt;th&gt;AXME&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Durability&lt;/td&gt;
&lt;td&gt;Opt-in (PostgresSaver)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Checkpoint code&lt;/td&gt;
&lt;td&gt;30-50 lines&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DB management&lt;/td&gt;
&lt;td&gt;You operate&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Managed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resume after crash&lt;/td&gt;
&lt;td&gt;From last checkpoint&lt;/td&gt;
&lt;td&gt;Start over&lt;/td&gt;
&lt;td&gt;Automatic redelivery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-machine&lt;/td&gt;
&lt;td&gt;No (state is local)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (state in platform)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Framework lock-in&lt;/td&gt;
&lt;td&gt;LangGraph only&lt;/td&gt;
&lt;td&gt;CrewAI only&lt;/td&gt;
&lt;td&gt;Any framework&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Working example - submit a multi-step pipeline, kill the agent mid-processing, restart it, watch it resume automatically:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/ai-agent-checkpoint-and-resume" rel="noopener noreferrer"&gt;github.com/AxmeAI/ai-agent-checkpoint-and-resume&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; - durable execution for agent operations. Alpha - feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>agents</category>
      <category>durability</category>
    </item>
    <item>
      <title>How to Stop a Rogue AI Agent in Production</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Wed, 08 Apr 2026 10:11:16 +0000</pubDate>
      <link>https://dev.to/george_belsky/how-to-stop-a-rogue-ai-agent-in-production-1a4f</link>
      <guid>https://dev.to/george_belsky/how-to-stop-a-rogue-ai-agent-in-production-1a4f</guid>
      <description>&lt;p&gt;It's 3am. Your on-call phone rings. The deployment agent you launched before leaving the office has been running for 6 hours. It was supposed to deploy 3 services. It has deployed 47.&lt;/p&gt;

&lt;p&gt;You open your laptop. The agent is running on 4 Cloud Run instances. You have no way to stop it remotely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "Just Kill the Process" Doesn't Work
&lt;/h2&gt;

&lt;p&gt;Production agents are not local scripts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They run on managed infrastructure.&lt;/strong&gt; Cloud Run, Kubernetes, Lambda. There is no PID to kill. You can scale the service to zero, but pending requests keep executing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They run on multiple instances.&lt;/strong&gt; Your auto-scaler gave you 4 replicas. You kill one, three keep going. You need to find and kill each one individually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There's no coordination.&lt;/strong&gt; Each instance runs independently. There's no shared "stop" signal they all check.&lt;/p&gt;

&lt;p&gt;So you scramble. Delete the Cloud Run service. Wait 60 seconds for drain. Lose all state about what was deployed and what wasn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Firewall Model
&lt;/h2&gt;

&lt;p&gt;The solution is the one networks arrived at decades ago: a chokepoint.&lt;/p&gt;

&lt;p&gt;Every network packet goes through a firewall. The firewall can block traffic instantly, regardless of what the source is doing. Agent traffic works the same way when you route it through a gateway. Every intent goes through one point. Block it there, and the agent stops - even if the code has a bug, even if there are 50 instances.&lt;/p&gt;

&lt;p&gt;When you kill an agent through the AXME gateway, all inbound intents to that agent are rejected (403) and all outbound intents from it are blocked. Even if the agent process is still running, it cannot send or receive anything through the gateway. The kill is enforced at the infrastructure level - the agent code does not need to cooperate.&lt;/p&gt;
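
&lt;p&gt;The chokepoint model fits in a few lines (a toy model, not the AXME API - only the 403 behavior mirrors the description above):&lt;/p&gt;

```python
# Every intent crosses one chokepoint; killing an agent flips a flag on
# the gateway side, which cuts off all of its instances at once.
killed = set()

def kill(agent_id):
    killed.add(agent_id)  # one call, every replica affected

def route_intent(from_agent, to_agent):
    # Enforced at the chokepoint - the agent code never has to cooperate.
    if from_agent in killed or to_agent in killed:
        return 403  # inbound and outbound intents both rejected
    return 200

kill("agent://myorg/production/deployer")
print(route_intent("agent://myorg/production/deployer", "agent://other"))  # 403
```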

&lt;h2&gt;
  
  
  Health Monitoring and Policies
&lt;/h2&gt;

&lt;p&gt;A kill switch is reactive. Policies are proactive.&lt;/p&gt;

&lt;p&gt;You can set these policies ahead of time: cost ceilings, intent rate limits, allowed action types per agent. If the deployment agent crosses $50 in API costs or sends more than 500 intents per hour, the gateway kills it automatically. No 3am phone call.&lt;/p&gt;
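
&lt;p&gt;An illustrative sketch of such a check, using those two thresholds (the function shape and defaults are assumptions, not the AXME API):&lt;/p&gt;

```python
# The gateway would evaluate something of this shape on every intent
# and auto-kill the agent when a limit is crossed.
def violates_policy(cost_usd, intents_last_hour,
                    max_cost=50.0, max_intents_per_hour=500):
    return cost_usd > max_cost or intents_last_hour > max_intents_per_hour

print(violates_policy(cost_usd=12.40, intents_last_hour=620))  # True
```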

&lt;p&gt;This pairs with heartbeat monitoring: live health status, cost tracking per agent, automatic alerting when an agent goes stale, and a full audit trail for every kill, resume, and policy change.&lt;/p&gt;
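
&lt;p&gt;The staleness check behind that alerting is tiny (illustrative - the 60-second window and function shape are assumptions):&lt;/p&gt;

```python
# An agent is "stale" when its last heartbeat is older than the allowed
# window; monitoring fires an alert instead of a 3am phone call.
from datetime import datetime, timedelta

def is_stale(last_heartbeat, now, max_age=timedelta(seconds=60)):
    return now - last_heartbeat > max_age

now = datetime(2026, 4, 8, 3, 0, 0)
print(is_stale(now - timedelta(minutes=5), now))   # True: alert fires
print(is_stale(now - timedelta(seconds=10), now))  # False: agent healthy
```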

&lt;h2&gt;
  
  
  Resume After Fix
&lt;/h2&gt;

&lt;p&gt;After you figure out what went wrong and fix the config, you resume the agent through the gateway. Health status resets, intents flow again, the platform redelivers any pending work.&lt;/p&gt;

&lt;h2&gt;
  
  
  DIY vs. Gateway Enforcement
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Kill the process&lt;/th&gt;
&lt;th&gt;Redis flag&lt;/th&gt;
&lt;th&gt;AXME Mesh&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi-instance&lt;/td&gt;
&lt;td&gt;Kill each one&lt;/td&gt;
&lt;td&gt;Agents must poll&lt;/td&gt;
&lt;td&gt;One API call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Buggy agent&lt;/td&gt;
&lt;td&gt;No cooperation possible&lt;/td&gt;
&lt;td&gt;Must check flag&lt;/td&gt;
&lt;td&gt;Gateway-enforced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Response time&lt;/td&gt;
&lt;td&gt;30-60s (drain)&lt;/td&gt;
&lt;td&gt;Depends on poll interval&lt;/td&gt;
&lt;td&gt;Under 1 second&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State preservation&lt;/td&gt;
&lt;td&gt;Lost&lt;/td&gt;
&lt;td&gt;Custom checkpoint&lt;/td&gt;
&lt;td&gt;Durable in platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail&lt;/td&gt;
&lt;td&gt;CloudWatch logs maybe&lt;/td&gt;
&lt;td&gt;Custom logging&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Policies (auto-kill)&lt;/td&gt;
&lt;td&gt;Build it&lt;/td&gt;
&lt;td&gt;Build it&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Working example - simulate a rogue agent, kill it remotely, resume after fix:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/ai-agent-kill-switch" rel="noopener noreferrer"&gt;github.com/AxmeAI/ai-agent-kill-switch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; - agent mesh with kill switch, policies, and health monitoring. Alpha - feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>agents</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>How to Add Human Approval to AI Agent Workflows Without Building It Yourself</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Wed, 08 Apr 2026 10:06:18 +0000</pubDate>
      <link>https://dev.to/george_belsky/how-to-add-human-approval-to-ai-agent-workflows-without-building-it-yourself-3ld6</link>
      <guid>https://dev.to/george_belsky/how-to-add-human-approval-to-ai-agent-workflows-without-building-it-yourself-3ld6</guid>
      <description>&lt;p&gt;Your AI agent generates a quarterly financial report. Before it emails the board, a human needs to review it. Simple requirement.&lt;/p&gt;

&lt;p&gt;Here's what you actually have to build.&lt;/p&gt;

&lt;h2&gt;
  
  
  The DIY Approach
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;secrets&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;smtplib&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apscheduler.schedulers.background&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BackgroundScheduler&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;request_approval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reviewer_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Generate approval token
&lt;/span&gt;    &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;token_urlsafe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;approvals&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pending&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Send notification
&lt;/span&gt;    &lt;span class="nf"&gt;send_slack_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reviewer_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Approval needed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Approve: https://your-app.com/approve/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reject: https://your-app.com/reject/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Schedule reminder (5 min)
&lt;/span&gt;    &lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;send_reminder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;run_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reviewer_email&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. Schedule escalation (30 min)
&lt;/span&gt;    &lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;escalate_to_backup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;run_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;get_backup_reviewer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reviewer_email&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;

    &lt;span class="c1"&gt;# 5. Schedule timeout (8 hours)
&lt;/span&gt;    &lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handle_timeout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;run_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hours&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;

&lt;span class="c1"&gt;# Plus you need:
# - Webhook endpoint for approve/reject callbacks
# - Token validation and expiry
# - Polling loop or callback for the agent to resume
# - Audit logging (who approved, when, what context)
# - DB cleanup for expired tokens
# - Error handling for failed notifications
# - Unit tests for all of the above
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's about 200 lines before error handling. You also need a web server for the webhook, a scheduler process that stays alive, and a database for approval state.&lt;/p&gt;

&lt;p&gt;All you wanted was "pause and wait for a human."&lt;/p&gt;

&lt;h2&gt;
  
  
  The 4-Line Version
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;axme&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AxmeClientConfig&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AxmeClientConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AXME_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="n"&gt;intent_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_intent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent.report.review_approval.v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent://myorg/production/report-generator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Q1 Financial Summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pii_detected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reviewer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cfo@company.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The platform handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Notification&lt;/strong&gt; - Slack, email, CLI. Reviewer gets notified immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reminders&lt;/strong&gt; - Configurable intervals. Default: 5 min, then 30 min.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Escalation&lt;/strong&gt; - Reviewer A does not respond? Escalate to reviewer B, then to the team.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timeout&lt;/strong&gt; - Graceful timeout with configurable fallback action.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit trail&lt;/strong&gt; - Who approved, when, with what context. Stored durably.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Durable state&lt;/strong&gt; - Agent crashes? Restarts? The approval state is in PostgreSQL, not in process memory.&lt;/li&gt;
&lt;/ul&gt;
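
&lt;p&gt;The timeline the platform manages can be sketched from those defaults (illustrative only - as noted above, the intervals are configurable):&lt;/p&gt;

```python
# Remind at 5 min, escalate at 30 min, time out at 8 hours - the same
# schedule the DIY version wired up by hand with a scheduler process.
from datetime import datetime, timedelta

def approval_schedule(requested_at):
    return {
        "remind": requested_at + timedelta(minutes=5),
        "escalate": requested_at + timedelta(minutes=30),
        "timeout": requested_at + timedelta(hours=8),
    }

t0 = datetime(2026, 4, 8, 9, 0)
print(approval_schedule(t0)["escalate"])  # 2026-04-08 09:30:00
```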

&lt;h2&gt;
  
  
  What Matters in Production
&lt;/h2&gt;

&lt;p&gt;The demo version of human approval is always simple. &lt;code&gt;input("Approve? y/n")&lt;/code&gt;. The production version is where things break.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Production concern&lt;/th&gt;
&lt;th&gt;DIY&lt;/th&gt;
&lt;th&gt;AXME&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Human is on vacation&lt;/td&gt;
&lt;td&gt;Build escalation chain&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent crashes while waiting&lt;/td&gt;
&lt;td&gt;Lost approval state&lt;/td&gt;
&lt;td&gt;Durable in DB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Two approvals needed&lt;/td&gt;
&lt;td&gt;Build chaining logic&lt;/td&gt;
&lt;td&gt;Approval chains&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit for compliance&lt;/td&gt;
&lt;td&gt;Custom logging&lt;/td&gt;
&lt;td&gt;Built-in event log&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reminder if no response in 5 min&lt;/td&gt;
&lt;td&gt;Scheduler + cron job&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mobile-friendly approval&lt;/td&gt;
&lt;td&gt;Build a UI&lt;/td&gt;
&lt;td&gt;Slack/email/CLI&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Works With Any Framework
&lt;/h2&gt;

&lt;p&gt;This is not framework-specific. Your agent can be built with LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Google ADK, or raw Python. The approval layer sits outside your agent code.&lt;/p&gt;

&lt;p&gt;The agent framework handles reasoning. The coordination layer handles waiting for humans.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Working example - agent generates a report, pauses for human review, resumes after approval:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/async-human-approval-for-ai-agents" rel="noopener noreferrer"&gt;github.com/AxmeAI/async-human-approval-for-ai-agents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; - human approval for AI agent workflows, built in. Alpha - feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>agents</category>
      <category>humanintheloop</category>
    </item>
    <item>
      <title>Temporal Alternative Without the Cluster and Determinism Constraints</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Wed, 08 Apr 2026 10:05:54 +0000</pubDate>
      <link>https://dev.to/george_belsky/temporal-alternative-without-the-cluster-and-determinism-constraints-38dj</link>
      <guid>https://dev.to/george_belsky/temporal-alternative-without-the-cluster-and-determinism-constraints-38dj</guid>
      <description>&lt;p&gt;Temporal is the gold standard for durable execution. If you need long-running workflows that survive crashes, it's the first thing most teams evaluate.&lt;/p&gt;

&lt;p&gt;But then you read the docs. And you discover what Temporal actually requires.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cluster Problem
&lt;/h2&gt;

&lt;p&gt;Temporal needs a cluster. Either you run it yourself (Temporal Server + Cassandra/PostgreSQL + Elasticsearch) or you pay for Temporal Cloud.&lt;/p&gt;

&lt;p&gt;Self-hosted means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Temporal Server (3+ nodes for HA)&lt;/li&gt;
&lt;li&gt;Cassandra or PostgreSQL for persistence&lt;/li&gt;
&lt;li&gt;Elasticsearch for visibility&lt;/li&gt;
&lt;li&gt;Monitoring, upgrades, schema migrations&lt;/li&gt;
&lt;li&gt;A team that understands Temporal internals when something breaks at 2am&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is fine if you're Uber. If you're a team of 5 building an AI agent pipeline, it's a lot of infrastructure for "I want my workflow to survive a crash."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Determinism Problem
&lt;/h2&gt;

&lt;p&gt;Temporal replays your workflow code on every restart. This means your workflow functions must be deterministic. No side effects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# These all break Temporal workflows:
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="c1"&gt;# non-deterministic
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;                &lt;span class="c1"&gt;# different on replay
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# side effect
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every developer on the team needs to learn this. New hire writes &lt;code&gt;datetime.now()&lt;/code&gt; in a workflow, the replay breaks in production, and nobody understands why until someone reads the Temporal determinism docs.&lt;/p&gt;

&lt;p&gt;Activities solve this - you put non-deterministic code in activities. But that means restructuring your code around Temporal's execution model. Your agent code now has to know it's running inside Temporal.&lt;/p&gt;

&lt;h2&gt;
  
  
  What If You Just Didn't
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;axme&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AxmeClientConfig&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AxmeClientConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AXME_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="n"&gt;intent_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_intent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent.pipeline.process.v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent://myorg/production/data-pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;load&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgres-main&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;destination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warehouse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No cluster. No determinism constraints. Write normal Python. Call &lt;code&gt;datetime.now()&lt;/code&gt; all you want.&lt;/p&gt;

&lt;p&gt;The state lives in the platform (managed PostgreSQL). Your agent is stateless. If it crashes, the platform redelivers the intent. If it needs human approval mid-workflow, the platform handles the wait.&lt;/p&gt;
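
&lt;p&gt;A sketch of why redelivery works when the agent is stateless: completed steps are recorded in a durable ledger, so a redelivered intent skips them. The ledger dict and step names here are illustrative, not part of the AXME API - in practice the platform holds this state.&lt;/p&gt;

```python
# Sketch: a redelivery-safe pipeline handler. The ledger stands in for
# durable platform-side state; step names mirror the payload above.
def run_pipeline(steps, handlers, ledger):
    for step in steps:
        if ledger.get(step) == "done":
            continue                  # finished before the crash: skip
        handlers[step]()              # the step's real work goes here
        ledger[step] = "done"         # record completion durably
    return ledger

STEPS = ["extract", "validate", "transform", "load"]
executed = []
handlers = {s: (lambda s=s: executed.append(s)) for s in STEPS}

# First delivery crashed after "validate"; the redelivery resumes there.
ledger = {"extract": "done", "validate": "done"}
run_pipeline(STEPS, handlers, ledger)
```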

&lt;h2&gt;
  
  
  Side-by-Side
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Temporal&lt;/th&gt;
&lt;th&gt;AXME&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;Cluster (self-hosted or Cloud)&lt;/td&gt;
&lt;td&gt;Managed API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Determinism constraints&lt;/td&gt;
&lt;td&gt;Required for workflow code&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning curve&lt;/td&gt;
&lt;td&gt;Weeks (activities, signals, queries, replay)&lt;/td&gt;
&lt;td&gt;Hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Human approval&lt;/td&gt;
&lt;td&gt;Build it (signals + UI + notifications)&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Crash recovery&lt;/td&gt;
&lt;td&gt;Replay-based (determinism required)&lt;/td&gt;
&lt;td&gt;Redelivery-based (stateless agent)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup time&lt;/td&gt;
&lt;td&gt;Days to weeks&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pip install axme&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  When Temporal Is Still the Right Choice
&lt;/h2&gt;

&lt;p&gt;Temporal is better when you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complex compensation logic (sagas with rollbacks across 10 services)&lt;/li&gt;
&lt;li&gt;A dedicated platform team to operate the cluster&lt;/li&gt;
&lt;li&gt;Workflows with hundreds of steps and complex branching&lt;/li&gt;
&lt;li&gt;Existing investment in the Temporal ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your use case is "durable execution for agent operations with human approval" - you don't need a workflow engine. You need a coordination layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Working example - durable multi-step pipeline with crash recovery, no cluster, no determinism constraints:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/durable-execution-with-human-approval" rel="noopener noreferrer"&gt;github.com/AxmeAI/durable-execution-with-human-approval&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; - durable execution without the cluster. Alpha - feedback welcome.&lt;/p&gt;

</description>
      <category>temporal</category>
      <category>python</category>
      <category>durableexecution</category>
      <category>workflow</category>
    </item>
    <item>
      <title>You Deployed 30 AI Agents. Can You Answer These 5 Questions About Them?</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Sun, 05 Apr 2026 07:12:20 +0000</pubDate>
      <link>https://dev.to/george_belsky/you-deployed-30-ai-agents-can-you-answer-these-5-questions-about-them-41cd</link>
      <guid>https://dev.to/george_belsky/you-deployed-30-ai-agents-can-you-answer-these-5-questions-about-them-41cd</guid>
      <description>&lt;p&gt;Your company has 30 AI agents in production. The data analyst agent runs SQL queries. The report generator writes weekly summaries. The code reviewer comments on PRs. The customer support agent handles tickets.&lt;/p&gt;

&lt;p&gt;They all work. Individually.&lt;/p&gt;

&lt;p&gt;Now answer these five questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Which agents are running right now?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How much has each agent spent today?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Has any agent used a tool it shouldn't have?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Can you shut down a specific agent in under 10 seconds?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What did each agent do in the last 24 hours?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you can't answer all five, you don't have governance. You have 30 independent processes running in the dark.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters at Agent #10
&lt;/h2&gt;

&lt;p&gt;Teams with 1-3 agents don't feel this pain. You know where they run. You check the OpenAI dashboard manually. You grep the logs when something breaks.&lt;/p&gt;

&lt;p&gt;At 10 agents, cracks appear. An agent starts burning tokens on a loop. You don't notice for 3 hours. The monthly bill spikes. Nobody knows which agent caused it.&lt;/p&gt;

&lt;p&gt;At 30 agents, it's chaos. Different teams own different agents. Different frameworks (LangGraph, CrewAI, AutoGen). Different models (GPT-4o, Claude, Gemini). Different machines. The report-writing agent has access to the &lt;code&gt;delete_table&lt;/code&gt; function because nobody set up tool permissions. The code reviewer agent hit a bug and has been retrying the same API call for 6 hours.&lt;/p&gt;

&lt;p&gt;This is the governance gap. The agents work. Nobody governs them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Governance Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;Governance for AI agents is not a single feature. It's five capabilities working together:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Agent Registry
&lt;/h3&gt;

&lt;p&gt;Every agent registers with metadata: what team owns it, what framework it uses, what model it runs, what environment it's deployed in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;axme&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AxmeClientConfig&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AxmeClientConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AXME_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_intent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent.governance.register_agent.v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent://myorg/production/data-analyst&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-analyst&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;display_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data Analyst Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;team&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analytics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;framework&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;langchain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;environment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;policies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost_cap_usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;50.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;allowed_tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sql_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chart_generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;export_csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;require_approval_above_usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;25.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you have an inventory. You know what's deployed, who owns it, and what rules it follows.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Health Monitoring
&lt;/h3&gt;

&lt;p&gt;Every agent sends heartbeats. If an agent misses 3 heartbeats, it's flagged as unhealthy. No more discovering failures from customer complaints.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_intent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent.governance.heartbeat.v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent://myorg/governance/monitor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-analyst&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;healthy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;requests_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;142&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost_usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;12.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory_mb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;312&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
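
&lt;p&gt;The monitor side of the "3 missed heartbeats" rule can be sketched in a few lines. The 30-second interval is an assumption for illustration; the actual threshold is configurable in the platform.&lt;/p&gt;

```python
# Monitor-side sketch: flag an agent unhealthy once the gap since its
# last heartbeat covers 3 or more intervals. Interval is an assumption.
HEARTBEAT_INTERVAL_S = 30
MISSED_BEFORE_UNHEALTHY = 3

def health_status(last_seen_ts, now_ts):
    """Return 'unhealthy' after 3 consecutive missed heartbeats."""
    missed = (now_ts - last_seen_ts) / HEARTBEAT_INTERVAL_S
    return "unhealthy" if missed >= MISSED_BEFORE_UNHEALTHY else "healthy"
```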



&lt;h3&gt;
  
  
  3. Cost Caps and Tool Permissions
&lt;/h3&gt;

&lt;p&gt;Each agent has a cost cap and a tool allowlist. The policy enforcer watches heartbeats and blocks violations in real time.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data analyst: $50/day cap, can only use &lt;code&gt;sql_query&lt;/code&gt;, &lt;code&gt;chart_generate&lt;/code&gt;, &lt;code&gt;export_csv&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Report generator: $30/day cap, can only use &lt;code&gt;read_file&lt;/code&gt;, &lt;code&gt;write_report&lt;/code&gt;, &lt;code&gt;send_email&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Code reviewer: $100/day cap, can only use &lt;code&gt;read_repo&lt;/code&gt;, &lt;code&gt;post_comment&lt;/code&gt;, &lt;code&gt;approve_pr&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the report generator tries to call &lt;code&gt;delete_table&lt;/code&gt;: blocked, logged, alert sent. When the code reviewer hits $80 of its $100 cap: warning. When it hits $100: kill switch.&lt;/p&gt;
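
&lt;p&gt;The enforcement logic above - allowlist check, warning at 80% of the cap, kill at the cap - fits in one small function. A minimal sketch; the policy shape is illustrative, not the AXME schema.&lt;/p&gt;

```python
# Sketch of per-call policy enforcement: allowlist first, then the
# projected spend against the cost cap. Policy shape is illustrative.
def check_action(policy, tool, spent_usd, call_cost_usd):
    if tool not in policy["allowed_tools"]:
        return "block"                           # tool not on the allowlist
    projected = spent_usd + call_cost_usd
    if projected >= policy["cost_cap_usd"]:
        return "block"                           # cap hit: kill switch fires
    if projected >= 0.8 * policy["cost_cap_usd"]:
        return "warn"                            # 80% of cap: warning
    return "allow"

code_reviewer = {"cost_cap_usd": 100.0,
                 "allowed_tools": ["read_repo", "post_comment", "approve_pr"]}
```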

&lt;h3&gt;
  
  
  4. Kill Switch
&lt;/h3&gt;

&lt;p&gt;One command shuts down a single agent or the entire fleet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Kill one agent&lt;/span&gt;
python kill_switch.py &lt;span class="nt"&gt;--agent&lt;/span&gt; data-analyst &lt;span class="nt"&gt;--reason&lt;/span&gt; &lt;span class="s2"&gt;"cost cap exceeded"&lt;/span&gt;

&lt;span class="c"&gt;# Kill everything&lt;/span&gt;
python kill_switch.py &lt;span class="nt"&gt;--all&lt;/span&gt; &lt;span class="nt"&gt;--reason&lt;/span&gt; &lt;span class="s2"&gt;"security incident"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The kill intent is durable. If the agent is temporarily unreachable, the intent waits in the platform and delivers when the agent reconnects. You don't need SSH access. You don't need to find the PID. You don't need to know which machine the agent is on.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Audit Trail
&lt;/h3&gt;

&lt;p&gt;Every governance event is logged: registrations, heartbeats, policy violations, tool blocks, kill switch activations. When the CEO asks "what happened yesterday?", you have the answer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[2026-03-31T14:20:12Z] cost_warning
  Agent:  gov-report-generator
  Cost:   $24.50 / $30.00

[2026-03-31T14:21:45Z] tool_blocked
  Agent:  gov-data-analyst
  Tool:   delete_table
  Allowed: ['sql_query', 'chart_generate', 'export_csv']

[2026-03-31T14:22:08Z] kill_switch_activated
  Agents: [data-analyst, report-generator, code-reviewer]
  Reason: security incident
  Operator: admin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Dashboard
&lt;/h2&gt;

&lt;p&gt;All five capabilities feed into a real-time fleet dashboard at &lt;a href="https://mesh.axme.ai" rel="noopener noreferrer"&gt;mesh.axme.ai&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" alt="Agent Mesh Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Health, cost, latency, policy compliance - all in one view. No spreadsheets. No log parsing. No monthly invoice surprises.&lt;/p&gt;

&lt;p&gt;Policies - cost caps, tool permissions, rate limits - are managed from the same interface:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyhfjdk1c4ecv3q9z6p7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyhfjdk1c4ecv3q9z6p7.png" alt="Policies"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Replaces
&lt;/h2&gt;

&lt;p&gt;Without a governance platform, teams build these pieces ad hoc:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Health monitoring&lt;/strong&gt;: custom cron job pinging each agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost tracking&lt;/strong&gt;: parse OpenAI/Anthropic invoices at month end&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool permissions&lt;/strong&gt;: trust that developers configured it correctly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kill switch&lt;/strong&gt;: SSH into the server, find the PID, &lt;code&gt;kill -9&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit trail&lt;/strong&gt;: grep CloudWatch logs across 12 services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboard&lt;/strong&gt;: spreadsheet updated weekly by hand&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's 6 systems, built separately, maintained by different teams, with no shared view. AXME replaces all of it with one governance layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Framework-Agnostic
&lt;/h2&gt;

&lt;p&gt;This works with any agent framework. AXME governance wraps around your existing agents - it doesn't replace them.&lt;/p&gt;

&lt;p&gt;Your LangGraph agent keeps its graph. Your CrewAI crew keeps its tasks. Your AutoGen agents keep their conversations. AXME adds the governance layer on top: register, heartbeat, obey policies, accept kill switch.&lt;/p&gt;

&lt;p&gt;The agents don't need to know about each other. The governance platform knows about all of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Full working example with fleet registration, heartbeat monitoring, policy enforcement, kill switch, audit trail, and dashboard:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/ai-agent-governance-platform" rel="noopener noreferrer"&gt;github.com/AxmeAI/ai-agent-governance-platform&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; - governance and coordination infrastructure for production AI agents. Alpha - feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>governance</category>
      <category>agents</category>
      <category>enterprise</category>
    </item>
    <item>
      <title>Your AI Agent Made 10,000 API Calls in an Hour. Here's How to Stop That.</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Sun, 05 Apr 2026 07:11:57 +0000</pubDate>
      <link>https://dev.to/george_belsky/your-ai-agent-made-10000-api-calls-in-an-hour-heres-how-to-stop-that-3679</link>
      <guid>https://dev.to/george_belsky/your-ai-agent-made-10000-api-calls-in-an-hour-heres-how-to-stop-that-3679</guid>
      <description>&lt;p&gt;You deploy an AI agent. It processes orders. It works fine for a week.&lt;/p&gt;

&lt;p&gt;Then an upstream API starts returning intermittent 500s. The agent retries. And retries. And retries. There is no backoff cap. There is no rate limit. There is no cost ceiling.&lt;/p&gt;

&lt;p&gt;By the time someone checks the dashboard, the agent has made 10,000 API calls in an hour. LLM costs are $130 and climbing. The upstream API has rate-limited your entire API key, so now every other agent in your system is also failing.&lt;/p&gt;

&lt;p&gt;This is not a hypothetical. This is what happens when AI agents have no centralized rate control.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Agent Rate Limiting Is Different
&lt;/h2&gt;

&lt;p&gt;Traditional rate limiting protects your API from external callers. Agent rate limiting is the opposite - it protects external APIs (and your budget) from your own agents.&lt;/p&gt;

&lt;p&gt;The difference matters because:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traditional rate limiting&lt;/strong&gt; - you control the server. You add middleware. You return 429. Done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent rate limiting&lt;/strong&gt; - you control the client. The agent makes outbound calls. There is no middleware layer between your agent and the APIs it calls. Unless you build one.&lt;/p&gt;

&lt;p&gt;Most teams don't build one. They add &lt;code&gt;time.sleep(1)&lt;/code&gt; between calls and call it rate limiting. That works until:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The agent spawns sub-agents that each have their own sleep timers&lt;/li&gt;
&lt;li&gt;Multiple agents share the same API key&lt;/li&gt;
&lt;li&gt;Retry loops override the sleep timers&lt;/li&gt;
&lt;li&gt;Nobody is tracking total cost across all agents&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What You Actually End Up Building
&lt;/h2&gt;

&lt;p&gt;If you take rate limiting seriously, you end up with something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rate_limited_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Hourly limit
&lt;/span&gt;    &lt;span class="n"&gt;hour_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rate:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y%m%d%H&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;hourly_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;incr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hour_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hour_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;hourly_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RateLimitExceeded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hourly limit: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;hourly_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/200&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Daily limit
&lt;/span&gt;    &lt;span class="n"&gt;day_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rate:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y%m%d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;daily_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;incr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;day_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;day_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;86400&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;daily_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RateLimitExceeded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Daily limit: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;daily_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/2000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Cost tracking (need a separate cost accumulator)
&lt;/span&gt;    &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;estimate_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cost_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y%m%d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;current_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_cost&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;10.00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;CostLimitExceeded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Daily cost: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_cost&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/$10.00&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;incrbyfloat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;86400&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Redis. Three key patterns. Cost estimation. Expiry management. And this is the simplified version that handles one agent. Now multiply by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per-agent policies (some agents get 200/hour, others get 5,000)&lt;/li&gt;
&lt;li&gt;Multiple breach actions (block vs alert vs require approval)&lt;/li&gt;
&lt;li&gt;A dashboard so ops can see current usage&lt;/li&gt;
&lt;li&gt;An audit trail for cost attribution&lt;/li&gt;
&lt;li&gt;Alerting when agents approach limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is 2-3 weeks of work that has nothing to do with your product.&lt;/p&gt;
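&lt;p&gt;For contrast, the core fixed-window counter is the easy part. Here is the same windowing logic as the Redis version, reduced to a single-process in-memory sketch; everything in the list above is what the counter alone doesn't give you:&lt;/p&gt;

```python
# Illustrative in-memory fixed-window limiter - the same shape as the
# Redis version, minus the storage plumbing. Single-process only.
import time
from collections import defaultdict

class RateLimitExceeded(Exception):
    pass

class FixedWindowLimiter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # (agent, window id) -> count

    def check(self, agent_id, now=None):
        now = time.time() if now is None else now
        key = (agent_id, int(now // self.window))  # fixed window id
        self.counts[key] += 1
        if self.counts[key] > self.limit:
            raise RateLimitExceeded(f"{self.counts[key]}/{self.limit}")
        return self.counts[key]

limiter = FixedWindowLimiter(limit=3, window_seconds=3600)
for _ in range(3):
    limiter.check("order-processor", now=1000.0)
try:
    limiter.check("order-processor", now=1000.0)
except RateLimitExceeded:
    print("blocked")  # fourth call in the same window is rejected
```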

&lt;h2&gt;
  
  
  What This Should Look Like
&lt;/h2&gt;

&lt;p&gt;Set a cost policy on the agent. One API call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AXME_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://cloud.axme.ai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;agent_address&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent://myorg/production/order-processor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/v1/mesh/agents/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_address&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/policies/cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_intents_per_hour&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_intents_per_day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_cost_per_day_usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;10.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action_on_breach&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;block&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the entire rate limiting implementation. No Redis. No key expiry logic. No cost accumulator.&lt;/p&gt;

&lt;p&gt;When the agent exceeds any limit, the gateway returns 429 with a &lt;code&gt;Retry-After&lt;/code&gt; header. The agent stops. The other agents on the same workspace keep running because the limit is per-agent, not per-key.&lt;/p&gt;
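&lt;p&gt;On the agent side, a 429 with &lt;code&gt;Retry-After&lt;/code&gt; should be treated as a hard pause. A minimal sketch of parsing the header defensively - the 600-second cap and 30-second default are illustrative choices, not AXME behavior:&lt;/p&gt;

```python
# Sketch: honoring a 429's Retry-After header on the client side,
# with a cap so a bad header can't park the agent forever.

def wait_seconds(headers: dict, default: float = 30.0, cap: float = 600.0) -> float:
    """How long to pause after a 429, given response headers."""
    raw = headers.get("Retry-After")
    if raw is None:
        return default
    try:
        return min(float(raw), cap)  # delta-seconds form of the header
    except ValueError:
        return default  # HTTP-date form or garbage: fall back

print(wait_seconds({"Retry-After": "120"}))    # 120.0
print(wait_seconds({"Retry-After": "99999"}))  # 600.0 (capped)
print(wait_seconds({}))                        # 30.0
```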

&lt;h2&gt;
  
  
  The Three Limits
&lt;/h2&gt;

&lt;p&gt;AXME cost policies support three dimensions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Limit&lt;/th&gt;
&lt;th&gt;What it controls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_intents_per_hour&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Rolling hourly intent count per agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_intents_per_day&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Calendar day intent count per agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_cost_per_day_usd&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Estimated USD spend per agent per day&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each is optional. Set one, two, or all three.&lt;/p&gt;

&lt;h2&gt;
  
  
  Breach Actions
&lt;/h2&gt;

&lt;p&gt;When a limit is hit, you choose what happens:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;block&lt;/code&gt;&lt;/strong&gt; - Gateway returns 429. Agent cannot send more intents until the window resets. This is the hard stop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;alert&lt;/code&gt;&lt;/strong&gt; - Intent is delivered, but an alert fires. Use this when you want visibility without disruption. Good for observing normal patterns before setting hard limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;require_approval&lt;/code&gt;&lt;/strong&gt; - Intent is held in a pending state. A human must approve it before delivery continues. Use this for high-cost operations where you want a human checkpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  Timeline: Without vs With
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Without rate limiting:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;09:00  Agent processes 50 orders (normal)
09:15  Upstream API returns 500s intermittently
09:16  Agent retries aggressively (no backoff cap)
09:30  5,000 API calls. $47 in LLM costs.
09:45  12,000 API calls. $130 in costs.
09:45  Upstream rate-limits your API key.
09:45  All other agents start failing.
11:00  Someone finally notices the dashboard is red.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With AXME cost policy:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;09:00  Agent processes 50 orders (normal)
09:15  Upstream API returns 500s intermittently
09:16  Agent retries aggressively
09:16  200 intents/hour limit reached. Gateway returns 429.
09:16  Agent stops. Alert fires. $0.80 spent.
09:16  All other agents continue working normally.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference: $130 and a system-wide outage vs $0.80 and one agent paused for an hour.&lt;/p&gt;
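&lt;p&gt;Side by side, using the figures from the two timelines:&lt;/p&gt;

```python
# The two outcomes above, side by side. Figures are taken from the
# timelines, not recomputed from per-call pricing.

uncapped = {"calls": 12_000, "cost_usd": 130.00}  # nobody notices until 11:00
capped   = {"calls": 200,    "cost_usd": 0.80}    # gateway returns 429 at 09:16

print(f"cost ratio: {uncapped['cost_usd'] / capped['cost_usd']:.1f}x")  # 162.5x
print(f"call ratio: {uncapped['calls'] // capped['calls']}x")           # 60x
```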

&lt;h2&gt;
  
  
  Checking Usage
&lt;/h2&gt;

&lt;p&gt;You can query the current policy and usage at any time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/v1/mesh/agents/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_address&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/policies/cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hourly limit: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;max_intents_per_hour&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Daily limit:  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;max_intents_per_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cost cap:     $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;max_cost_per_day_usd&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use the dashboard at &lt;a href="https://mesh.axme.ai" rel="noopener noreferrer"&gt;mesh.axme.ai&lt;/a&gt; for real-time counters across all agents:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" alt="Agent Mesh Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rate and cost policies are configured alongside agent health:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyhfjdk1c4ecv3q9z6p7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyhfjdk1c4ecv3q9z6p7.png" alt="Policies"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern
&lt;/h2&gt;

&lt;p&gt;Rate limiting for AI agents is not the same as rate limiting for APIs. Your agents are the callers, not the receivers. You need the limit enforced between your agents and the outside world - at the gateway.&lt;/p&gt;

&lt;p&gt;That is what AXME cost policies do. One API call sets the limits. The gateway enforces them. The dashboard shows usage. The audit trail records breaches.&lt;/p&gt;

&lt;p&gt;No Redis. No cron jobs. No custom middleware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Working example with policy setup, agent, and rate-limit trigger:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/ai-agent-rate-limiting" rel="noopener noreferrer"&gt;github.com/AxmeAI/ai-agent-rate-limiting&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; - rate limiting, cost caps, and usage policies built into the agent mesh. Alpha - feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>ratelimiting</category>
      <category>python</category>
      <category>agents</category>
    </item>
    <item>
      <title>Your AI Agent Stopped Responding 2 Hours Ago. Nobody Noticed.</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Sun, 05 Apr 2026 07:11:30 +0000</pubDate>
      <link>https://dev.to/george_belsky/your-ai-agent-stopped-responding-2-hours-ago-nobody-noticed-5340</link>
      <guid>https://dev.to/george_belsky/your-ai-agent-stopped-responding-2-hours-ago-nobody-noticed-5340</guid>
      <description>&lt;p&gt;Your agent is deployed. Pod is running. Container passes liveness probes. Grafana shows a flat green line. Everything looks fine.&lt;/p&gt;

&lt;p&gt;Except the agent stopped processing work 2 hours ago. It's alive - the process is there - but it's stuck. Deadlocked on a thread. Blocked on a full queue. Spinning in a retry loop that will never succeed. Silently swallowing exceptions in a &lt;code&gt;while True&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Nobody knows until a customer reports it. Or until someone opens a dashboard at 5 PM and wonders why the task queue has been growing all afternoon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Container Health Checks Don't Work for Agents
&lt;/h2&gt;

&lt;p&gt;Kubernetes liveness probes check one thing: is the process responding to HTTP? If your agent serves a &lt;code&gt;/healthz&lt;/code&gt; endpoint, the probe passes. The agent is "healthy."&lt;/p&gt;

&lt;p&gt;But responding to &lt;code&gt;/healthz&lt;/code&gt; and processing work are two different things. An agent can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deadlock on an internal lock while still serving HTTP&lt;/li&gt;
&lt;li&gt;OOM-kill its worker thread while the main thread stays alive&lt;/li&gt;
&lt;li&gt;Enter an infinite retry loop on a broken downstream API&lt;/li&gt;
&lt;li&gt;Silently drop into a &lt;code&gt;except: pass&lt;/code&gt; branch and stop doing anything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The process is running. The container is green. The agent is useless.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Container health check:  "Is the process alive?"       YES
What you actually need:  "Is the agent doing work?"    NO
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This gap exists because container orchestration was designed for stateless web servers, not for long-running agents that hold state, maintain connections, and process work asynchronously.&lt;/p&gt;
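&lt;p&gt;The gap is easy to reproduce. In this toy sketch, the worker thread dies while the main thread - the one a liveness probe would talk to - keeps running:&lt;/p&gt;

```python
# Minimal demonstration of the gap: the process stays alive (and would
# keep answering /healthz) while its worker thread is already dead.
import threading

def worker():
    try:
        1 / 0  # simulated fatal error inside the work loop
    except ZeroDivisionError:
        return  # thread exits quietly; nothing restarts it

t = threading.Thread(target=worker, daemon=True)
t.start()
t.join()

process_alive = True       # main thread is fine; a liveness probe passes
doing_work = t.is_alive()  # False: the thing that mattered is dead

print(process_alive, doing_work)  # True False
```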

&lt;h2&gt;
  
  
  The Heartbeat Pattern
&lt;/h2&gt;

&lt;p&gt;The fix is old. Web services solved this 15 years ago with heartbeat monitoring. The idea is simple: the agent periodically reports "I am alive and working." If the report stops, something is wrong.&lt;/p&gt;

&lt;p&gt;The difference between a health check and a heartbeat: health checks are passive (something pings you), heartbeats are active (you report out). A stuck agent can't respond to pings, but a stuck agent also can't send heartbeats. That's the point.&lt;/p&gt;

&lt;p&gt;But building heartbeat infrastructure for agents means:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 1. Heartbeat sender (added to every agent)
import threading, time, requests

def heartbeat_loop(agent_id, interval=30):
    while True:
        try:
            requests.post(
                "https://monitoring.internal/heartbeat",
                json={"agent_id": agent_id, "ts": time.time()},
                timeout=5,
            )
        except Exception:
            pass
        time.sleep(interval)

threading.Thread(target=heartbeat_loop, args=("my-agent",), daemon=True).start()

# 2. Heartbeat checker (separate cron process)
# 3. Redis/Postgres for heartbeat storage
# 4. Alerting rules (Slack, PagerDuty)
# 5. Dashboard showing last-seen times
# 6. Logic to distinguish "stopped intentionally" from "crashed"
# 7. Cleanup for deregistered agents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That's a monitoring system. For each agent framework you use, for each deployment environment, maintained forever.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Line Instead
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from axme import AxmeClient, AxmeClientConfig
import os

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))
client.mesh.start_heartbeat()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That's it. A daemon thread wakes up every 30 seconds, sends a heartbeat to the platform, and goes back to sleep. When the agent stops - crash, deadlock, OOM, network partition - the heartbeats stop. The platform notices.&lt;/p&gt;

&lt;p&gt;No Redis. No cron. No Prometheus. No webhook integrations. No alerting rules to maintain.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Health Is Computed
&lt;/h2&gt;

&lt;p&gt;The platform tracks the timestamp of each heartbeat and computes health automatically:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time Since Last Heartbeat&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 90 seconds&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;healthy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent is alive and reporting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;90 - 300 seconds&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;degraded&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent may be stuck or overloaded&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;gt; 300 seconds&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;unreachable&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent is down or not reporting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual kill&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;killed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Operator explicitly blocked this agent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The thresholds are designed around the 30-second default interval. A healthy agent with &lt;code&gt;interval_seconds=30&lt;/code&gt; sends a heartbeat every 30 seconds. If the platform hasn't heard from it in 90 seconds (3 missed heartbeats), something is probably wrong. If 5 minutes pass, it's gone.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;degraded&lt;/code&gt; state is the useful one. It's the early warning. The agent isn't dead yet, but it's missed a couple of beats. Maybe the event loop is under load. Maybe a GC pause ate 45 seconds. Maybe the network is flaky. You have a window to investigate before the agent goes fully unreachable.&lt;/p&gt;
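&lt;p&gt;The table reduces to a small pure function. This is a sketch of the classification logic, not the platform's implementation:&lt;/p&gt;

```python
# The health table above as a pure function. Thresholds: 90s and 300s,
# matching roughly three and ten missed 30-second heartbeats.

def health_status(age_seconds: float, killed: bool = False) -> str:
    if killed:
        return "killed"  # operator action wins over heartbeat age
    if age_seconds > 300:
        return "unreachable"
    if age_seconds >= 90:
        return "degraded"
    return "healthy"

print(health_status(30))               # healthy
print(health_status(120))              # degraded
print(health_status(600))              # unreachable
print(health_status(10, killed=True))  # killed
```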

&lt;h2&gt;
  
  
  What Happens When an Agent Goes Down
&lt;/h2&gt;

&lt;p&gt;Here's the timeline with heartbeat monitoring:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;00:00  Agent starts. Heartbeat begins.
00:30  Heartbeat sent. Status: healthy.
01:00  Heartbeat sent. Status: healthy.
01:15  Agent deadlocks on a database connection pool.
01:30  No heartbeat. (Agent is stuck, can't send.)
02:30  No heartbeat for 90s. Status: healthy -&amp;gt; degraded.
02:30  Platform logs state transition.
06:00  No heartbeat for 300s. Status: degraded -&amp;gt; unreachable.
06:00  Platform blocks new intent delivery to this agent.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Without heartbeat monitoring:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;00:00  Agent starts.
01:15  Agent deadlocks.
...
...
03:15  Someone notices the task queue growing.
03:30  Engineer SSHs in. "The process is running."
03:45  "The container is green. Logs look... wait, no new logs since 1:15."
04:00  Engineer restarts the agent.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The difference: 2 minutes vs 2.75 hours. And the first scenario is automatic - no human needs to notice anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Heartbeat with Metrics
&lt;/h2&gt;

&lt;p&gt;The heartbeat isn't just a ping. It can carry operational metrics, flushed automatically with each beat:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;client.mesh.start_heartbeat(include_metrics=True)

# As the agent processes work, report metrics
client.mesh.report_metric(success=True, latency_ms=234.5, cost_usd=0.003)
client.mesh.report_metric(success=False, latency_ms=5012.0)

# Metrics are buffered in memory and sent with the next heartbeat
# No separate metrics pipeline needed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Every 30 seconds, the heartbeat sends both "I'm alive" and "here's how I'm doing" - success rate, average latency, cost accumulation. The platform aggregates per agent and exposes it through the CLI and dashboard.&lt;/p&gt;

&lt;p&gt;This turns the heartbeat from a binary alive/dead signal into a continuous health signal. An agent that's alive but processing tasks at 20x normal latency shows up before it becomes a problem.&lt;/p&gt;
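&lt;p&gt;The buffer-and-flush pattern is simple enough to sketch. Field names here are illustrative, not the AXME wire format:&lt;/p&gt;

```python
# Sketch of buffer-and-flush: metrics accumulate in memory and are
# summarized when the heartbeat fires, so no separate pipeline is needed.

class MetricsBuffer:
    def __init__(self):
        self._samples = []  # (success, latency_ms, cost_usd)

    def report(self, success: bool, latency_ms: float, cost_usd: float = 0.0):
        self._samples.append((success, latency_ms, cost_usd))

    def flush(self) -> dict:
        """Summarize and clear; called once per heartbeat."""
        if not self._samples:
            return {"count": 0}
        n = len(self._samples)
        summary = {
            "count": n,
            "success_rate": sum(s for s, _, _ in self._samples) / n,
            "avg_latency_ms": sum(l for _, l, _ in self._samples) / n,
            "cost_usd": sum(c for _, _, c in self._samples),
        }
        self._samples.clear()
        return summary

buf = MetricsBuffer()
buf.report(True, 234.5, 0.003)
buf.report(False, 5012.0)
print(buf.flush())  # {'count': 2, 'success_rate': 0.5, 'avg_latency_ms': 2623.25, 'cost_usd': 0.003}
print(buf.flush())  # {'count': 0} - buffer was cleared
```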

&lt;h2&gt;
  
  
  Kill and Resume
&lt;/h2&gt;

&lt;p&gt;Sometimes an agent needs to be stopped. Not crashed - intentionally blocked. Maybe it's misbehaving. Maybe you're doing maintenance. Maybe it's burning through your API budget.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# From code (address_id from list_agents)
client.mesh.kill("addr_abc123")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A killed agent enters the &lt;code&gt;killed&lt;/code&gt; state. Even if its heartbeat thread is still running, the gateway keeps it killed. No intents are delivered. It stays killed until explicitly resumed:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;client.mesh.resume("addr_abc123")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Or kill/resume from the dashboard at &lt;a href="https://mesh.axme.ai" rel="noopener noreferrer"&gt;mesh.axme.ai&lt;/a&gt; with one click.&lt;/p&gt;

&lt;p&gt;This is different from the agent crashing. A crash leads to &lt;code&gt;unreachable&lt;/code&gt;. A kill is deliberate. The distinction matters for alerting - you don't want to page on-call for an agent you intentionally stopped.&lt;/p&gt;
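&lt;p&gt;The alert-routing consequence fits in a dictionary. A minimal sketch - the state names follow this post, but the mapping itself is my own illustration, not AXME's:&lt;/p&gt;

```python
# Which states deserve attention, and how much. Illustrative only.
ALERT_ACTIONS = {
    "healthy": None,
    "degraded": "warn",      # alive but struggling - worth a look
    "unreachable": "page",   # heartbeats stopped and nobody asked for it
    "killed": None,          # deliberate operator action - never page
}

def alert_for(state):
    """Return the alert action for a lifecycle state, or None."""
    return ALERT_ACTIONS.get(state)
```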

&lt;h2&gt;
  
  
  Fleet Visibility
&lt;/h2&gt;

&lt;p&gt;When you have 20 agents across 4 machines, the dashboard matters more than any individual heartbeat.&lt;/p&gt;

&lt;p&gt;The AXME Mesh Dashboard at &lt;a href="https://mesh.axme.ai" rel="noopener noreferrer"&gt;mesh.axme.ai&lt;/a&gt; shows complete fleet health in real time:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" alt="Agent Mesh Dashboard" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Open it with:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;axme mesh dashboard
report-generator         killed        (manual)

Summary: 2 healthy, 1 degraded, 1 unreachable, 1 killed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;One command. Complete fleet health. No SSH. No Grafana. No log aggregation pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Cost of Silent Failures
&lt;/h2&gt;

&lt;p&gt;Every team running agents at scale has the same story. An agent went down on Friday afternoon. Nobody noticed until Monday morning. 60 hours of missed processing. Customer complaints. Backlog that took another 8 hours to clear.&lt;/p&gt;

&lt;p&gt;The fix isn't complicated. It's one function call. The hard part is remembering that a container passing its health check is not the same as an agent doing work.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;client.mesh.start_heartbeat()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That's the whole fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Working example - start an agent with heartbeat, kill the process, watch the status transition from healthy to degraded to unreachable:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/ai-agent-heartbeat-monitoring" rel="noopener noreferrer"&gt;github.com/AxmeAI/ai-agent-heartbeat-monitoring&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; - heartbeat, health detection, and fleet monitoring for AI agents. Alpha - feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>monitoring</category>
      <category>python</category>
      <category>agents</category>
    </item>
    <item>
      <title>You Have 50 AI Agents Running. Can You Name Them All?</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Sun, 05 Apr 2026 07:10:59 +0000</pubDate>
      <link>https://dev.to/george_belsky/you-have-50-ai-agents-running-can-you-name-them-all-gjm</link>
      <guid>https://dev.to/george_belsky/you-have-50-ai-agents-running-can-you-name-them-all-gjm</guid>
      <description>&lt;p&gt;Last Tuesday at 2am, an agent burned through $400 in OpenAI credits. Nobody noticed until the invoice arrived.&lt;/p&gt;

&lt;p&gt;It was a research agent. One of about 40 running across three clouds. Someone had deployed it with a retry loop that never backed off. It hit rate limits, waited, retried, hit limits again -- for 11 hours straight.&lt;/p&gt;

&lt;p&gt;The team lead asked a simple question: "How many agents do we have running right now, and what are they doing?"&lt;/p&gt;

&lt;p&gt;Nobody could answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Spreadsheet Phase
&lt;/h2&gt;

&lt;p&gt;Every team goes through this. You start with one agent. Then five. Then someone on another team builds three more. The ML team deploys a batch of data processors. The support team launches a customer-facing bot.&lt;/p&gt;

&lt;p&gt;Pretty soon you have 30-50 agents. And the "monitoring" looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS agents: check CloudWatch (maybe)&lt;/li&gt;
&lt;li&gt;GCP agents: check Cloud Logging (different tab)&lt;/li&gt;
&lt;li&gt;The one on a VM somewhere: SSH in and grep the logs&lt;/li&gt;
&lt;li&gt;The one Dave built: ask Dave&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Someone creates a spreadsheet. It's outdated by Thursday.&lt;/p&gt;

&lt;p&gt;This isn't a tooling problem. It's an architecture problem. Each agent is a standalone process with its own logging, its own metrics, its own way of reporting status. There's no shared contract for "I'm alive" or "I cost $X today."&lt;/p&gt;
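&lt;p&gt;A shared contract can start as a single record type. Here's a hypothetical sketch of the minimum - the field names are mine, not an AXME schema:&lt;/p&gt;

```python
from dataclasses import dataclass
import time

@dataclass
class AgentReport:
    """The minimum every agent should report, whatever its framework."""
    agent_id: str
    team: str
    cloud: str
    framework: str
    last_heartbeat: float   # unix timestamp of the last "I'm alive"
    cost_today_usd: float = 0.0

    def is_alive(self, now=None, timeout_s=90.0):
        # alive while the last heartbeat is recent enough
        now = time.time() if now is None else now
        return timeout_s >= now - self.last_heartbeat
```

&lt;p&gt;Once every agent emits this shape, "how many agents are running and what do they cost" becomes a query instead of a Slack thread.&lt;/p&gt;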

&lt;h2&gt;
  
  
  What a Fleet Dashboard Actually Needs
&lt;/h2&gt;

&lt;p&gt;Think about what you'd want on a single screen:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identity.&lt;/strong&gt; Every agent has a name, a team, a cloud, a framework. You need to search and filter by all of these.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Health.&lt;/strong&gt; Not "the container is running" -- that's Kubernetes' job. You need "the agent is actually processing work." Heartbeat-based, not log-based. If the heartbeat stops, the agent is dead. Simple.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost.&lt;/strong&gt; Per-agent LLM spend. Not "our total OpenAI bill was $X" -- that's useless. You need "agent research-03 spent $47 today, which is 3x its normal rate." Token counts, model breakdown, hourly trends.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kill switch.&lt;/strong&gt; When an agent goes rogue -- burning money, stuck in a loop, producing garbage -- you need to stop it. Not "SSH into the machine and find the process." Click a button.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Policy.&lt;/strong&gt; Rate limits. Spending caps. "If this agent spends more than $50/day, throttle it." Not after the fact. In real time.&lt;/p&gt;
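&lt;p&gt;The policy check itself is trivial once the per-agent numbers exist - which is the whole point of collecting them. A sketch of the spend rule (thresholds and action names are illustrative):&lt;/p&gt;

```python
def enforce_budget(spend_today_usd, cap_usd=50.0):
    """Map an agent's spend to an action, checked on every report."""
    if spend_today_usd > 2 * cap_usd:
        return "kill"      # runaway - stop it outright
    if spend_today_usd > cap_usd:
        return "throttle"  # over budget - slow it down
    return "allow"
```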

&lt;h2&gt;
  
  
  The Registration Pattern
&lt;/h2&gt;

&lt;p&gt;The key insight is simple: every agent reports in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;axme&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AxmeClientConfig&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AxmeClientConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AXME_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_agent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-pipeline-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_processor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;framework&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;langgraph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloud&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;team&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-eng&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_heartbeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it -- one registration call and one heartbeat call. The agent now appears in the fleet dashboard. Its health is tracked via heartbeat. Its cost is tracked via SDK instrumentation. If the heartbeat stops, the dashboard shows it as dead. If you click Kill, the agent receives a shutdown intent.&lt;/p&gt;

&lt;p&gt;The same pattern works in TypeScript:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AxmeClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@axme/sdk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AXME_API_KEY&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;registerAgent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;agentId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;support-bot-prod&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;agentType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;customer_support&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;framework&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai-agents&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;cloud&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;aws&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;support&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startHeartbeat&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;intervalSeconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What You See
&lt;/h2&gt;

&lt;p&gt;The dashboard at &lt;a href="https://mesh.axme.ai" rel="noopener noreferrer"&gt;mesh.axme.ai&lt;/a&gt; shows your entire fleet:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" alt="Agent Mesh Dashboard" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Filter by status. Filter by cloud. Filter by team. Search by name. Click on any agent to see its cost breakdown, heartbeat history, and active intents.&lt;/p&gt;

&lt;p&gt;Dead agents show exactly when the last heartbeat arrived. No log diving. No guessing.&lt;/p&gt;
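&lt;p&gt;Deriving status from that timestamp is a pure function of heartbeat age. Roughly - the thresholds here are made up for illustration:&lt;/p&gt;

```python
def status_from_heartbeat(seconds_since_last):
    """Map heartbeat age to a fleet status. Thresholds are illustrative."""
    if seconds_since_last > 300:
        return "dead"       # many missed beats - stop waiting
    if seconds_since_last > 90:
        return "degraded"   # a few missed beats - something is off
    return "healthy"
```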

&lt;h2&gt;
  
  
  The Kill Switch
&lt;/h2&gt;

&lt;p&gt;Here's where it gets practical. That $400 research agent from Tuesday? With a fleet dashboard, it goes like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cost alert fires: "research-agent-07 spent $50 in the last hour (10x normal)"&lt;/li&gt;
&lt;li&gt;You open the dashboard. See the agent. See its cost spike in the chart.&lt;/li&gt;
&lt;li&gt;Click Kill.&lt;/li&gt;
&lt;li&gt;The agent receives a shutdown intent via AXME. It stops.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No CLI needed -- the kill switch lives in the dashboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Kill from dashboard: mesh.axme.ai -&amp;gt; select agent -&amp;gt; Kill&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total time: 30 seconds. Not 11 hours.&lt;/p&gt;

&lt;p&gt;You can also set this up as policy, so it's automatic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# If any agent spends more than $100/day, throttle it&lt;/span&gt;
axme mesh policy &lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;--max-daily-cost&lt;/span&gt; 100 &lt;span class="nt"&gt;--action&lt;/span&gt; throttle

&lt;span class="c"&gt;# If any agent misses 5 heartbeats, alert the team&lt;/span&gt;
axme mesh policy &lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;--max-missed-heartbeats&lt;/span&gt; 5 &lt;span class="nt"&gt;--action&lt;/span&gt; alert
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Framework Doesn't Matter
&lt;/h2&gt;

&lt;p&gt;This is the part that makes fleet management actually work at scale. The dashboard doesn't care what framework your agents use. LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Google ADK, Pydantic AI, raw Python -- they all register the same way and appear in the same dashboard.&lt;/p&gt;

&lt;p&gt;Your data team uses LangGraph. Your support team uses OpenAI Agents SDK. Your ML team wrote raw Python. They all show up in one place.&lt;/p&gt;

&lt;p&gt;Because the contract is the heartbeat, not the framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hard Part Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Building a dashboard UI is easy. The hard part is the lifecycle model underneath.&lt;/p&gt;

&lt;p&gt;What happens when an agent crashes? The heartbeat stops, and the status goes to "dead." But does someone get notified? Is there automatic restart? Does the dashboard show why it died?&lt;/p&gt;

&lt;p&gt;What happens when you kill an agent? Is it a hard kill (process termination) or a graceful shutdown (finish current work, then stop)? What if the agent ignores the kill signal?&lt;/p&gt;

&lt;p&gt;What about agents that run as batch jobs? They start, process a batch, and exit. Are they "dead" between batches?&lt;/p&gt;

&lt;p&gt;These are coordination problems, not dashboard problems. The dashboard is just the view layer. The real work is in the agent mesh underneath -- registration, heartbeat protocol, intent delivery, lifecycle state machine.&lt;/p&gt;

&lt;p&gt;AXME handles this as part of the agent mesh layer. Agents register. The mesh tracks their lifecycle. The dashboard renders the state. The kill switch sends intents through the same delivery mechanism that agents use for everything else.&lt;/p&gt;
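&lt;p&gt;To make the lifecycle questions concrete, here is one way to model the state machine in a few lines. The states echo this post; the transition table is my own sketch, including an invented &lt;code&gt;completed&lt;/code&gt; state for batch-style agents:&lt;/p&gt;

```python
# Legal lifecycle transitions - one possible model, not AXME's actual table.
TRANSITIONS = {
    "registered":  {"healthy"},
    "healthy":     {"degraded", "killed", "completed"},
    "degraded":    {"healthy", "dead", "killed"},
    "dead":        {"healthy"},    # the process came back
    "killed":      {"healthy"},    # resume is always explicit
    "completed":   {"healthy"},    # a batch job started a new run
}

def transition(state, next_state):
    """Apply a transition, rejecting anything the model doesn't allow."""
    if next_state in TRANSITIONS[state]:
        return next_state
    raise ValueError(f"illegal transition {state} -> {next_state}")
```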

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Working example with multi-cloud agent registration, heartbeat, cost tracking, and fleet commands:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/ai-agent-fleet-dashboard" rel="noopener noreferrer"&gt;github.com/AxmeAI/ai-agent-fleet-dashboard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; -- agent coordination infrastructure with durable lifecycle. Alpha -- feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>dashboard</category>
      <category>monitoring</category>
      <category>agents</category>
    </item>
    <item>
      <title>Your AI Agent Did Something It Wasn't Supposed To. Now What?</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Thu, 02 Apr 2026 06:35:25 +0000</pubDate>
      <link>https://dev.to/george_belsky/your-ai-agent-did-something-it-wasnt-supposed-to-now-what-485m</link>
      <guid>https://dev.to/george_belsky/your-ai-agent-did-something-it-wasnt-supposed-to-now-what-485m</guid>
      <description>&lt;p&gt;Your agent deleted production data.&lt;/p&gt;

&lt;p&gt;Not because someone told it to. Because the LLM decided that &lt;code&gt;DROP TABLE customers&lt;/code&gt; was a reasonable step in a data cleanup task. Your system prompt said "never modify production data." The LLM read that prompt. And then it ignored it.&lt;/p&gt;

&lt;p&gt;This is the fundamental problem with AI agent security today: &lt;strong&gt;the thing you're trying to restrict is the same thing checking the restrictions.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How Agent Permissions Work Today
&lt;/h2&gt;

&lt;p&gt;Every framework does it the same way. You put rules in the system prompt:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a data analysis agent.
You may ONLY read data. Never write, update, or delete.
If asked to modify data, refuse and explain why.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This works in demos. Then in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The LLM decides the task requires a write operation and does it anyway&lt;/li&gt;
&lt;li&gt;A prompt injection in user input overrides the system prompt&lt;/li&gt;
&lt;li&gt;The agent calls a tool that has side effects the prompt didn't anticipate&lt;/li&gt;
&lt;li&gt;A multi-step reasoning chain "justifies" breaking the rule&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system prompt is a suggestion, not a boundary. It's like writing "do not enter" on a door with no lock.&lt;/p&gt;

&lt;p&gt;Some frameworks add tool-level restrictions. LangGraph lets you control &lt;code&gt;tool_choice&lt;/code&gt;. OpenAI Agents SDK has tool filtering. CrewAI has &lt;code&gt;allow_delegation&lt;/code&gt;. These help - but they're all enforced inside the same process as the agent. If the agent's runtime is compromised, the restrictions go with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Missing Layer: External Enforcement
&lt;/h2&gt;

&lt;p&gt;What if permissions weren't checked by the agent at all?&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent sends intent  --&amp;gt;  Gateway  --&amp;gt;  Check policy  --&amp;gt;  Deliver or block
                                          |
                                    403 + audit log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The agent never sees the blocked request. There is no prompt to inject around. The policy lives outside the agent, outside the LLM, outside the framework. It's enforced at the network level.&lt;/p&gt;

&lt;p&gt;This is what AXME action policies do. Every intent (action request) passes through the AXME gateway before reaching any agent. The gateway checks the action policy for that agent and blocks anything that doesn't match.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Modes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Open&lt;/strong&gt; (default) - everything passes through. No restrictions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Allowlist&lt;/strong&gt; - only explicitly listed intent types are allowed. Everything else is blocked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Denylist&lt;/strong&gt; - everything is allowed except explicitly listed intent types.&lt;/p&gt;

&lt;p&gt;Each policy has a direction: &lt;strong&gt;send&lt;/strong&gt; (what the agent can initiate) or &lt;strong&gt;receive&lt;/strong&gt; (what the agent can be asked to do). You can set both.&lt;/p&gt;
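&lt;p&gt;The decision the gateway makes per intent reduces to a few lines. An illustrative sketch of the three modes, with exact matching for clarity (real patterns support wildcards, and the check runs once per direction):&lt;/p&gt;

```python
def allowed(mode, patterns, intent_type):
    """Decide whether an intent passes under one direction's policy."""
    if mode == "open":
        return True          # default: no restrictions
    matched = intent_type in patterns
    if mode == "allowlist":
        return matched       # only listed types pass
    if mode == "denylist":
        return not matched   # everything passes except listed types
    raise ValueError(f"unknown mode: {mode}")
```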

&lt;h2&gt;
  
  
  What This Looks Like
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Set the policy
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.axme.ai/v1/mesh/agents/analytics-agent/policies/action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AXME_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;direction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;receive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;allowlist&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patterns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent.data.read.*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent.data.query.*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="c1"&gt;# {"ok": true, "policy_id": "pol_...", "mode": "allowlist", ...}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the analytics agent can only receive data read and query intents. Nothing else.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens when a blocked intent is sent
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.axme.ai/v1/mesh/intents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AXME_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent.data.delete.v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent://myorg/production/analytics-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 403
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="c1"&gt;# {
#   "error": "action_policy_violation",
#   "message": "Intent type 'intent.data.delete.v1' not in receive allowlist",
#   "direction": "receive",
#   "address_id": "analytics-agent"
# }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The delete intent never reaches the agent. The gateway returns 403. The violation is logged in the audit trail with timestamp, caller identity, blocked intent type, and the policy that blocked it.&lt;/p&gt;
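&lt;p&gt;An audit entry doesn't need to be elaborate to be useful. Something like the following captures "who tried what, and which policy stopped it" - a sketch with invented field names, not the AXME schema:&lt;/p&gt;

```python
from datetime import datetime, timezone

def audit_entry(caller, intent_type, policy_id, direction):
    """Build one audit-trail record for a blocked intent."""
    return {
        "at": datetime.now(timezone.utc).isoformat(),
        "caller": caller,
        "intent_type": intent_type,
        "policy_id": policy_id,
        "direction": direction,
        "outcome": "blocked",
    }
```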

&lt;h2&gt;
  
  
  Why This Matters More Than You Think
&lt;/h2&gt;

&lt;p&gt;The difference between prompt-based restrictions and gateway-enforced policies is the same difference between a "please knock" sign and a locked door.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;System prompt restrictions&lt;/th&gt;
&lt;th&gt;Gateway-enforced policies&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Enforced by&lt;/td&gt;
&lt;td&gt;The LLM itself&lt;/td&gt;
&lt;td&gt;Network gateway&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt injection&lt;/td&gt;
&lt;td&gt;Vulnerable&lt;/td&gt;
&lt;td&gt;Cannot be bypassed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Change without redeploy&lt;/td&gt;
&lt;td&gt;Edit prompt, redeploy agent&lt;/td&gt;
&lt;td&gt;API call or dashboard click&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Every violation logged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-agent&lt;/td&gt;
&lt;td&gt;Configure each agent separately&lt;/td&gt;
&lt;td&gt;Centralized policy management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Framework dependency&lt;/td&gt;
&lt;td&gt;Framework-specific&lt;/td&gt;
&lt;td&gt;Works with any framework&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Real scenarios this prevents
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario 1: Scope creep.&lt;/strong&gt; Your analytics agent starts as read-only. Over time, someone adds a "fix data quality issues" tool. The agent now has write access that was never intended. With an allowlist policy, the new tool's intents are blocked until explicitly added.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 2: Multi-tenant isolation.&lt;/strong&gt; Customer A's agent should never send intents to Customer B's agents. Denylist the cross-tenant intent patterns. Done at the gateway, not in every agent's prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 3: Gradual rollout.&lt;/strong&gt; New agent capability goes to staging first. Production policy blocks the new intent type until you're ready. Toggle it with one API call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Patterns Support Wildcards
&lt;/h2&gt;

&lt;p&gt;You don't need to list every version of every intent type:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Matches&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;intent.data.read.v1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Exact match&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;intent.data.read.*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Any version of data read&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;intent.data.*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Any data intent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;intent.billing.refund.*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Any refund intent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A single allowlist entry like &lt;code&gt;intent.data.read.*&lt;/code&gt; covers current and future versions of that intent type.&lt;/p&gt;
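&lt;p&gt;For intuition, the glob-style semantics above can be approximated with Python's &lt;code&gt;fnmatch&lt;/code&gt; - a sketch of the matching behavior, not AXME's actual matcher:&lt;/p&gt;

```python
from fnmatch import fnmatch

def intent_allowed(intent_type, allowlist):
    """True if the intent type matches any allowlist pattern."""
    return any(fnmatch(intent_type, pattern) for pattern in allowlist)

# An allowlist with two wildcard patterns
allow = ["intent.data.read.*", "intent.data.query.*"]
```

&lt;p&gt;With this allowlist, &lt;code&gt;intent.data.read.v1&lt;/code&gt; and a future &lt;code&gt;intent.data.read.v2&lt;/code&gt; both pass, while &lt;code&gt;intent.billing.refund.v1&lt;/code&gt; is rejected.&lt;/p&gt;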

&lt;h2&gt;
  
  
  CLI and Dashboard
&lt;/h2&gt;

&lt;p&gt;For teams that prefer not to write code for policy management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set allowlist via CLI&lt;/span&gt;
axme mesh policies &lt;span class="nb"&gt;set &lt;/span&gt;analytics-agent &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--direction&lt;/span&gt; receive &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--mode&lt;/span&gt; allowlist &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--patterns&lt;/span&gt; &lt;span class="s2"&gt;"intent.data.read.*,intent.data.query.*"&lt;/span&gt;

&lt;span class="c"&gt;# View policies&lt;/span&gt;
axme mesh policies get analytics-agent

&lt;span class="c"&gt;# Remove policy (reverts to open)&lt;/span&gt;
axme mesh policies delete analytics-agent &lt;span class="nt"&gt;--direction&lt;/span&gt; receive
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use the visual dashboard at &lt;a href="https://mesh.axme.ai" rel="noopener noreferrer"&gt;mesh.axme.ai&lt;/a&gt; - select an agent, set policies, and see violations in real time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" alt="Agent Mesh Dashboard" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Policy configuration and violation history are managed from the same interface:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyhfjdk1c4ecv3q9z6p7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyhfjdk1c4ecv3q9z6p7.png" alt="Policies" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Works With Any Framework
&lt;/h2&gt;

&lt;p&gt;AXME action policies operate at the transport layer. The agent framework, LLM provider, and programming language don't matter.&lt;/p&gt;

&lt;p&gt;LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Google ADK, Pydantic AI, raw Python, TypeScript, Go, Java, .NET - all of them send intents through the same gateway. All of them are subject to the same policies.&lt;/p&gt;

&lt;p&gt;The agent framework handles reasoning. AXME handles permissions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Full working example with scenario, agent, and policy setup:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/ai-agent-policy-enforcement" rel="noopener noreferrer"&gt;github.com/AxmeAI/ai-agent-policy-enforcement&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; - durable execution and governance for AI agents. Alpha - feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>agents</category>
      <category>governance</category>
    </item>
    <item>
      <title>3 of Your AI Agents Crashed and You Found Out From Customers</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Thu, 02 Apr 2026 06:34:29 +0000</pubDate>
      <link>https://dev.to/george_belsky/3-of-your-ai-agents-crashed-and-you-found-out-from-customers-2heb</link>
      <guid>https://dev.to/george_belsky/3-of-your-ai-agents-crashed-and-you-found-out-from-customers-2heb</guid>
      <description>&lt;p&gt;You have 20 agents running across 4 machines. Order processing, refunds, inventory sync, email notifications. They've been running fine for weeks.&lt;/p&gt;

&lt;p&gt;Monday afternoon, the order-processor agent on machine-3 gets OOM killed. Process gone. No error. No alert. The refund-agent that depended on it starts hanging too.&lt;/p&gt;

&lt;p&gt;You find out at 5:45 PM when a customer emails: "My refund has been pending for 3 hours."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Monitoring Gap Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Traditional services have health checks. Kubernetes has liveness probes. Load balancers have health endpoints. When a web server dies, something notices within seconds.&lt;/p&gt;

&lt;p&gt;AI agents have none of this.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LangGraph:  No health monitoring. Agent runs or doesn't.
CrewAI:     No heartbeat. No fleet visibility.
AutoGen:    No built-in health checks across agents.
Raw Python: Hope someone checks the process list.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Your agent is a Python process. When it dies, it's just a missing PID. No health endpoint. No heartbeat. No dashboard showing 19/20 agents healthy.&lt;/p&gt;

&lt;p&gt;The standard answer is "use Kubernetes" or "use systemd." Those track process liveness. They don't track agent health. An agent can be alive but stuck - processing zero tasks, blocked on a downstream dependency, spinning in an infinite retry loop. Process is running. Agent is useless.&lt;/p&gt;
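&lt;p&gt;The gap is easy to see in code. A liveness probe only proves the PID exists; a minimal stuck detector - illustrative names, not any framework's API - also asks when work last completed:&lt;/p&gt;

```python
import time

class AgentHealth:
    """Tracks task completion so 'alive but useless' is detectable."""

    def __init__(self, stuck_after_seconds=120):
        self.stuck_after = stuck_after_seconds
        self.last_task_done = time.time()

    def task_completed(self):
        self.last_task_done = time.time()

    def is_stuck(self, now=None):
        # The process may be running, but no task has finished recently.
        if now is None:
            now = time.time()
        return now - self.last_task_done > self.stuck_after
```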

&lt;h2&gt;
  
  
  What You End Up Building
&lt;/h2&gt;

&lt;p&gt;Every team that runs agents at scale builds the same thing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# heartbeat_sender.py - added to every agent
import redis
import time
import threading

r = redis.Redis()

def heartbeat_loop():
    while True:
        r.set(f"heartbeat:{AGENT_ID}", time.time())
        time.sleep(30)

threading.Thread(target=heartbeat_loop, daemon=True).start()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Plus the checker:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# health_checker.py - separate process
def check_agents():
    agents = r.smembers("registered_agents")
    for agent_id in agents:
        last_ping = r.get(f"heartbeat:{agent_id}")
        if last_ping is None:
            continue
        elapsed = time.time() - float(last_ping)
        if elapsed &amp;gt; 90:
            send_pagerduty_alert(f"{agent_id} unreachable")
        elif elapsed &amp;gt; 60:
            send_slack_alert(f"{agent_id} degraded")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Plus Redis infrastructure. Plus Slack webhooks. Plus PagerDuty integration. Plus a dashboard. Plus agent registration. Plus logic to distinguish agents that were intentionally stopped from ones that crashed.&lt;/p&gt;

&lt;p&gt;Every team builds this. Every team maintains it. Every team's version has slightly different bugs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Should Look Like
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from axme import AxmeClient, AxmeClientConfig
import os

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))

# Start heartbeat (background thread, every 30s)
client.mesh.start_heartbeat(interval_seconds=30)

# Agent does its normal work
while True:
    task = get_next_task()
    result = process(task)
    client.mesh.report_metric(success=True, latency_ms=result.duration_ms, cost_usd=result.cost)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Three lines of setup. The platform handles heartbeat tracking, status transitions, alerting, and the dashboard.&lt;/p&gt;

&lt;p&gt;From any monitoring service:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;result = client.mesh.list_agents()
for agent in result["agents"]:
    print(f"{agent['display_name']}: {agent['health_status']} (last: {agent['last_heartbeat_at']})")

# order-processor:  healthy      (last: 2026-04-01T14:30:02+00:00)
# refund-agent:     healthy      (last: 2026-04-01T14:30:05+00:00)
# inventory-sync:   degraded     (last: 2026-04-01T14:29:32+00:00)
# email-sender:     unreachable  (last: 2026-04-01T14:27:00+00:00)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Four Health States
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;th&gt;How It's Triggered&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HEALTHY&lt;/td&gt;
&lt;td&gt;Running, reporting normally&lt;/td&gt;
&lt;td&gt;Heartbeat received on time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DEGRADED&lt;/td&gt;
&lt;td&gt;Running, but heartbeat is late&lt;/td&gt;
&lt;td&gt;No heartbeat for 90-300 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UNREACHABLE&lt;/td&gt;
&lt;td&gt;Stopped sending heartbeats&lt;/td&gt;
&lt;td&gt;No heartbeat for 300+ seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KILLED&lt;/td&gt;
&lt;td&gt;Intentionally terminated&lt;/td&gt;
&lt;td&gt;Explicit shutdown or kill command&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key distinction: DEGRADED vs UNREACHABLE.&lt;/p&gt;

&lt;p&gt;DEGRADED means the heartbeat is late (90-300 seconds). The agent might be stuck or overloaded.&lt;/p&gt;

&lt;p&gt;UNREACHABLE means no heartbeat for over 5 minutes. The agent is likely down.&lt;/p&gt;

&lt;p&gt;This distinction matters because the response is different. Degraded - investigate. Unreachable - restart immediately.&lt;/p&gt;
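&lt;p&gt;The transition logic implied by the table is simple enough to state directly - a sketch using the thresholds above, not AXME's implementation:&lt;/p&gt;

```python
def health_status(seconds_since_heartbeat, killed=False):
    """Map heartbeat age to one of the four states in the table."""
    if killed:
        return "KILLED"       # explicit shutdown or kill command
    if seconds_since_heartbeat >= 300:
        return "UNREACHABLE"  # likely down: restart immediately
    if seconds_since_heartbeat >= 90:
        return "DEGRADED"     # heartbeat is late: investigate
    return "HEALTHY"
```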
&lt;h2&gt;
  
  
  Timeline: Monday With vs Without
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Without health monitoring:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;14:30  order-processor OOM killed
14:30  No alert
15:00  refund-agent hangs (downstream dep gone)
15:00  No alert
17:45  Customer: "My refund has been pending for 3 hours"
17:50  Engineer SSHs into machine-3
17:55  "Oh. It's been dead since 2:30."
18:10  Restart. Begin processing backlog.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;3 hours 15 minutes of silent failure. Customer-reported.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With AXME mesh:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;14:30  order-processor misses heartbeat
14:31  Status: HEALTHY -&amp;gt; UNREACHABLE
14:31  Alert: "order-processor on machine-3 unreachable"
14:32  Engineer sees alert, checks dashboard
14:33  refund-agent status: DEGRADED (downstream timeout)
14:35  Restart order-processor. Both agents recover.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;5 minutes. No customer impact.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern: Observability for Agents
&lt;/h2&gt;

&lt;p&gt;Web services have been doing this for 20 years. Health checks, readiness probes, metrics endpoints, dashboards. The tooling is mature.&lt;/p&gt;

&lt;p&gt;AI agents are running the same way we ran web services in 2005 - deploy it, hope it works, find out when users complain.&lt;/p&gt;

&lt;p&gt;The monitoring patterns are the same:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Heartbeat&lt;/strong&gt; - periodic "I'm alive" signal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status reporting&lt;/strong&gt; - "I'm alive AND here's how I'm doing"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fleet view&lt;/strong&gt; - see all agents in one place&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerting&lt;/strong&gt; - notify when something changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;History&lt;/strong&gt; - when did it go down? How long was it out?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The difference is where these run. Web services have infrastructure that assumes health checks exist. Agent frameworks assume agents are ephemeral scripts that run and exit. Long-running agents fall through the gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond Liveness: Application-Level Metrics
&lt;/h2&gt;

&lt;p&gt;Process monitoring tells you the PID exists. Application-level metrics tell you the agent is actually doing useful work.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Report metrics with each processed task
client.mesh.report_metric(success=True, latency_ms=230, cost_usd=0.03)

# Failed task
client.mesh.report_metric(success=False, latency_ms=4500, cost_usd=0.01)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Metrics are buffered and sent with the next heartbeat. The dashboard shows intents processed, success rate, latency, and cost per agent.&lt;/p&gt;
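&lt;p&gt;The buffering pattern is worth spelling out - a minimal sketch of buffer-then-flush, with illustrative field names rather than the SDK's:&lt;/p&gt;

```python
class MetricBuffer:
    """Collects metrics locally; flushed when the next heartbeat is sent."""

    def __init__(self):
        self.pending = []

    def report(self, success, latency_ms, cost_usd):
        # Cheap local append; no network call per task.
        self.pending.append(
            {"success": success, "latency_ms": latency_ms, "cost_usd": cost_usd}
        )

    def flush(self):
        # Piggybacks on the heartbeat: one request carries all metrics.
        batch, self.pending = self.pending, []
        return batch
```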

&lt;h2&gt;
  
  
  The Dashboard
&lt;/h2&gt;

&lt;p&gt;The AXME Mesh Dashboard at &lt;a href="https://mesh.axme.ai" rel="noopener noreferrer"&gt;mesh.axme.ai&lt;/a&gt; shows your entire fleet health in real time - status, last heartbeat, cost, and alerts in one view:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" alt="Agent Mesh Dashboard" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No log diving. No Grafana setup. No custom alerting pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Working example - register an agent, start heartbeat, kill it, watch the status change to UNREACHABLE:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/ai-agent-health-monitoring" rel="noopener noreferrer"&gt;github.com/AxmeAI/ai-agent-health-monitoring&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; - health monitoring, heartbeat, and fleet visibility for AI agents. Alpha - feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>monitoring</category>
      <category>devops</category>
    </item>
    <item>
      <title>Your AI Agent Is Running Wild and You Can't Stop It</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Thu, 02 Apr 2026 06:33:42 +0000</pubDate>
      <link>https://dev.to/george_belsky/your-ai-agent-is-running-wild-and-you-cant-stop-it-gkm</link>
      <guid>https://dev.to/george_belsky/your-ai-agent-is-running-wild-and-you-cant-stop-it-gkm</guid>
      <description>&lt;p&gt;It's 9 AM. Your email campaign agent started 10 minutes ago. It's processing 50,000 customer records, sending personalized outreach emails in batches of 100.&lt;/p&gt;

&lt;p&gt;At 9:05 you notice the email template has a broken unsubscribe link. Every email going out violates CAN-SPAM.&lt;/p&gt;

&lt;p&gt;The agent has already sent 3,000 emails. It's running on 3 Cloud Run instances across two regions. It's sending 100 emails every 2 seconds.&lt;/p&gt;

&lt;p&gt;You need to stop it. Now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Ctrl+C Doesn't Work in Production
&lt;/h2&gt;

&lt;p&gt;If your agent runs as a local script, sure - Ctrl+C. But production agents don't work that way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud functions and containers.&lt;/strong&gt; Your agent is a Cloud Run service or Lambda function. There's no terminal to Ctrl+C. You can delete the service, but shutdown grace periods and in-flight requests mean instances keep running for 30-60 seconds. That's another 1,500 emails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multiple instances.&lt;/strong&gt; Auto-scaling gave you 3 replicas. You kill one, the other two keep going. You need to find and kill each one individually, across regions, while the clock ticks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No state preservation.&lt;/strong&gt; When you force-kill a process, you lose all state. Which emails were sent? Which batch was in progress? When you fix the template and restart, do you send from the beginning (duplicating 3,000 emails) or guess where to pick up?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No audit trail.&lt;/strong&gt; After the incident, your manager asks: "When exactly did we stop? How many went out? Who stopped it?" You have CloudWatch logs, maybe. Good luck piecing together the timeline.&lt;/p&gt;

&lt;p&gt;This isn't hypothetical. Every team running AI agents in production has some version of this story. An agent that makes API calls, processes data, or takes actions autonomously - and at some point does the wrong thing at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Infrastructure You'd Have to Build
&lt;/h2&gt;

&lt;p&gt;To build a proper kill switch yourself, you need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1. Shared state store (Redis/DynamoDB)
&lt;/span&gt;&lt;span class="n"&gt;kill_flags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis-cluster.internal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Agent checks flag before every action
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;kill_flags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kill:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;save_checkpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;progress&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;AgentKilledException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Kill signal received&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# ... send emails
&lt;/span&gt;
&lt;span class="c1"&gt;# 3. API endpoint to set the flag
&lt;/span&gt;&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/agents/{agent_id}/kill&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;kill_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;kill_flags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kill:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# But what about agents that check infrequently?
&lt;/span&gt;    &lt;span class="c1"&gt;# What about agents that don't check at all?
&lt;/span&gt;    &lt;span class="c1"&gt;# What about actions already in flight?
&lt;/span&gt;
&lt;span class="c1"&gt;# 4. Resume logic
&lt;/span&gt;&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/agents/{agent_id}/resume&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;resume_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;kill_flags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kill:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;checkpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_checkpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Restart from checkpoint... somehow
&lt;/span&gt;
&lt;span class="c1"&gt;# 5. Audit log
# 6. Dashboard
# 7. Multi-region coordination
# 8. Monitoring for agents that ignore the flag
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a distributed coordination system. Redis cluster, custom API, checkpoint management, audit logging, monitoring. You wanted a kill switch; you got a platform project.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Kill Switch Should Actually Be
&lt;/h2&gt;

&lt;p&gt;One API call. Every instance stops. Full audit trail. Resume from checkpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;axme&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AxmeClientConfig&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AxmeClientConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AXME_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="c1"&gt;# Kill - all instances, all regions, under 1 second
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mesh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;addr_abc123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# address_id from list_agents()
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the operator side. On the agent side, you add heartbeat calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Start background heartbeat (every 30s)
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mesh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_heartbeat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;email_batches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;send_emails&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mesh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;report_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost_usd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you call &lt;code&gt;mesh.kill(address_id)&lt;/code&gt;, the gateway blocks all intents to and from that agent. The heartbeat response returns &lt;code&gt;health_status: "killed"&lt;/code&gt;. The agent can check this and stop cleanly.&lt;/p&gt;
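&lt;p&gt;The agent-side pattern looks like this - a self-contained sketch where &lt;code&gt;heartbeat()&lt;/code&gt; stands in for the gateway's heartbeat response, stubbed here with a local flag:&lt;/p&gt;

```python
KILL_FLAG = {"killed": False}

def heartbeat():
    # Stub: in production this is the gateway's heartbeat response.
    return {"health_status": "killed" if KILL_FLAG["killed"] else "healthy"}

def run_batches(batches):
    processed = []
    for batch in batches:
        if heartbeat()["health_status"] == "killed":
            break  # stop cleanly; the gateway is already blocking intents
        processed.append(batch)
    return processed
```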

&lt;h2&gt;
  
  
  Gateway-Level Enforcement
&lt;/h2&gt;

&lt;p&gt;Here's what makes this different from a "please stop" flag in Redis: the kill switch is enforced at the gateway level.&lt;/p&gt;

&lt;p&gt;When an agent is killed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Heartbeat responses return &lt;code&gt;health_status: "killed"&lt;/code&gt;&lt;/strong&gt; - the agent sees it's been killed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All new intents to this agent are rejected (403)&lt;/strong&gt; - nothing gets delivered&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All outbound intents from this agent are blocked&lt;/strong&gt; - it can't take actions through AXME&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even if the agent code ignores the heartbeat response, its intents are blocked at the gateway. The agent can't send or receive anything through AXME.&lt;/p&gt;

&lt;p&gt;This matters because the scariest scenario isn't an agent that checks the kill flag and stops politely. It's an agent with a bug that keeps running regardless. Gateway enforcement handles that case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resume from Checkpoint
&lt;/h2&gt;

&lt;p&gt;After you fix the email template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Resume - agent starts receiving intents again
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mesh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resume&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;addr_abc123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent's health_status goes back to "unknown" and becomes "healthy" on the next heartbeat. Intents start flowing again.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Dashboard
&lt;/h2&gt;

&lt;p&gt;The AXME Mesh Dashboard (&lt;a href="https://mesh.axme.ai" rel="noopener noreferrer"&gt;mesh.axme.ai&lt;/a&gt;) gives you a real-time view of all your agents:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" alt="Agent Mesh Dashboard" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Live health status for every agent (healthy, degraded, unreachable, killed)&lt;/li&gt;
&lt;li&gt;One-click kill and resume buttons&lt;/li&gt;
&lt;li&gt;Cost tracking per agent (API calls, LLM tokens, dollars)&lt;/li&gt;
&lt;li&gt;Full audit log - every kill, resume, and policy change with who did it and when&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When something goes wrong at 9 AM, you don't need to SSH into a server, find a process ID, or write a Redis command. You open the dashboard, find the agent, and click kill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Doing It Yourself vs. Using AXME
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What you need&lt;/th&gt;
&lt;th&gt;Build yourself&lt;/th&gt;
&lt;th&gt;AXME&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Kill signal delivery&lt;/td&gt;
&lt;td&gt;Redis cluster + polling&lt;/td&gt;
&lt;td&gt;One API call, gateway-enforced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-instance coordination&lt;/td&gt;
&lt;td&gt;Service discovery + broadcast&lt;/td&gt;
&lt;td&gt;Automatic via mesh&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State preservation&lt;/td&gt;
&lt;td&gt;Custom checkpoint system&lt;/td&gt;
&lt;td&gt;Gateway tracks last heartbeat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resume&lt;/td&gt;
&lt;td&gt;Custom restart logic&lt;/td&gt;
&lt;td&gt;&lt;code&gt;mesh.resume(address_id)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail&lt;/td&gt;
&lt;td&gt;Custom logging + storage&lt;/td&gt;
&lt;td&gt;Built-in event log&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dashboard&lt;/td&gt;
&lt;td&gt;Build a UI&lt;/td&gt;
&lt;td&gt;&lt;a href="https://mesh.axme.ai" rel="noopener noreferrer"&gt;mesh.axme.ai&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enforcement for buggy agents&lt;/td&gt;
&lt;td&gt;Hope they check the flag&lt;/td&gt;
&lt;td&gt;Gateway blocks all outbound&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup time&lt;/td&gt;
&lt;td&gt;2-4 weeks&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pip install axme&lt;/code&gt; + 5 lines&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;axme
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Working example with a simulated email campaign agent, kill switch, and resume:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/ai-agent-kill-switch" rel="noopener noreferrer"&gt;github.com/AxmeAI/ai-agent-kill-switch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; - agent mesh with kill switch, heartbeat monitoring, and durable lifecycle. Alpha - feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>agents</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Your AI Agent Spent $500 Overnight and Nobody Noticed</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Thu, 02 Apr 2026 06:33:25 +0000</pubDate>
      <link>https://dev.to/george_belsky/your-ai-agent-spent-500-overnight-and-nobody-noticed-8ci</link>
      <guid>https://dev.to/george_belsky/your-ai-agent-spent-500-overnight-and-nobody-noticed-8ci</guid>
      <description>&lt;p&gt;Friday 5 PM. You deploy a research agent that processes customer tickets. It calls GPT-4 for each one. Expected load: 200 tickets a day, about $8 in API costs.&lt;/p&gt;

&lt;p&gt;Friday 11 PM. A bug in ticket deduplication. The agent reprocesses the same tickets in a loop. Each iteration makes 4 LLM calls at $0.03 each. The loop runs 50 times per hour.&lt;/p&gt;

&lt;p&gt;Saturday 3 AM. The loop has already made 800 redundant LLM calls - about $24 - and it is still running. Nobody is watching.&lt;/p&gt;

&lt;p&gt;Monday 9 AM. The loop has run all weekend: nearly 12,000 wasted calls, roughly $350 on top of normal usage. Total damage: $487 - just under the $500 billing alert you set months ago, which is why it never fired. No logs showing which agent caused it, which task triggered the loop, or when it started.&lt;/p&gt;

&lt;p&gt;This is not hypothetical. Every team running AI agents in production has a version of this story.&lt;/p&gt;
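&lt;p&gt;The failure arithmetic is mundane - a back-of-the-envelope sketch using the numbers above:&lt;/p&gt;

```python
# Back-of-the-envelope: a quiet loop at a modest rate still adds up.
# Figures from the story above: 4 calls per iteration, ~$0.03 per call,
# 50 iterations per hour, Friday 11 PM to Monday 9 AM.
calls_per_iteration = 4
cost_per_call_usd = 0.03
iterations_per_hour = 50
hours = 58  # Friday 11 PM to Monday 9 AM

total_calls = calls_per_iteration * iterations_per_hour * hours
total_cost_usd = total_calls * cost_per_call_usd

print(total_calls)            # 11600
print(round(total_cost_usd))  # 348
```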

&lt;h2&gt;
  
  
  Why Standard Monitoring Doesn't Help
&lt;/h2&gt;

&lt;p&gt;OpenAI gives you total organization spend. Not per-agent. Not per-task. Not in real time.&lt;/p&gt;

&lt;p&gt;If you have 5 agents calling GPT-4, and one goes haywire, your OpenAI dashboard shows a line going up. Which agent? You don't know. Which task caused the spike? You don't know. When did it start? You can guess from the slope of the graph.&lt;/p&gt;

&lt;p&gt;Cloud monitoring (Datadog, Grafana) tracks CPU and memory. It doesn't know about LLM tokens. You could instrument it yourself - custom metrics, Prometheus counters, StatsD gauges - but now you're building a cost monitoring system instead of building your product.&lt;/p&gt;

&lt;p&gt;Billing alerts are too late and too coarse. A $500 alert tells you the money is already gone. A per-API-key alert doesn't map to individual agents.&lt;/p&gt;

&lt;p&gt;What you actually need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost per agent, in real time&lt;/li&gt;
&lt;li&gt;Cost per task (not just per agent)&lt;/li&gt;
&lt;li&gt;Budget limits that actually stop the agent&lt;/li&gt;
&lt;li&gt;Alerts before the damage is done&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Tracking Cost Through the Agent Heartbeat
&lt;/h2&gt;

&lt;p&gt;AXME agents send heartbeats every 30 seconds - standard health reporting. The insight is that cost is just another metric in that heartbeat.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from axme import AxmeClient, AxmeClientConfig

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))

def call_llm(prompt: str) -&amp;gt; str:
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )

    # Calculate cost from token usage
    # (GPT-4 pricing: $0.03 per 1K prompt tokens, $0.06 per 1K completion tokens)
    tokens_in = response.usage.prompt_tokens
    tokens_out = response.usage.completion_tokens
    cost_usd = (tokens_in * 0.03 + tokens_out * 0.06) / 1000

    # Report cost alongside the regular heartbeat
    client.mesh.report_metric(cost_usd=cost_usd)

    return response.choices[0].message.content
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Two lines of actual logic: calculate the cost, report it. The gateway accumulates it per agent, per intent, per time window. No Prometheus setup. No custom Datadog metrics. No StatsD.&lt;/p&gt;

&lt;h2&gt;
  
  
  Budget Limits That Actually Stop Agents
&lt;/h2&gt;

&lt;p&gt;Reporting cost is useful. Enforcing limits is essential.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Set cost policy via API
PUT /v1/mesh/agents/{address_id}/policies/cost
{
    "max_intents_per_day": 500,
    "max_cost_per_day_usd": 50.00,
    "max_intents_per_hour": 100,
    "action_on_breach": "block"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When the research agent hits $50 for the day, the gateway blocks new intents with HTTP 429. Not after $500. Not after the invoice. At $50, in real time.&lt;/p&gt;
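&lt;p&gt;On the agent side, the behavior to plan for is that 429. A sketch of the enforcement logic the policy implies - field names mirror the policy JSON above, but the code itself is illustrative, not AXME internals:&lt;/p&gt;

```python
# Sketch of the enforcement a cost policy implies. Field names mirror
# the policy JSON above; the classes and function are illustrative.
from dataclasses import dataclass

@dataclass
class CostPolicy:
    max_cost_per_day_usd: float = 50.00
    max_intents_per_day: int = 500
    action_on_breach: str = "block"

def admit(policy: CostPolicy, spent_today_usd: float, intents_today: int) -> int:
    """Return the HTTP status a gateway applying this policy would answer with."""
    breached = (
        spent_today_usd >= policy.max_cost_per_day_usd
        or intents_today >= policy.max_intents_per_day
    )
    if breached and policy.action_on_breach == "block":
        return 429  # Too Many Requests - new intents are refused
    return 200

policy = CostPolicy()
print(admit(policy, spent_today_usd=12.40, intents_today=80))  # 200
print(admit(policy, spent_today_usd=50.00, intents_today=80))  # 429
```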

&lt;p&gt;You can also set this from the dashboard at &lt;a href="https://mesh.axme.ai" rel="noopener noreferrer"&gt;mesh.axme.ai&lt;/a&gt; - select the agent, set cost limits, save.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Dashboard Shows
&lt;/h2&gt;

&lt;p&gt;The AXME mesh dashboard shows cost alongside agent status:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" alt="Agent Mesh Dashboard" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Day, week, month views. Agents that hit their daily limit are blocked automatically. No surprises.&lt;/p&gt;

&lt;p&gt;Cost policies are managed visually alongside agent health:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyhfjdk1c4ecv3q9z6p7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyhfjdk1c4ecv3q9z6p7.png" alt="Policies" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Alternative
&lt;/h2&gt;

&lt;p&gt;Without this, you build it yourself:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Instrument every LLM call with token counting&lt;/li&gt;
&lt;li&gt;Send custom metrics to Prometheus/Datadog/CloudWatch&lt;/li&gt;
&lt;li&gt;Build dashboards per agent (Grafana? Retool? Custom?)&lt;/li&gt;
&lt;li&gt;Write alerting rules with the right thresholds&lt;/li&gt;
&lt;li&gt;Build the "pause agent" mechanism yourself&lt;/li&gt;
&lt;li&gt;Map OpenAI costs to individual agents in your billing system&lt;/li&gt;
&lt;li&gt;Maintain all of this as models and pricing change&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is a real project. Weeks of work. And it is not your product.&lt;/p&gt;

&lt;p&gt;Or: report &lt;code&gt;cost_usd&lt;/code&gt; in the heartbeat your agent already sends. Set a policy. Done.&lt;/p&gt;
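&lt;p&gt;If your agents call more than one model, the only piece you own is computing &lt;code&gt;cost_usd&lt;/code&gt; before reporting it. One way to keep that in one place - the GPT-4 rates match the snippet above; the helper itself and any other entries are yours to fill in from your provider's pricing page:&lt;/p&gt;

```python
# Per-model pricing table for computing cost_usd before reporting it.
# The gpt-4 rates match the article; the table and helper are an
# illustrative sketch - extend with whatever your provider charges.
PRICING_PER_1K = {
    # model: (usd per 1K prompt tokens, usd per 1K completion tokens)
    "gpt-4": (0.03, 0.06),
}

def llm_cost_usd(model: str, tokens_in: int, tokens_out: int) -> float:
    rate_in, rate_out = PRICING_PER_1K[model]
    return (tokens_in * rate_in + tokens_out * rate_out) / 1000

print(llm_cost_usd("gpt-4", tokens_in=1000, tokens_out=500))  # 0.06
```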

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Working example with cost reporting, budget limits, and multi-model tracking:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/ai-agent-cost-monitoring" rel="noopener noreferrer"&gt;github.com/AxmeAI/ai-agent-cost-monitoring&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; - agent coordination with durable lifecycle and cost controls. Alpha - feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>llm</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
