<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tijo Gaucher</title>
    <description>The latest articles on DEV Community by Tijo Gaucher (@rapidclaw).</description>
    <link>https://dev.to/rapidclaw</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3850323%2F4c57502d-d13a-4255-aa80-30e2ab22d035.jpeg</url>
      <title>DEV Community: Tijo Gaucher</title>
      <link>https://dev.to/rapidclaw</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rapidclaw"/>
    <language>en</language>
    <item>
      <title>5 ways your AI agent runtime silently dies overnight (and the boring fix for each)</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Mon, 04 May 2026 07:40:16 +0000</pubDate>
      <link>https://dev.to/rapidclaw/5-ways-your-ai-agent-runtime-silently-dies-overnight-and-the-boring-fix-for-each-279o</link>
      <guid>https://dev.to/rapidclaw/5-ways-your-ai-agent-runtime-silently-dies-overnight-and-the-boring-fix-for-each-279o</guid>
      <description>&lt;p&gt;I ran the same agent for thirty straight days. It died five times. Four of them did not show up in any log I had set up ahead of time, which is the part that bothers me most.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foj2gmx8vojzgkaoqif6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foj2gmx8vojzgkaoqif6g.png" alt="5 ways your AI agent runtime silently dies overnight" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By the end I had a checklist of things that take an agent down at 2am while you're asleep, and none of them are the dramatic failures that get blog posts. They are all dull.&lt;/p&gt;

&lt;p&gt;Here is the list.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. OOM during a long tool-call loop
&lt;/h2&gt;

&lt;p&gt;The agent is happily looping through 200 tool calls in one task. Each call returns a response. The agent appends every response to its working context plus an internal trace it writes to disk. Around call 150, memory is being allocated faster than it is being released. By call 180, the kernel OOM-killer wakes up and ends the process.&lt;/p&gt;

&lt;p&gt;In the log: nothing. The agent's stdout cuts off mid-sentence. The supervisor logs say "process exited 137", which is 128 plus SIGKILL, the classic signature of an OOM kill, but very few people read it that way the first time.&lt;/p&gt;

&lt;p&gt;The boring fix: cgroup memory limits with a soft warning at 80%, plus a tool-call counter that flushes the working trace to disk every 25 calls and resets the in-memory copy. Not exotic. Just remembering that long agent loops are basically a memory leak unless you actively flush.&lt;/p&gt;
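
&lt;p&gt;A minimal sketch of the flush counter, assuming a JSONL trace file and a &lt;code&gt;record_tool_call&lt;/code&gt; hook in the loop (both names are mine, not from any particular framework); the cgroup limit itself lives in your unit file or container config, not here:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json, os

FLUSH_EVERY = 25                              # flush every N tool calls
TRACE_PATH = "/var/log/agent/trace.jsonl"     # path is an example

trace_buffer = []                             # in-memory working trace

def record_tool_call(entry: dict) -&gt; None:
    """Append one tool-call result; flush when the counter trips."""
    trace_buffer.append(entry)
    if len(trace_buffer) &gt;= FLUSH_EVERY:
        with open(TRACE_PATH, "a") as f:
            for item in trace_buffer:
                f.write(json.dumps(item) + "\n")
            f.flush()
            os.fsync(f.fileno())              # survive an abrupt kill
        trace_buffer.clear()                  # reset the in-memory copy
&lt;/code&gt;&lt;/pre&gt;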

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxoa5gezkm9fzvykl1l09.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxoa5gezkm9fzvykl1l09.png" alt="30-day run timeline showing five failure points" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. File descriptor exhaustion
&lt;/h2&gt;

&lt;p&gt;Day eleven. The agent had been making API calls all day. A new tool call started and immediately got &lt;code&gt;OSError: too many open files&lt;/code&gt;. The agent caught the exception, tried to retry, got the same error, gave up, returned an error to the user.&lt;/p&gt;

&lt;p&gt;The agent itself didn't crash. It just stopped being useful. The supervisor process saw "agent returned an error" and moved on. Nothing alerted.&lt;/p&gt;

&lt;p&gt;What actually happened: the agent's HTTP client was reusing a session pool that didn't close idle sockets, and over 11 days it had accumulated about 950 open FDs against the per-process default of 1024. Every new HTTP call added to the pool. Eventually it ran out.&lt;/p&gt;

&lt;p&gt;The boring fix: explicit session lifecycle with a timeout, a daily restart of the agent process, and &lt;code&gt;ulimit -n&lt;/code&gt; raised to something sane (16384 on the runtimes I cared about). The daily restart is the cheap one. People resist it because it feels primitive, but every long-running daemon I have ever shipped survives on a daily restart somewhere in the stack.&lt;/p&gt;
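
&lt;p&gt;The session side, sketched with &lt;code&gt;requests&lt;/code&gt; standing in for whatever HTTP client your agent actually uses; the timeout value is illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import requests

def call_tool_api(url: str, payload: dict) -&gt; dict:
    # One session per batch of calls, closed deterministically, so
    # idle sockets cannot accumulate across eleven days.
    with requests.Session() as session:
        resp = session.post(url, json=payload, timeout=10)
        resp.raise_for_status()
        return resp.json()
&lt;/code&gt;&lt;/pre&gt;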

&lt;h2&gt;
  
  
  3. Context window bloat
&lt;/h2&gt;

&lt;p&gt;This one I had read about, but it still got me. The agent's working context grew to about 180,000 tokens by hour 60 of a multi-day task. Each new tool call cost more than the last because the model was paying to re-read the entire history. By hour 65 a single tool call was taking 90 seconds and burning through the per-minute rate limit, which the agent interpreted as "the API is down" and went into a backoff loop.&lt;/p&gt;

&lt;p&gt;The agent didn't crash. It just got slower and slower until it was producing nothing, and the bill kept going up.&lt;/p&gt;

&lt;p&gt;The boring fix: a context summarizer that runs every N tool calls, replaces the oldest K turns with a one-paragraph summary, and keeps the most recent 5 turns verbatim. This is well-trodden ground in the literature, but the surprising part is how rarely small teams actually wire it up. Most agent codebases I have looked at assume the conversation will end in a few turns. Long-running agents need garbage collection on their own conversation history.&lt;/p&gt;
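
&lt;p&gt;A minimal sketch of that garbage collection, assuming turn dicts and a &lt;code&gt;summarize&lt;/code&gt; callable (usually one cheap model call); the shapes are assumptions, not any framework's API:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;KEEP_VERBATIM = 5       # most recent turns kept word for word

def compact(history: list[dict], summarize) -&gt; list[dict]:
    # Run every N tool calls: everything older than the newest turns
    # collapses into a single summary turn.
    if len(history) &lt;= KEEP_VERBATIM:
        return history
    old, recent = history[:-KEEP_VERBATIM], history[-KEEP_VERBATIM:]
    summary = {"role": "system",
               "content": "Summary of earlier turns: " + summarize(old)}
    return [summary] + recent
&lt;/code&gt;&lt;/pre&gt;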

&lt;p&gt;If you want a longer treatment of why &lt;a href="https://rapidclaw.dev/blog/ai-agent-hosting-complete-guide" rel="noopener noreferrer"&gt;AI agent hosting&lt;/a&gt; is mostly about boring problems like this one rather than the model itself, the longer version is worth a skim. The summary: the model is the easy part now. Everything around it is where the failures live.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. ulimit walls (max user processes)
&lt;/h2&gt;

&lt;p&gt;Day nineteen. The agent had spawned a background process for a long-running task and then gone on to do other work. Background tasks accumulated. By midnight there were 287 zombie processes attached to the agent's user, the per-user &lt;code&gt;max user processes&lt;/code&gt; limit was somewhere around 1024 in this environment, and at 03:14 a new spawn failed with &lt;code&gt;Resource temporarily unavailable&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In the log: a single line saying the spawn failed. The agent caught it as a generic exception and continued. The user-facing behavior was "this task takes forever." Three days later when I finally noticed I had to manually reap the zombies.&lt;/p&gt;

&lt;p&gt;The boring fix: a process supervisor that owns the lifecycle of every spawned task, kills anything that has been alive longer than its declared TTL, and treats child processes as a resource that needs to be tracked. &lt;code&gt;setsid&lt;/code&gt; and &lt;code&gt;prctl(PR_SET_PDEATHSIG)&lt;/code&gt; are your friends. Also raise &lt;code&gt;ulimit -u&lt;/code&gt; to something generous, but the real fix is killing things on schedule.&lt;/p&gt;
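
&lt;p&gt;A Linux-only sketch of that supervisor, with &lt;code&gt;ctypes&lt;/code&gt; doing the &lt;code&gt;prctl&lt;/code&gt; call; the TTL bookkeeping is my own shape for it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import ctypes, signal, subprocess, time

PR_SET_PDEATHSIG = 1                 # from linux/prctl.h
libc = ctypes.CDLL("libc.so.6", use_errno=True)
tracked = []                         # (proc, deadline) pairs

def _die_with_parent():
    # Runs in the child between fork and exec: the child gets SIGTERM
    # if the agent process itself dies first.
    libc.prctl(PR_SET_PDEATHSIG, signal.SIGTERM)

def spawn(cmd, ttl_seconds):
    proc = subprocess.Popen(cmd, preexec_fn=_die_with_parent,
                            start_new_session=True)   # setsid for the child
    tracked.append((proc, time.monotonic() + ttl_seconds))
    return proc

def reap():
    # Call from the main loop: kill anything past its declared TTL and
    # wait() on finished children so nothing lingers as a zombie.
    for proc, deadline in list(tracked):
        if proc.poll() is None and time.monotonic() &gt; deadline:
            proc.kill()
            proc.wait()
        if proc.poll() is not None:
            tracked.remove((proc, deadline))
&lt;/code&gt;&lt;/pre&gt;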

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzh941nmmqwhk5sd05qwq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzh941nmmqwhk5sd05qwq.png" alt="Failure mode to fix matrix" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Webhook timeouts that look like success
&lt;/h2&gt;

&lt;p&gt;Last one, and the meanest. The agent finished a task and called a webhook to notify a downstream system. The webhook took 31 seconds to respond. The HTTP client had a 30 second timeout. The client raised a timeout error. The agent's wrapper caught the timeout and &lt;em&gt;logged a success&lt;/em&gt; because the wrapper had been written assuming "timeout means delivered, the receiver was just slow."&lt;/p&gt;

&lt;p&gt;This is true for some kinds of fire-and-forget delivery. It is catastrophic for any kind of state-changing call. The downstream system never received the call. The agent thought it had. The user-facing system had two views of the world that did not agree.&lt;/p&gt;

&lt;p&gt;In the log: a success line. No error. Nothing wrong.&lt;/p&gt;

&lt;p&gt;The boring fix: idempotency keys on every state-changing webhook, a status check after every call that crossed the timeout threshold, and a hard rule that a timeout is never treated as success without separate confirmation. A timeout tells you the status is unknown. It does not tell you the call was delivered.&lt;/p&gt;
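
&lt;p&gt;A sketch of the safe shape with &lt;code&gt;requests&lt;/code&gt;; the status endpoint is hypothetical, so substitute whatever confirmation your receiver actually exposes:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import uuid, requests

def deliver(url: str, event: dict) -&gt; str:
    key = uuid.uuid4().hex        # one key per logical event; persist
    try:                          # it if you plan to retry later
        resp = requests.post(url, json=event,
                             headers={"Idempotency-Key": key}, timeout=30)
        resp.raise_for_status()
        return "delivered"
    except requests.Timeout:
        # Unknown, not failed and not delivered: confirm out of band.
        return confirm(url, key)

def confirm(url: str, key: str) -&gt; str:
    # Hypothetical status endpoint keyed on the idempotency key; if
    # your receiver has none, re-send with the same key instead.
    status = requests.get(f"{url}/status/{key}", timeout=10)
    return "delivered" if status.ok else "unknown"
&lt;/code&gt;&lt;/pre&gt;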

&lt;h2&gt;
  
  
  The pattern across all five
&lt;/h2&gt;

&lt;p&gt;Every one of these failures had the same shape: a long-running agent ran into a resource limit or a state assumption that was fine for short tasks and broken for multi-day ones. The agent itself did not crash in three of the five cases. It just stopped being useful, and the supervisor was not watching for that.&lt;/p&gt;

&lt;p&gt;The hosting layer needs to do three things that aren't sexy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Memory and FD limits with warnings before the hard cap, not at the hard cap&lt;/li&gt;
&lt;li&gt;Process lineage tracking so spawned tasks can't outlive their parent's intention&lt;/li&gt;
&lt;li&gt;State-changing call confirmation, not just transport-level success&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you are running an agent on your laptop for an hour, none of this matters. If you are hosting OpenClaw agents in production for paying customers, all of this matters more than the model you picked.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I now log on day one of any agent project
&lt;/h2&gt;

&lt;p&gt;The thing that would have saved me the most pain on this run is just better logging from the start. None of the metrics below are exotic. None of them require an APM vendor. They are just the things I now scrape from any agent process before letting it run for more than 24 hours.&lt;/p&gt;

&lt;p&gt;Per agent loop I track: RSS memory, FD count, child process count, total tool calls in this loop, total context tokens, time since last tool call returned, and the result of the last 10 tool calls (success or specific error). Per host I track: free memory, total FDs in use, load average, and the pid count for the agent's user. Both go to a flat file with a timestamp. No dashboard. Just a thing I can grep when something is weird.&lt;/p&gt;
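
&lt;p&gt;A sketch of the per-loop sampler for a few of those numbers, reading straight from &lt;code&gt;/proc&lt;/code&gt; (so Linux-only); the log path is an example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json, os, time

def snapshot() -&gt; dict:
    with open("/proc/self/status") as f:
        status = dict(line.split(":", 1) for line in f if ":" in line)
    return {
        "ts": int(time.time()),
        "rss_kb": int(status["VmRSS"].split()[0]),
        "fd_count": len(os.listdir("/proc/self/fd")),
        "load_avg": os.getloadavg()[0],
    }

with open("/var/log/agent/metrics.jsonl", "a") as log:
    log.write(json.dumps(snapshot()) + "\n")
&lt;/code&gt;&lt;/pre&gt;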

&lt;p&gt;Five of those metrics would have caught four of the five failures I described, hours or days before they actually broke things. The fifth (the webhook timeout) needs application-level logging, not host-level. That one is on the developer of the wrapper.&lt;/p&gt;

&lt;p&gt;I have a longer guide on the hosting end of this at &lt;a href="https://rapidclaw.dev/blog/ai-agent-hosting-complete-guide" rel="noopener noreferrer"&gt;https://rapidclaw.dev/blog/ai-agent-hosting-complete-guide&lt;/a&gt;, but if you read nothing else, read this: the most expensive failure mode is the one that doesn't crash the process. Crashes get noticed. Slow-degrading agents do not.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>devops</category>
      <category>llms</category>
    </item>
    <item>
      <title>MicroVM vs Docker for AI agents: I gave one sudo and broke the other</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Mon, 04 May 2026 07:38:06 +0000</pubDate>
      <link>https://dev.to/rapidclaw/microvm-vs-docker-for-ai-agents-i-gave-one-sudo-and-broke-the-other-2loc</link>
      <guid>https://dev.to/rapidclaw/microvm-vs-docker-for-ai-agents-i-gave-one-sudo-and-broke-the-other-2loc</guid>
      <description>&lt;p&gt;Last week I ran a small experiment that I should have run a year ago.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1709tyoqyiznvncwuza.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1709tyoqyiznvncwuza.png" alt="MicroVM vs Docker for AI agents — cover" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Same agent code. Same model. Same task list: install three Python packages listed in a CSV, fetch a few APIs, write JSON to disk, run a long-running scheduled job. Two isolation modes. One was a Docker container with the agent process inside, mounted volume, the usual. The other was a Firecracker microVM running a slim Linux image with the agent on top. Both got &lt;code&gt;sudo&lt;/code&gt; inside their sandbox. I let them run for seven days each, then rotated.&lt;/p&gt;

&lt;p&gt;I went in expecting the difference to be small. Memory overhead, maybe boot time. The actual difference was bigger than that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day one and two: Docker
&lt;/h2&gt;

&lt;p&gt;Setup was the part everyone has done before. &lt;code&gt;docker run&lt;/code&gt; with a few mounts, the agent gets a shell inside, away we go. The agent is told to &lt;code&gt;apt-get install&lt;/code&gt; a couple of system libraries it has decided it needs. That works. It writes a 40 MB cache file to &lt;code&gt;/tmp&lt;/code&gt;. That works. It runs a long job that opens a few hundred sockets to a public API.&lt;/p&gt;

&lt;p&gt;Around hour eighteen the host machine's &lt;code&gt;dmesg&lt;/code&gt; started printing memory-pressure warnings. Not from the agent itself. From a &lt;em&gt;different&lt;/em&gt; container running on the same host. The Python process inside my agent's container had a retry loop that would not stop holding file descriptors. That was the leak. Linux does not look at which container a process lives in when it picks something to OOM-kill. It just picks. The neighbor went down.&lt;/p&gt;

&lt;p&gt;This is the part of Docker that no production person likes to talk about. Containers share the host kernel. They share the host scheduler. When one container goes off the rails, the rest of them feel it on the same host. If you're a small shop running one agent on one host, fine. None of this matters yet. For anything that looks like a tenant model, it stops being fine fast.&lt;/p&gt;

&lt;p&gt;The other thing I noticed on day two: the agent decided to &lt;code&gt;chmod 777&lt;/code&gt; a folder it didn't own. Not malicious. Just a Python script doing what Python scripts do when permissions throw an error. With &lt;code&gt;sudo&lt;/code&gt; available inside the container, it succeeded. The host filesystem was untouched (the folder lived in the container's own writable layer, not on a mount), but anything &lt;em&gt;inside&lt;/em&gt; that container was now wide open to whatever the agent did next.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fit03r7lm9di81js4q96t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fit03r7lm9di81js4q96t.png" alt="Isolation layers — Docker vs MicroVM stacks" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Day three: rebuild as a MicroVM
&lt;/h2&gt;

&lt;p&gt;I tore down the Docker setup and rebuilt the same agent inside a Firecracker microVM. Same code, same packages, same task list. Boot time went from about 200 ms (Docker) to about 700 ms (microVM). Memory baseline went up by roughly 60 MB for the kernel itself.&lt;/p&gt;

&lt;p&gt;That is the cost. You pay it once.&lt;/p&gt;

&lt;p&gt;What you get is a separate kernel. The agent's &lt;code&gt;sudo&lt;/code&gt; is a real Linux &lt;code&gt;sudo&lt;/code&gt; inside a real kernel that nobody else on the host shares. When the agent ran the same &lt;code&gt;chmod 777&lt;/code&gt; thing, it still happened, but the blast radius was a single VM that I could destroy and recreate in under a second. When the agent leaked file descriptors, only the VM's per-process limits got hit. The host kernel didn't notice.&lt;/p&gt;

&lt;p&gt;Day four I let the agent install a kernel module on purpose. In Docker this would be a host-level event (or blocked outright, depending on capabilities). In the microVM it loaded into the VM's own kernel and stayed there. Whether that is useful or annoying is up to you. Either way, it stayed inside the VM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day five through seven: chaos
&lt;/h2&gt;

&lt;p&gt;I wrote a script that told the agent to do five terrible things in a row, the kind of thing a real long-running agent might trip into over a week:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open 10,000 file descriptors and never close them&lt;/li&gt;
&lt;li&gt;Allocate memory in a tight loop until something breaks&lt;/li&gt;
&lt;li&gt;Spawn 200 child processes&lt;/li&gt;
&lt;li&gt;Write a 5 GB file to its own filesystem&lt;/li&gt;
&lt;li&gt;Make a &lt;code&gt;sudo iptables&lt;/code&gt; rule that drops all incoming packets&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc095nzkclixgcwdr3x5v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc095nzkclixgcwdr3x5v.png" alt="Chaos test results — five awful things, two sandboxes" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Docker handled (1) and (2) with cgroup limits if you remember to set them. Most people don't. The default is no limit. By the time I noticed (1) on the third run, the host had hit its global file descriptor ceiling and SSH started refusing connections.&lt;/p&gt;

&lt;p&gt;The microVM handled (1) through (4) inside its own boundary. The kernel inside the VM hit its limits, the VM started thrashing, and then I killed it from outside. The host machine never noticed any of it. (5) was the most interesting. The agent set the iptables rule and locked itself out of network. In Docker the agent had affected the container's network namespace, recoverable but messy. In the microVM the agent had only touched the VM's network stack. I rebooted the VM in 800 ms and was back.&lt;/p&gt;

&lt;p&gt;Snapshots are where the gap really opened. Firecracker can snapshot a running VM to disk and resume it later. I snapshotted the agent mid-task on day six, killed the host, restored the snapshot on a different machine, and the agent picked up its loop one second later without knowing anything had happened. Try that with a Docker container and you will spend the afternoon learning about CRIU and giving up.&lt;/p&gt;
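
&lt;p&gt;For reference, the pause-then-snapshot sequence against Firecracker's API socket looks roughly like this; the paths and the &lt;code&gt;requests-unixsocket&lt;/code&gt; client are my choices, so check the current Firecracker docs for the exact fields:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import requests_unixsocket        # pip install requests-unixsocket

BASE = "http+unix://%2Ftmp%2Ffirecracker.socket"   # %2F is an escaped "/"
s = requests_unixsocket.Session()

# Pause the VM, then write a full snapshot (device state + guest memory).
s.patch(f"{BASE}/vm", json={"state": "Paused"}).raise_for_status()
s.put(f"{BASE}/snapshot/create", json={
    "snapshot_type": "Full",
    "snapshot_path": "/snapshots/agent.vmstate",
    "mem_file_path": "/snapshots/agent.mem",
}).raise_for_status()
&lt;/code&gt;&lt;/pre&gt;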

&lt;h2&gt;
  
  
  The link to actually running this in production
&lt;/h2&gt;

&lt;p&gt;Doing this experiment locally is one thing. Running an agent like this for a paying customer, on hardware you have to keep alive for 30+ days at a stretch, is a different problem. The boring infrastructure problem nobody writes about: it isn't the isolation primitive that's hard. Anyone can spin up Firecracker. The hard part is babysitting a hundred of these things at once, snapshotting them every so often, recovering them when a host dies, and not losing the agent's state in the meantime.&lt;/p&gt;

&lt;p&gt;I'll plug the thing I work on once and move on. The &lt;a href="https://rapidclaw.dev/blog/openclaw-hosting-cost-self-host-vs-managed" rel="noopener noreferrer"&gt;Builder Sandbox tier&lt;/a&gt; is a managed wrapper around exactly this microVM-with-sudo model, with the snapshot and recovery loop already wired up. If you don't want to babysit it, that's the option. If you want to babysit it yourself, Firecracker is open source and the docs are fine.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd tell past me
&lt;/h2&gt;

&lt;p&gt;Running one tiny agent on a host you own? Docker is fine. The overhead is real, the boundary is real enough for that scope, and you already know the tools.&lt;/p&gt;

&lt;p&gt;The moment you have an agent that needs &lt;code&gt;sudo&lt;/code&gt;, runs for more than a few days, and might do something weird at 3am, switch to a real VM. The 60 MB and the 500 ms of extra boot time will pay for themselves the first time the agent does something stupid. The snapshot story alone is worth the migration.&lt;/p&gt;

&lt;p&gt;The thing I didn't expect, going in, was how much my mental model changed. With Docker I treat the container as a thing the agent lives &lt;em&gt;in&lt;/em&gt;. With a microVM I treat the VM as a thing the agent &lt;em&gt;is&lt;/em&gt;. That shift, more than any individual feature, is what made the seven-day test feel different on day three than it did on day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  A few specifics, in case you try this
&lt;/h2&gt;

&lt;p&gt;The microVM rebuild was Firecracker 1.5 with a vanilla Ubuntu 22.04 rootfs, 2 vCPU, 1 GB RAM, virtio-net for the network. Boot times stayed under a second consistently once I trimmed the kernel config. I used &lt;code&gt;jailer&lt;/code&gt; to drop privileges on the Firecracker process itself, and seccomp filters on the agent's user inside the VM. None of that is exotic. The Firecracker docs cover all of it. The only thing I had to figure out the hard way was the snapshot directory layout, which the docs assume you already understand.&lt;/p&gt;

&lt;p&gt;For Docker the comparison build was the standard &lt;code&gt;python:3.12-slim&lt;/code&gt; base with the agent process as the entrypoint, a tmpfs mount for &lt;code&gt;/tmp&lt;/code&gt;, and &lt;code&gt;--cap-drop=ALL&lt;/code&gt; plus only the capabilities the agent actually needed. Even with that, the chmod-777 case still worked inside the container because &lt;code&gt;sudo&lt;/code&gt; plus &lt;code&gt;CAP_FOWNER&lt;/code&gt; is enough for filesystem-mode changes. You can lock this down further with seccomp profiles, but at that point you have built a worse VM with extra steps.&lt;/p&gt;

&lt;p&gt;If you want the longer cost breakdown of running this yourself versus paying someone to keep it alive, I wrote that up here: &lt;a href="https://rapidclaw.dev/blog/openclaw-hosting-cost-self-host-vs-managed" rel="noopener noreferrer"&gt;https://rapidclaw.dev/blog/openclaw-hosting-cost-self-host-vs-managed&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>devops</category>
      <category>hosting</category>
    </item>
    <item>
      <title>The Boring AI Agent Workloads That Actually Pay in 2026</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Mon, 04 May 2026 04:38:04 +0000</pubDate>
      <link>https://dev.to/rapidclaw/the-boring-ai-agent-workloads-that-actually-pay-in-2026-1lal</link>
      <guid>https://dev.to/rapidclaw/the-boring-ai-agent-workloads-that-actually-pay-in-2026-1lal</guid>
      <description>&lt;p&gt;Every other post on my feed is still pitching the "ambient agent that runs your whole job." If you actually run agents in production, you know that story is mostly vibes. The workloads that real people pay for, repeatedly, look almost embarrassingly mundane.&lt;/p&gt;

&lt;p&gt;After a year of running agents for SMEs — accounting firms, e-commerce shops, two solo law practices — here are the four shapes of work that consistently survive the trial-to-paid conversion. None of them require AGI. All of them require an agent that doesn't fall over on day eleven.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Scheduled jobs that used to be cron + a human
&lt;/h2&gt;

&lt;p&gt;The unsexy starting point. A cron job kicks off at 6 AM. It logs into a portal, scrapes a number, drops it in a sheet, and Slacks the team if a threshold trips. That used to be a half-day Selenium project plus a $40/mo VPS plus the ongoing maintenance tax of the portal redesigning itself every quarter.&lt;/p&gt;

&lt;p&gt;An agent flips the math. The same job is now a five-line prompt and a browser tool. The portal redesigning itself is the agent's problem now, not yours. The cost question stops being "how much engineering time" and becomes "how reliable is the runtime."&lt;/p&gt;

&lt;p&gt;That second question is the entire moat for managed agent platforms. It's also the reason most of the open-source-only "just spin up your own" pitches fall apart at month two. The agent works fine. The orchestration around it — retries, secret rotation, the headless browser updating, the model deprecating — is what bleeds the operator dry.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Browser automation that was too brittle for RPA
&lt;/h2&gt;

&lt;p&gt;If you've ever priced UiPath or Automation Anywhere for a small business, you know the answer: it's not for them. The licensing is enterprise-shaped and the bot creation requires a specialist. Meanwhile, the actual workflow — log in, click three things, download a CSV, email it — is the kind of thing every five-person operation needs done weekly.&lt;/p&gt;

&lt;p&gt;Agents with a real sandboxed browser tool eat this category. Not because they're smarter than RPA, but because they degrade gracefully. When the "Export" button moves three pixels left, an agent finds it. When the page adds a cookie banner, an agent dismisses it. The thing that used to take a consultant three days to update takes the agent zero.&lt;/p&gt;

&lt;p&gt;The catch is that "real sandboxed browser" is doing a lot of work in that sentence. A Docker container with a headless Chromium is fine for a demo. For production, you want a MicroVM with sudo so the agent can actually install things, persistent file storage so its session survives a restart, and live port forwarding so you can watch it work when something looks off. That's roughly the hardware bill that &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;managed OpenClaw hosting&lt;/a&gt; abstracts away.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Coding agents that don't touch production
&lt;/h2&gt;

&lt;p&gt;This one is the most counterintuitive. The coding agent market that's working isn't replacing engineers — it's replacing the "I'll get to it next sprint" backlog at companies that don't have engineers.&lt;/p&gt;

&lt;p&gt;Real example: a roofing company. Their internal "system" is a Google Sheet, a Calendly, and three Zapiers. They have a list of forty small tweaks they want — a column added here, a webhook there, a conditional email. None of it is hard. All of it is too small for a contractor and too unfamiliar for the owner. An agent with shell access and the patience to iterate clears that backlog in a weekend. The owner doesn't read the code. The owner reads the result.&lt;/p&gt;

&lt;p&gt;The reliability bar here is different from the production code reliability bar. The agent doesn't need to write perfect code. It needs to not silently break the spreadsheet that runs the business. That's an observability problem, not an intelligence problem. Snapshot the state before each change, let the operator roll back, and the whole category gets safer than it sounds.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. The "always-on assistant" that is actually a search index
&lt;/h2&gt;

&lt;p&gt;The mythology of the AI assistant is that it answers anything. The reality of the paying assistant is narrower: it knows your stuff. Your contracts, your meeting notes, your invoices, your support tickets. It can pull a number out of a 200-page master services agreement faster than the human who wrote the agreement.&lt;/p&gt;

&lt;p&gt;These deployments don't fail because the model is dumb. They fail because the data plumbing is broken — stale embeddings, a connector that silently drops half the documents, a permission boundary that leaks one tenant's data into another. None of which is a model problem.&lt;/p&gt;

&lt;p&gt;This is the workload most people quote when they say "we tried AI and it didn't work." What didn't work was the integration. The model is fine.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually distinguishes the survivors
&lt;/h2&gt;

&lt;p&gt;Look at those four. None of them require a frontier model. None of them require "agentic reasoning" past a couple of hops. What they require is a runtime that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Doesn't crash, or recovers gracefully when it does&lt;/li&gt;
&lt;li&gt;Has the right tools wired up (browser, shell, file storage, an email sender)&lt;/li&gt;
&lt;li&gt;Surfaces what it's doing well enough that a non-engineer can tell when it's stuck&lt;/li&gt;
&lt;li&gt;Costs predictably — flat monthly is much easier to sell to a small business than per-token&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why the managed-hosting framing is starting to take over the SME conversation. The buyer doesn't want to think about API keys, model selection, or which sandbox their agent is running in. They want the POS-system experience: pay a flat fee, the agent works, somebody else is on the hook when it breaks.&lt;/p&gt;

&lt;p&gt;If you're building this for yourself, the &lt;a href="https://rapidclaw.dev/pricing" rel="noopener noreferrer"&gt;Builder Sandbox tier on RapidClaw&lt;/a&gt; gives you the MicroVM with sudo and live port-forwarding without the infra babysitting. If you're past the building phase and need something an operator can actually run unsupervised, that's the &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;white-glove side&lt;/a&gt; of the same platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd skip in 2026
&lt;/h2&gt;

&lt;p&gt;For completeness — the workloads that keep showing up in pitch decks but quietly failing the trial-to-paid test:&lt;/p&gt;

&lt;p&gt;The "AI sales rep" that prospects and closes by itself. The "AI manager" that runs your team's standup. The "autonomous research analyst" that reads ten papers and synthesizes a thesis. These will get there. They are not there yet. If you're trying to make rent this quarter, build the boring one.&lt;/p&gt;

&lt;p&gt;The boring one is what's paying.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tijo Gaucher runs RapidClaw, managed OpenClaw hosting for non-technical operators. Previously, he wrote content at Human + AI.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>automation</category>
      <category>devops</category>
    </item>
    <item>
      <title>I ran ONE AI agent for 30 days straight — here's what actually broke</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Thu, 30 Apr 2026 00:46:23 +0000</pubDate>
      <link>https://dev.to/rapidclaw/i-ran-one-ai-agent-for-30-days-straight-heres-what-actually-broke-7df</link>
      <guid>https://dev.to/rapidclaw/i-ran-one-ai-agent-for-30-days-straight-heres-what-actually-broke-7df</guid>
      <description>&lt;p&gt;Most AI agent demos are shaped like a 90-second loop: prompt → tool call → answer. The interesting failures don't show up there. They show up around day 7, when the process you started in a tmux session has eaten 4 GB of RAM, your browser sub-agent is wedged on a captcha you never noticed, and the thing has been retrying the same failed Stripe webhook for 36 hours.&lt;/p&gt;

&lt;p&gt;I ran a single OpenClaw agent on a small VPS for 30 days. It was scoped to one boring job: triage incoming sales emails, draft replies, file them in the right folder, ping Slack on anything weird. The agent ran continuously, scheduled by cron, with persistent state in SQLite. No multi-agent orchestration, no fancy memory layer — just one process trying to stay alive.&lt;/p&gt;

&lt;p&gt;Here is what actually broke, in the order it happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 1–3: everything looks great
&lt;/h2&gt;

&lt;p&gt;The first three days are a honeymoon. Latency is good, the agent handles edge cases I didn't think to specify, and the inbox triage rules quietly improve as it picks up patterns. This is where most demo videos end. It's also where most teams declare victory and move on, which is the mistake.&lt;/p&gt;

&lt;p&gt;Two things to instrument before day 4 even starts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Per-run token cost, written to a flat log. You'll need this when you investigate cost drift in week two.&lt;/li&gt;
&lt;li&gt;Process RSS memory, sampled every minute. The number that matters isn't the peak — it's the slope.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're using a hosted setup like &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;RapidClaw's managed OpenClaw runtime&lt;/a&gt;, the slope is graphed for you. If you're self-hosting, write the sampler yourself before you forget. You will forget.&lt;/p&gt;
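
&lt;p&gt;A sketch of that sampler as a standalone watcher; the slope arithmetic is the point, the rest is plumbing:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import sys, time

def rss_kb(pid: int) -&gt; int:
    # Current resident set size of the watched process, from /proc.
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    raise RuntimeError("process gone")

pid = int(sys.argv[1])
window = []                               # (timestamp, rss) samples
while True:
    window.append((time.time(), rss_kb(pid)))
    window = window[-60:]                 # keep roughly the last hour
    if len(window) &gt; 1:
        dt = window[-1][0] - window[0][0]
        slope = (window[-1][1] - window[0][1]) / dt   # kB per second
        print(f"{int(window[-1][0])} rss_kb={window[-1][1]} slope={slope:.3f}")
    time.sleep(60)
&lt;/code&gt;&lt;/pre&gt;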

&lt;h2&gt;
  
  
  Day 4: the context bloat starts
&lt;/h2&gt;

&lt;p&gt;The agent's working memory file grew to 18,000 tokens. None of it was strictly wrong. It was just… accumulated. Old email threads it had handled, notes about edge cases, a half-finished plan for a problem that resolved itself two days earlier.&lt;/p&gt;

&lt;p&gt;The cost per run had quietly tripled.&lt;/p&gt;

&lt;p&gt;This is the most boring failure mode in long-running agents and the one nobody warns you about. Your prompt isn't getting worse — your context window is getting fatter. The fix is unglamorous: a compaction step that runs nightly, summarizes anything older than 48 hours into a few bullet points, and archives the rest to a file the agent can grep but doesn't auto-load.&lt;/p&gt;
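
&lt;p&gt;A minimal sketch of that nightly job; the note shapes and file names are assumptions, and the bullet-point summarization is left as a hook for one cheap model call:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json, time

CUTOFF_S = 48 * 3600

def compact_nightly(memory_path="memory.json", archive_path="archive.jsonl"):
    with open(memory_path) as f:
        notes = json.load(f)              # [{"ts": ..., "text": ...}]
    now = time.time()
    fresh = [n for n in notes if now - n["ts"] &lt; CUTOFF_S]
    stale = [n for n in notes if now - n["ts"] &gt;= CUTOFF_S]
    # summarize(stale) -&gt; a few bullets appended to fresh goes here.
    with open(archive_path, "a") as f:    # greppable, never auto-loaded
        for n in stale:
            f.write(json.dumps(n) + "\n")
    with open(memory_path, "w") as f:
        json.dump(fresh, f)
&lt;/code&gt;&lt;/pre&gt;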

&lt;p&gt;If you skip this, by day 14 you're paying GPT-4-class prices to send the model a partially-decayed copy of last week's todo list every single run.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 7: the first silent kill
&lt;/h2&gt;

&lt;p&gt;The OOM killer took the process at 3:47 AM. There was no error in the logs because the process didn't get to write one. It just stopped existing.&lt;/p&gt;

&lt;p&gt;This is where most self-hosted agent setups quietly die in production, and the operator doesn't notice for two days. The cron entry that launches the agent every 15 minutes keeps exiting cleanly whether or not the long-running process is still alive; nothing in the stack is supervising health.&lt;/p&gt;

&lt;p&gt;Three things you want before day 7:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A liveness file the agent touches on every successful run, plus an external check that alerts when it's stale for more than 30 minutes (a minimal sketch follows this list).&lt;/li&gt;
&lt;li&gt;A systemd unit (or equivalent) with &lt;code&gt;Restart=on-failure&lt;/code&gt; and &lt;code&gt;MemoryMax=&lt;/code&gt; set well below your VPS's actual RAM. You want the agent to die predictably and come back, not get reaped silently.&lt;/li&gt;
&lt;li&gt;Logs that flush on every event, not on buffer fill. A buffered log is a log you don't have when the OOM killer arrives.&lt;/li&gt;
&lt;/ul&gt;
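
&lt;p&gt;The liveness pair, sketched; the path is an example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pathlib, time

LIVENESS = pathlib.Path("/var/run/agent/alive")

# Inside the agent, at the end of every successful run:
LIVENESS.touch()

# In an external watchdog (cron on another host, anything that does
# not share the fate of the process it is watching):
def is_stale(max_age_s: int = 1800) -&gt; bool:
    try:
        return time.time() - LIVENESS.stat().st_mtime &gt; max_age_s
    except FileNotFoundError:
        return True       # never ran, or got wiped: alert on that too
&lt;/code&gt;&lt;/pre&gt;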

&lt;p&gt;This is also the point where the "managed hosting" pitch starts to make economic sense for non-developers. Setting up systemd, a watchdog, log shipping, and metric scraping for &lt;em&gt;one&lt;/em&gt; agent is two evenings of work for a competent backend engineer. SMEs don't have that engineer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 11: the captcha trap
&lt;/h2&gt;

&lt;p&gt;The agent's browser sub-task hit a captcha while loading a vendor portal. It didn't fail. It didn't error. It just waited. For 90 minutes. Then the headless Chrome process leaked and the next 14 runs spawned new Chrome instances on top of it.&lt;/p&gt;

&lt;p&gt;The lesson is that anything involving a real browser needs both a hard wall-clock timeout and a "did the page actually finish loading the thing I asked for?" assertion. A 200 response is not a success signal when the body is a captcha challenge.&lt;/p&gt;
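
&lt;p&gt;A sketch of both guards with Playwright's sync API; the selector and the timeout values are examples, not recommendations:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from playwright.sync_api import sync_playwright, TimeoutError as PWTimeout

def fetch_total(url: str) -&gt; str:
    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, timeout=60_000)   # hard wall clock, in ms
            # A 200 is not success; the element we came for is.
            cell = page.wait_for_selector("#invoice-total", timeout=15_000)
            return cell.inner_text()
        except PWTimeout:
            raise RuntimeError("page never produced the data; captcha?")
        finally:
            browser.close()                  # no leaked Chromium
&lt;/code&gt;&lt;/pre&gt;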

&lt;p&gt;If your agent does any web automation at all, this will happen to you. The honest version of the agent demo isn't "watch it browse the web" — it's "watch the watchdog kill a stuck browser session and surface a human-readable reason for it."&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 18: model drift on the provider side
&lt;/h2&gt;

&lt;p&gt;The replies started getting weirdly formal. Not wrong — just off. I couldn't reproduce it on Claude with the same prompt locally, but in production the change was clear over a 3-day window.&lt;/p&gt;

&lt;p&gt;Eventually I figured out the provider had silently routed a percentage of traffic to a slightly different model variant. This is a real thing that happens, and the only way you catch it is logging a stable hash of the prompt and the full response for every run, then diffing aggregates week-over-week. If you're not doing this, you'll just notice "vibes feel different" and have no evidence.&lt;/p&gt;
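
&lt;p&gt;The logging side is a few lines; the record shape here is mine:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import hashlib, json, time

def log_run(prompt: str, response: str, model: str) -&gt; None:
    # One JSON line per run; diff aggregates of these week over week.
    rec = {
        "ts": int(time.time()),
        "model": model,                    # whatever the API reports back
        "prompt_sha": hashlib.sha256(prompt.encode()).hexdigest()[:16],
        "response": response,              # full text, for later diffing
    }
    with open("runs.jsonl", "a") as f:
        f.write(json.dumps(rec) + "\n")
&lt;/code&gt;&lt;/pre&gt;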

&lt;h2&gt;
  
  
  Day 24: the small bug that hid in the schedule
&lt;/h2&gt;

&lt;p&gt;A timezone bug in the cron expression meant the agent ran exactly zero times for 18 hours during a holiday DST shift. Nobody noticed because there was no one in the inbox to notice. The triage queue piled up, and the agent's first run after the gap took 11 minutes and 92,000 tokens to dig out.&lt;/p&gt;

&lt;p&gt;Schedules are infrastructure. Test them on a fake clock before you ship them.&lt;/p&gt;
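
&lt;p&gt;One way to do that, using &lt;code&gt;croniter&lt;/code&gt; (my library choice) to walk the schedule across the 2026 US DST boundary and assert there is no gap:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from datetime import datetime
from zoneinfo import ZoneInfo
from croniter import croniter        # pip install croniter

tz = ZoneInfo("America/New_York")
clock = datetime(2026, 3, 7, 0, 0, tzinfo=tz)   # night before DST starts
it = croniter("*/15 * * * *", clock)

fires = [it.get_next(datetime) for _ in range(200)]
gaps = [(b - a).total_seconds() for a, b in zip(fires, fires[1:])]
assert max(gaps) &lt;= 15 * 60 + 1, f"gap of {max(gaps):.0f}s across the shift"
&lt;/code&gt;&lt;/pre&gt;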

&lt;h2&gt;
  
  
  Day 30: what stuck
&lt;/h2&gt;

&lt;p&gt;The agent is still running. The job is unglamorous, the per-run cost is now lower than day 1 because of the compaction step, and most weeks I don't think about it. That's the real success criterion for a long-running agent: do you stop having to think about it?&lt;/p&gt;

&lt;p&gt;The narrative around "ambient AI agents that do your whole job" is still mostly vibes. The agents that actually pay rent today are boring: scheduled jobs, browser automation, coding agents, inbox triage. They're sticky because once you have one running and supervised, the cost of replacing it is high. They're hard because the supervision is the actual product.&lt;/p&gt;

&lt;p&gt;If you're a developer building these for yourself, lean into systemd, structured logs, and a 5-line health check. If you're not — or you're shipping this for non-technical operators who can't be on-call for a Python process — managed runtimes like &lt;a href="https://rapidclaw.dev/pricing" rel="noopener noreferrer"&gt;RapidClaw&lt;/a&gt; exist precisely because day-7 reliability is a product, not a feature.&lt;/p&gt;

&lt;p&gt;The demo is easy. The 30-day uptime is the moat.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tijo writes about practical AI agents at &lt;a href="https://humanai.news" rel="noopener noreferrer"&gt;Human + AI&lt;/a&gt;. RapidClaw is the managed-hosting side of the same operator-focused practice — built for people who want a working AI assistant without becoming a Linux admin.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>observability</category>
      <category>webdev</category>
    </item>
    <item>
      <title>[The 8-Turn Problem] Why Your Agent Fails at Turn 3 and You Only Notice at Turn 7</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Mon, 20 Apr 2026 04:32:26 +0000</pubDate>
      <link>https://dev.to/rapidclaw/the-8-turn-problem-why-your-agent-fails-at-turn-3-and-you-only-notice-at-turn-7-534c</link>
      <guid>https://dev.to/rapidclaw/the-8-turn-problem-why-your-agent-fails-at-turn-3-and-you-only-notice-at-turn-7-534c</guid>
      <description>&lt;p&gt;Last Tuesday an agent I shipped decided, mid-conversation, that the user's name was "Export CSV." It wasn't. Seven turns earlier, a tool result had come back with a quoted header row where a &lt;code&gt;username&lt;/code&gt; field should have been, and the model silently absorbed that string as ground truth. Every subsequent turn degraded quietly — apologetic tone, subtle hallucinations, a refusal that referenced "your account, Export CSV."&lt;/p&gt;

&lt;p&gt;The per-call logs looked fine. The latencies were green. Token usage was nominal. The only way to see the break was to reconstruct the whole conversation as a causal graph and follow the poison forward.&lt;/p&gt;

&lt;p&gt;This is the 8-turn problem. It's the single most expensive class of bug I ship, and most of the observability stacks I've tried were built for a world where requests are independent. They aren't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why request-level monitoring lies
&lt;/h2&gt;

&lt;p&gt;Traditional APM assumes a request is a closed unit: it came in, it did something, it came out, and if you aggregate p99 and error rate you know whether the system is healthy. That model was fine for stateless services. It's openly broken for agents.&lt;/p&gt;

&lt;p&gt;An agent request carries state that isn't in the HTTP payload. It carries the conversation. It carries the tool results that previous turns wrote into context. It carries the model's own prior outputs, which are now training the next inference. A turn that looks locally correct — valid JSON, successful tool call, reasonable response — can be the exact moment your agent quietly goes off the rails for the next 40 minutes of user conversation.&lt;/p&gt;

&lt;p&gt;I watch three numbers more than I watch latency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Turn-over-turn intent drift&lt;/strong&gt;: does turn N still match the user's original ask?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool result contamination rate&lt;/strong&gt;: how often does a tool response contain strings that look like instructions?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session success rate&lt;/strong&gt;, not request success rate: did the user actually get what they came for?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of those are visible from a metrics dashboard that aggregates individual calls. You need traces that span the whole session, and you need them structured so you can walk them backward from the failure.&lt;/p&gt;
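
&lt;p&gt;The contamination check is the easiest of the three to start measuring. A crude sketch; the patterns are examples, nowhere near a vetted list:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import re

SUSPECT = re.compile(
    r"(?:ignore (?:all )?previous|you are now|system\s*:|disregard)",
    re.IGNORECASE,
)

def looks_contaminated(tool_result: str) -&gt; bool:
    # Flag tool output that reads like an instruction to the model.
    return bool(SUSPECT.search(tool_result))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Feed the per-session count of flagged tool results into your rollups; a step change usually means an upstream data source went weird, not the model.&lt;/p&gt;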

&lt;h2&gt;
  
  
  What a useful trace actually looks like
&lt;/h2&gt;

&lt;p&gt;The OpenTelemetry GenAI SIG has been converging on &lt;code&gt;gen_ai.*&lt;/code&gt; semantic conventions, which is good. The prevailing shape: each tool call, each LLM invocation, each retrieval is a child span, parented to the turn, parented to the session. Do that, and your trace tree tells the story of the reasoning chain.&lt;/p&gt;

&lt;p&gt;A few things people get wrong here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't put prompts in span attributes.&lt;/strong&gt; Attributes are indexed, have size caps, and leak straight into your observability backend as PII. Use span events. Events can be sampled, redacted, or dropped at the Collector without touching app code. This one change will save you a compliance conversation later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parent spans by the turn, not just the call.&lt;/strong&gt; If every LLM call is a root span, you lose the conversational structure. The parent-child relationship between turn 3 and turn 7 is the thing you actually want to trace. If you're building this yourself, each session gets a &lt;code&gt;trace_id&lt;/code&gt;, each turn gets a &lt;code&gt;span_id&lt;/code&gt; under it, and tool calls and inferences nest under the turn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Emit a "decision" span.&lt;/strong&gt; The LLM call itself is one span, but what the agent &lt;em&gt;did&lt;/em&gt; with the output — picked a tool, rephrased, escalated — is a different concern and worth its own span. This is where drift shows up first.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;RapidClaw&lt;/a&gt; we default to this layout and bolt on session-level rollups so you can ask "which turn did this fail at?" without scrolling through 40 spans.&lt;/p&gt;

&lt;h2&gt;
  
  
  The debugging workflow that actually works
&lt;/h2&gt;

&lt;p&gt;When a user reports an agent did something weird, the temptation is to grep logs for the error. There's usually no error. Here's the loop I run instead:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pull the full session trace.&lt;/strong&gt; Not the failing turn — the whole conversation, from the first user message forward.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diff the system state between turns.&lt;/strong&gt; What changed in memory, in the scratchpad, in the retrieved context? This is where you find the poisoned field.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replay from the suspected turn with the same tool responses.&lt;/strong&gt; Most agent frameworks let you rehydrate a session; if yours doesn't, you need to fix that before anything else.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mutate one variable at a time.&lt;/strong&gt; Change the tool response. Change the model. Change the system prompt. Bisect until the behavior flips.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write the regression test at the session level.&lt;/strong&gt; Not a unit test on a single call — a full conversation fixture with expected final state.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Step 3 is where most teams stall. If you can't replay a session deterministically, you're guessing. The &lt;a href="https://rapidclaw.dev/features" rel="noopener noreferrer"&gt;replay and re-simulate workflow&lt;/a&gt; is the single feature I'd build first in any agent observability tool, including ones I don't run.&lt;/p&gt;
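
&lt;p&gt;A sketch of the replay harness; the file shape and the &lt;code&gt;agent.step()&lt;/code&gt; interface are assumptions about your framework, not a standard:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json

def replay(session_path: str, agent, from_turn: int = 0):
    # One recorded turn per JSON line: the user message plus the exact
    # tool responses production saw.
    with open(session_path) as f:
        turns = [json.loads(line) for line in f]
    for t in turns[from_turn:]:
        canned = t["tool_responses"]          # {tool_name: response}
        # Deterministic: serve recorded responses instead of live tools.
        agent.step(t["user_msg"],
                   tool_handler=lambda name, args, canned=canned: canned[name])
&lt;/code&gt;&lt;/pre&gt;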

&lt;h2&gt;
  
  
  Practical hygiene for small teams
&lt;/h2&gt;

&lt;p&gt;I run a small operation — think five agents in production, not five hundred — and the infrastructure choices reflect that. A few things that have held up at this scale:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One OTLP pipeline, everything flows through it.&lt;/strong&gt; Don't run a separate tracing stack for agents. Emit &lt;code&gt;gen_ai.*&lt;/code&gt; spans into the same Collector your regular services use, then branch at the exporter if you want a specialized backend for LLM-specific analysis. Vendor lock-in is a real risk and OTel is the escape hatch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample aggressively on success, keep everything on failure.&lt;/strong&gt; Full-conversation traces are expensive. A 1% tail-based sampler plus 100% retention for sessions that flagged any of: tool error, user thumbs-down, abnormal turn count, or model refusal — that gives you the signal without drowning.&lt;/p&gt;
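
&lt;p&gt;The keep/drop decision fits in a few lines; the session fields are assumptions about what your instrumentation records:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import zlib

def keep_session(session) -&gt; bool:
    # 100% retention on any failure signal, ~1% tail sample otherwise.
    if (session.tool_errors or session.thumbs_down
            or session.turn_count &gt; 40 or session.model_refused):
        return True
    return zlib.crc32(session.id.encode()) % 100 == 0
&lt;/code&gt;&lt;/pre&gt;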

&lt;p&gt;&lt;strong&gt;Tag sessions with the outcome, not just the request.&lt;/strong&gt; Instrument your app to send a session-end event with "did the user get what they wanted?" If you can't answer that, instrument it first. Every other metric is downstream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treat evals and tracing as the same system.&lt;/strong&gt; Evaluation runs are just traces with known expected outputs. The moment you split them into different tools you start writing glue code that never gets maintained.&lt;/p&gt;

&lt;h2&gt;
  
  
  The uncomfortable part
&lt;/h2&gt;

&lt;p&gt;Most agent reliability issues I've seen in the last six months aren't model issues. They're context management issues. The model is doing its job — taking what's in the window and producing a plausible next token. The bug is upstream, in what we let accumulate in that window.&lt;/p&gt;

&lt;p&gt;Observability for agents is, practically, observability for the context window over time. If your tooling can't show you how a single field mutated across seven turns, it can't help you debug the 8-turn problem. And the 8-turn problem is most of the bugs.&lt;/p&gt;

&lt;p&gt;If you want to see how we handle session-level tracing in practice, the &lt;a href="https://rapidclaw.dev/docs" rel="noopener noreferrer"&gt;RapidClaw quickstart&lt;/a&gt; walks through instrumenting a LangGraph agent in about ten minutes. But the principle matters more than the tool: trace the session, not the request, and save yourself the compliance conversation by keeping prompts out of attributes.&lt;/p&gt;

&lt;p&gt;Your agents are going to hallucinate. The question is whether you find out at turn 3 or turn 73.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>observability</category>
      <category>devops</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Implementing A2A Protocol for Multi-Agent Communication</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Sat, 18 Apr 2026 03:42:08 +0000</pubDate>
      <link>https://dev.to/rapidclaw/implementing-a2a-protocol-for-multi-agent-communication-2mah</link>
      <guid>https://dev.to/rapidclaw/implementing-a2a-protocol-for-multi-agent-communication-2mah</guid>
      <description>&lt;p&gt;If you've ever wired two AI agents together, you know the drill. Custom JSON schemas, bespoke HTTP endpoints, and a growing pile of adapter code that nobody wants to maintain. Google's A2A (Agent-to-Agent) protocol is the answer to that mess, and I've been implementing it across OpenClaw and Hermes agents on &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;Rapid Claw&lt;/a&gt; for the past few weeks. Here's what the implementation actually looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  What A2A solves (and what it doesn't)
&lt;/h2&gt;

&lt;p&gt;A2A standardizes the message envelope between independent agents. Think of it as the TCP/IP of agent communication — it defines how agents discover each other, exchange structured messages, delegate tasks, and return results. It doesn't care what framework you're using internally.&lt;/p&gt;

&lt;p&gt;The key distinction: MCP (Model Context Protocol) handles agent-to-tool communication. A2A handles agent-to-agent communication. You need both in any serious multi-agent deployment, and they compose cleanly because an A2A peer is essentially a tool with an agent on the other end.&lt;/p&gt;

&lt;h2&gt;
  
  
  The envelope format
&lt;/h2&gt;

&lt;p&gt;Every A2A message carries the same required fields. The interesting bits go in &lt;code&gt;payload&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;envelope&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a2a_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;msg_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;hex&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;correlation_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;conv_01HZKXR7...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# ties the conversation together
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trace_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4bf92f3577b34da6...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;span_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;00f067aa0ba902b7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sender&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;planner-openclaw-prod-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;framework&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openclaw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recipient&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;executor-hermes-prod-03&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;framework&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hermes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task.delegate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize_and_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com/report.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;constraints&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deadline_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reply_to&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://agents.rapidclaw.dev/a2a/planner/inbox&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expires_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-04-18T12:34:56Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three fields do the heavy lifting: &lt;code&gt;correlation_id&lt;/code&gt; threads multi-agent conversations into a single trace, &lt;code&gt;trace&lt;/code&gt; carries OpenTelemetry-compatible span context so your existing APM stitches everything together, and &lt;code&gt;intent&lt;/code&gt; is the verb recipients dispatch on — not a URL path.&lt;/p&gt;
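
&lt;p&gt;To make the &lt;code&gt;trace&lt;/code&gt; field concrete, here's a minimal sketch of serializing the active OpenTelemetry span as a W3C traceparent string, which is the shape most APM backends can stitch. Nesting it under a &lt;code&gt;traceparent&lt;/code&gt; key is my assumption; follow whatever your envelope schema specifies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from opentelemetry import trace

def current_traceparent():
    """Serialize the active span context as a W3C traceparent string."""
    ctx = trace.get_current_span().get_span_context()
    # version 00, 32-hex trace id, 16-hex span id, 2-hex flags
    return f"00-{ctx.trace_id:032x}-{ctx.span_id:016x}-{ctx.trace_flags:02x}"

# attach to the outgoing envelope so the recipient can continue the trace
envelope_trace = {"traceparent": current_traceparent()}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
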

&lt;h2&gt;
  
  
  Publishing an OpenClaw agent as an A2A endpoint
&lt;/h2&gt;

&lt;p&gt;An OpenClaw agent becomes an A2A peer by exposing an inbox and registering with a platform registry. The agent doesn't need to know who will call it — only how to respond:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HTTPException&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openclaw&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;a2a&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Envelope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verify_signature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sign&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;planner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;planner.yaml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/a2a/inbox&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;inbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Envelope&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;verify_signature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;allowed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TRUSTED_SIGNERS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;401&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;signature verification failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task.delegate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;planner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;reply&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Envelope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result.return&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;correlation_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;correlation_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AGENT_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;framework&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openclaw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="n"&gt;recipient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;()},&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PRIVATE_KEY&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The caller discovers executors by label, not URL — this is the part A2A gets right. No hardcoded hostnames:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;lookup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task.execute&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;framework&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hermes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;env&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Three patterns worth implementing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Request/reply&lt;/strong&gt; is the simplest. Planner calls executor, waits for the reply envelope, acts on it. Use for sub-tasks with clear deadlines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fan-out/fan-in&lt;/strong&gt; dispatches the same intent to a pool of executors in parallel, correlates replies by &lt;code&gt;correlation_id&lt;/code&gt;, and takes the first good answer or aggregates. This is how you build research-agent ensembles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Async with callback&lt;/strong&gt; fires a &lt;code&gt;task.delegate&lt;/code&gt; with a &lt;code&gt;reply_to&lt;/code&gt; URL and returns immediately. The callee POSTs a &lt;code&gt;result.return&lt;/code&gt; when done. You get durability without holding an HTTP connection open.&lt;/p&gt;
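
&lt;p&gt;Fan-out/fan-in is the one worth sketching, because the correlation logic is where people trip up. A rough sketch, assuming a hypothetical async &lt;code&gt;send&lt;/code&gt; helper that POSTs an envelope to a peer's inbox and returns the reply envelope:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio
import uuid

async def fan_out(payload, executors, send, timeout=30):
    """Send the same task.delegate to every executor; first good reply wins."""
    correlation_id = uuid.uuid4().hex
    calls = [
        send({
            "intent": "task.delegate",
            "correlation_id": correlation_id,
            "recipient": ex,
            "payload": payload,
        })
        for ex in executors
    ]
    for fut in asyncio.as_completed(calls, timeout=timeout):
        try:
            reply = await fut
        except Exception:
            continue  # one slow or failed executor shouldn't sink the ensemble
        if reply.get("correlation_id") == correlation_id and reply["payload"]["status"] == "ok":
            return reply
    raise RuntimeError("no executor returned a usable answer")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
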

&lt;h2&gt;
  
  
  The platform layer matters
&lt;/h2&gt;

&lt;p&gt;The protocol is the easy part. Production A2A needs five things at the platform layer: a registry for discovery, identity and mTLS per agent, routing with network policy, observability that stitches traces across agents, and per-agent rate limits. You can build all five yourself — Postgres registry, Vault for keys, Envoy for mTLS, OTEL collector, Redis for rate limits — or use something like &lt;a href="https://rapidclaw.dev/blog/a2a-protocol-ai-agent-hosting" rel="noopener noreferrer"&gt;Rapid Claw&lt;/a&gt; that ships them preconfigured.&lt;/p&gt;
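
&lt;p&gt;To make one of the five concrete: per-agent rate limiting is about ten lines with Redis. A fixed-window sketch (the key naming is mine, not part of any spec):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import redis

r = redis.Redis()

def allow(agent_id, limit=60, window_seconds=60):
    """Fixed-window limit: at most `limit` envelopes per window per agent."""
    key = f"a2a:rate:{agent_id}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, window_seconds)  # start the window on the first hit
    return count &amp;lt;= limit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Reject with a rate-limit error envelope when &lt;code&gt;allow()&lt;/code&gt; returns false, the same way an HTTP API would return 429.&lt;/p&gt;
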

&lt;p&gt;If you're thinking about multi-agent architectures more broadly, I wrote up the common &lt;a href="https://rapidclaw.dev/blog/multi-agent-orchestration-patterns" rel="noopener noreferrer"&gt;orchestration patterns&lt;/a&gt; (planner/executor, supervisor, blackboard) that pair well with A2A as the transport layer.&lt;/p&gt;

&lt;p&gt;A2A isn't revolutionary — it's the boring infrastructure piece that was missing. And boring infrastructure is exactly what you want when you're trying to ship agent systems that actually work in production.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>[Patterns] AI Agent Error Handling That Actually Works</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Fri, 17 Apr 2026 08:47:16 +0000</pubDate>
      <link>https://dev.to/rapidclaw/patterns-ai-agent-error-handling-that-actually-works-1a57</link>
      <guid>https://dev.to/rapidclaw/patterns-ai-agent-error-handling-that-actually-works-1a57</guid>
      <description>&lt;p&gt;Most AI agent tutorials show the happy path. Your agent calls an LLM, gets a response, does the thing. Ship it.&lt;/p&gt;

&lt;p&gt;Then production happens. Rate limits. Timeouts. Malformed responses. Context window overflows. Your agent goes from "demo-ready" to "incident-generating" in about 48 hours.&lt;/p&gt;

&lt;p&gt;I run a small operation — 5 agents max, solo founder. Every failure that wakes me up at 3am is one I should have handled in code. Here are the patterns that actually work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Classify Your Errors First
&lt;/h2&gt;

&lt;p&gt;Not all errors deserve the same treatment. The first thing I do in any agent system is classify failures into two buckets:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transient errors&lt;/strong&gt;: Rate limits (429), timeouts, temporary network blips, model overload. These will probably work if you try again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Permanent errors&lt;/strong&gt;: Invalid API keys, malformed prompts, context window exceeded, model doesn't exist. Retrying won't help.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ErrorClassifier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;TRANSIENT_CODES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;502&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;504&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@staticmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status_code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ErrorClassifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TRANSIENT_CODES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transient&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transient&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;permanent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This classification drives everything downstream. Transient errors get retries. Permanent errors get logged, reported, and gracefully degraded. When you're thinking about &lt;a href="https://rapidclaw.dev/blog/ai-agent-security-best-practices" rel="noopener noreferrer"&gt;agent security patterns&lt;/a&gt;, error classification also matters — permanent auth errors need different alerting than transient network hiccups.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retry Strategies That Don't Make Things Worse
&lt;/h2&gt;

&lt;p&gt;The naive approach — retry immediately, retry forever — is how you turn a rate limit into a ban. Exponential backoff with jitter is the baseline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retry_with_backoff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ErrorClassifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;permanent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;  &lt;span class="c1"&gt;# Don't retry permanent errors
&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;

            &lt;span class="n"&gt;delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_delay&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;jitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delay&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;jitter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two details matter here. Jitter prevents a thundering herd when multiple agents hit the same rate limit at once. And always cap your retries: 3 is usually enough. If it hasn't worked in 3 tries, it's not going to work in 30.&lt;/p&gt;
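
&lt;p&gt;Usage is just wrapping the call site. Here &lt;code&gt;call_llm&lt;/code&gt; stands in for whatever provider client you actually use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# call_llm is a placeholder for your provider client, not shown here
response = retry_with_backoff(
    lambda: call_llm(prompt),
    max_retries=3,
    base_delay=1.0,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
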

&lt;h2&gt;
  
  
  Circuit Breakers for LLM Calls
&lt;/h2&gt;

&lt;p&gt;Retries handle individual failures. Circuit breakers handle systemic ones. If your LLM provider is having a bad day, you don't want every request queuing up and timing out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CircuitBreaker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;failure_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recovery_time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;failure_threshold&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;recovery_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;recovery_time&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_failure_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;closed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# closed = normal, open = blocking
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_failure_time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;recovery_time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;half-open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;CircuitOpenError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Circuit breaker is open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;half-open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;closed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_failure_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I wrap every external LLM call in a circuit breaker. When the circuit opens, agents fall back to cached responses or simpler logic instead of piling up failures. If you're taking an &lt;a href="https://rapidclaw.dev/blog/ai-agent-observability" rel="noopener noreferrer"&gt;observability-first approach&lt;/a&gt;, you'll want to track circuit state transitions — they're one of the best early warning signals.&lt;/p&gt;
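
&lt;p&gt;If you want those transitions without touching the class above, a thin subclass that logs every state change is enough at small scale. A sketch using stdlib logging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import logging

log = logging.getLogger("circuit")

class ObservedCircuitBreaker(CircuitBreaker):
    """CircuitBreaker that logs state transitions for alerting."""

    def call(self, fn):
        before = self.state
        try:
            return super().call(fn)
        finally:
            if self.state != before:
                log.warning("circuit state changed from %s to %s", before, self.state)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
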

&lt;h2&gt;
  
  
  Fallback Chains: Your Safety Net
&lt;/h2&gt;

&lt;p&gt;When your primary model fails, having a fallback chain prevents total outage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;FALLBACK_CHAIN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached_response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_with_fallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FALLBACK_CHAIN&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;option&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;AllProvidersFailedError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; providers failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The chain degrades gracefully: premium model → cheaper model → cached/static response. Your users get &lt;em&gt;something&lt;/em&gt; even when everything is on fire.&lt;/p&gt;
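
&lt;p&gt;The chain assumes a &lt;code&gt;call_model&lt;/code&gt; dispatcher that knows what the &lt;code&gt;local&lt;/code&gt; pseudo-provider means. A minimal sketch of that piece (the cache is a plain dict, and &lt;code&gt;PROVIDER_CLIENTS&lt;/code&gt; is a hypothetical name-to-client mapping):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;RESPONSE_CACHE = {}  # maps prompt to the last good response

def call_model(provider, model, prompt):
    if provider == "local":
        if prompt in RESPONSE_CACHE:
            return RESPONSE_CACHE[prompt]
        raise KeyError("no cached response for this prompt")
    # PROVIDER_CLIENTS is a hypothetical {provider_name: client} mapping
    response = PROVIDER_CLIENTS[provider].complete(model=model, prompt=prompt)
    RESPONSE_CACHE[prompt] = response  # feed the cache for the next outage
    return response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
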

&lt;h2&gt;
  
  
  Timeout Handling
&lt;/h2&gt;

&lt;p&gt;LLM calls are slow. An agent waiting 120 seconds for a response that's never coming is wasting resources and blocking downstream work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_with_timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coro&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coro&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;TimeoutError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;TimeoutError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LLM call exceeded &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;timeout_seconds&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set aggressive timeouts. For most agent tasks, if you haven't gotten a response in 30 seconds, something is wrong. I default to 30s for completions and 10s for embeddings.&lt;/p&gt;
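
&lt;p&gt;At the call site that looks like this, with &lt;code&gt;client.complete&lt;/code&gt; standing in for your async client call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;async def summarize(client, prompt):
    # 30s for completions; drop to 10s when wrapping embedding calls
    return await call_with_timeout(client.complete(prompt), timeout_seconds=30)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
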

&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Here's how these patterns compose in a real agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;agent_execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;breaker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_circuit_breaker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;breaker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;retry_with_backoff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;call_with_fallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;AgentResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;CircuitOpenError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;AgentResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;degraded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;get_cached_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;note&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Using cached response - LLM circuit open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;AllProvidersFailedError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;AgentResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;note&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All providers unavailable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: every layer has a defined failure mode. Timeouts prevent hangs. Retries handle blips. Circuit breakers prevent cascading failures. Fallbacks provide degraded-but-functional responses.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Track
&lt;/h2&gt;

&lt;p&gt;Error handling is only useful if you know it's working. For my small setup, I track four signals (a minimal counter sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Error classification distribution&lt;/strong&gt; — am I seeing more transient or permanent errors?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Circuit breaker state changes&lt;/strong&gt; — how often are circuits opening?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fallback chain depth&lt;/strong&gt; — how far down the chain are requests going?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry success rate&lt;/strong&gt; — are retries actually recovering errors?&lt;/li&gt;
&lt;/ul&gt;
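
&lt;p&gt;A minimal version of that tracking with &lt;code&gt;prometheus_client&lt;/code&gt;, assuming you already scrape a metrics port somewhere (the metric names are mine):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from prometheus_client import Counter

ERRORS = Counter("agent_errors_total", "Errors by class", ["classification"])
CIRCUIT = Counter("agent_circuit_transitions_total", "Breaker state changes", ["to_state"])
FALLBACKS = Counter("agent_fallback_hits_total", "Requests served per chain position", ["provider"])
RETRIES = Counter("agent_retry_outcomes_total", "Retry attempts by outcome", ["outcome"])

# sprinkle at the call sites, e.g. inside the retry loop:
ERRORS.labels(classification="transient").inc()
RETRIES.labels(outcome="recovered").inc()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
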

&lt;p&gt;Having &lt;a href="https://rapidclaw.dev/features" rel="noopener noreferrer"&gt;real-time error monitoring&lt;/a&gt; changed how I build agents. Instead of finding out about failures from users, I catch patterns before they become outages.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Boring Truth
&lt;/h2&gt;

&lt;p&gt;None of these patterns are novel. Circuit breakers come from distributed systems. Retry with backoff is older than most of us. Fallback chains are just failover by another name.&lt;/p&gt;

&lt;p&gt;But applying them specifically to AI agents — where failures are probabilistic, responses are non-deterministic, and costs compound with every retry — that's where the craft is. Start with error classification, layer on retries, add circuit breakers, and build fallback chains. Your 3am self will thank you.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>errors</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>[2026] OpenTelemetry for LLM Observability — Self-Hosted Setup</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Fri, 17 Apr 2026 08:43:05 +0000</pubDate>
      <link>https://dev.to/rapidclaw/2026-opentelemetry-for-llm-observability-self-hosted-setup-335o</link>
      <guid>https://dev.to/rapidclaw/2026-opentelemetry-for-llm-observability-self-hosted-setup-335o</guid>
      <description>&lt;p&gt;I've been running a small AI automation shop — just me, a handful of agents, and a self-hosted stack that needs to stay observable without blowing the budget. When I started instrumenting my LLM pipelines, I found that most observability guides assumed you'd use a managed platform. But if you're like me and prefer to own your data and infrastructure, OpenTelemetry gives you a solid, vendor-neutral foundation.&lt;/p&gt;

&lt;p&gt;Here's what I've learned getting OpenTelemetry working for LLM agent traces on a self-hosted setup in 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why OpenTelemetry for LLM Workloads?
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry (OTel) has become the de facto standard for distributed tracing, metrics, and logs. The ecosystem matured significantly through 2025, and the semantic conventions for generative AI — covering LLM calls, token usage, model parameters — landed as stable in early 2026.&lt;/p&gt;

&lt;p&gt;For LLM workloads specifically, OTel gives you a few things that matter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trace continuity across agent steps.&lt;/strong&gt; When your agent calls an LLM, retrieves from a vector store, then calls another LLM, each step is a span in a single trace. You see the full chain, not just isolated API calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token and cost attribution.&lt;/strong&gt; The gen_ai semantic conventions include attributes like &lt;code&gt;gen_ai.usage.input_tokens&lt;/code&gt; and &lt;code&gt;gen_ai.usage.output_tokens&lt;/code&gt;, which let you track per-request costs without bolting on a separate billing layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vendor neutrality.&lt;/strong&gt; Whether you're calling OpenAI, Anthropic, or a local model via vLLM, the instrumentation shape is the same. Swap providers without rewriting your observability code.&lt;/p&gt;
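
&lt;p&gt;To show what the token attribution buys you: per-request cost is a lookup table away. A sketch (the per-million-token prices are placeholders, so check your provider's pricing page):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# USD per million tokens. Placeholder numbers, not current list prices.
PRICES = {
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def span_cost_usd(model, input_tokens, output_tokens):
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
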

&lt;h2&gt;
  
  
  The Self-Hosted Stack
&lt;/h2&gt;

&lt;p&gt;My setup is modest — a single VPS running the collection and storage layer, with agents deployed separately. Here's the architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Your LLM Agents]
       |
       v
[OTel Collector]  ← receives traces via OTLP/gRPC
       |
       v
[Tempo / Jaeger]  ← trace storage
[Prometheus]      ← metrics storage
[Grafana]         ← visualization
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you've looked at the &lt;a href="https://rapidclaw.dev/blog/openclaw-hosting-cost-self-host-vs-managed" rel="noopener noreferrer"&gt;self-hosted vs managed cost comparison&lt;/a&gt;, you know the economics are favorable when you're running fewer than five agents. The managed platforms charge per span or per seat, which adds up quickly even at small scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up the OTel Collector
&lt;/h2&gt;

&lt;p&gt;The Collector is the central hub. It receives telemetry from your agents, processes it, and exports to your storage backends. Here's a minimal config for LLM traces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# otel-collector-config.yaml&lt;/span&gt;
&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;protocols&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;grpc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4317&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4318&lt;/span&gt;

&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
    &lt;span class="na"&gt;send_batch_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;512&lt;/span&gt;
  &lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deployment.environment&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;upsert&lt;/span&gt;

&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp/tempo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tempo:4317&lt;/span&gt;
    &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;insecure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:8889&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;attributes&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp/tempo&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing exotic here. The batch processor keeps things efficient, and we're exporting traces to Tempo and metrics to Prometheus. If you want a deeper walkthrough on getting this into production, the &lt;a href="https://rapidclaw.dev/blog/deploy-openclaw-production-guide" rel="noopener noreferrer"&gt;production deployment guide&lt;/a&gt; covers Docker Compose configs and health checks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Instrumenting LLM Calls
&lt;/h2&gt;

&lt;p&gt;The actual instrumentation depends on your language and SDK. I'll show Python since that's what most agent code runs on.&lt;/p&gt;

&lt;p&gt;First, install the packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;opentelemetry-api opentelemetry-sdk &lt;span class="se"&gt;\&lt;/span&gt;
  opentelemetry-exporter-otlp-proto-grpc &lt;span class="se"&gt;\&lt;/span&gt;
  opentelemetry-instrumentation-requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then set up a tracer and wrap your LLM calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TracerProvider&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace.export&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BatchSpanProcessor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.exporter.otlp.proto.grpc.trace_exporter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OTLPSpanExporter&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize
&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TracerProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;exporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OTLPSpanExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://your-collector:4317&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;insecure&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_span_processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exporter&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tracer_provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm.call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.request.model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.request.max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;your_llm_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.usage.input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.usage.output_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.response.model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key is using the &lt;code&gt;gen_ai.*&lt;/code&gt; semantic conventions consistently. This means your Grafana dashboards, alerts, and queries work the same regardless of which model or provider you're hitting.&lt;/p&gt;
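
&lt;p&gt;For example, pointing the same wrapper at a second, hypothetical OpenAI-backed client (&lt;code&gt;your_openai_client&lt;/code&gt; is a placeholder in the same spirit as &lt;code&gt;your_llm_client&lt;/code&gt; above) changes only the attribute values, never the attribute names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def call_llm_openai(prompt, model="gpt-4o-mini"):
    with tracer.start_as_current_span("llm.call") as span:
        # Identical attribute names to the Anthropic path above, so
        # every dashboard and query keyed on gen_ai.* keeps working.
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", model)

        response = your_openai_client.complete(prompt=prompt, model=model)

        span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)
        return response.content
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
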

&lt;h2&gt;
  
  
  Tracing Multi-Step Agent Workflows
&lt;/h2&gt;

&lt;p&gt;Where this gets really useful is tracing a full agent workflow. Each tool call, retrieval step, and LLM invocation becomes a child span:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent.run&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent.task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 1: retrieve context
&lt;/span&gt;        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieval.vector_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search_vector_store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 2: call LLM with context
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Context: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 3: maybe call a tool
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;needs_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool.execute&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tool_span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;tool_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;tool_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tool result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Original task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you view this in Grafana via Tempo, you get a waterfall trace showing exactly where time was spent — was it the vector search? The first LLM call? The tool execution? This is the kind of visibility that makes debugging agent behavior tractable instead of guesswork.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Actually See in the Dashboard
&lt;/h2&gt;

&lt;p&gt;Once everything is wired up, your &lt;a href="https://rapidclaw.dev/features" rel="noopener noreferrer"&gt;self-hosted observability dashboard&lt;/a&gt; shows you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency breakdown per agent step&lt;/strong&gt; — which spans are slow, and whether it's network or model inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token usage over time&lt;/strong&gt; — catch runaway prompts before they drain your API budget&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error rates by model/provider&lt;/strong&gt; — spot degraded model endpoints early&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trace search&lt;/strong&gt; — find the exact trace where an agent went off the rails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a solo operator running a few agents, this level of visibility is the difference between confidently shipping agent workflows and crossing your fingers every deploy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rough Edges and Honest Takes
&lt;/h2&gt;

&lt;p&gt;A few things that are still annoying in 2026:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto-instrumentation for LLM SDKs is patchy.&lt;/strong&gt; The OpenAI Python SDK has decent OTel support now, but Anthropic's is still experimental. You'll likely write some manual spans.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trace volume can surprise you.&lt;/strong&gt; Agents that loop — retries, multi-turn conversations — generate a lot of spans. Set up sampling early. A simple tail-based sampler that keeps error traces and samples 10% of success traces works well.&lt;/p&gt;
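
&lt;p&gt;The "keep errors" half of that policy has to live in the Collector's &lt;code&gt;tail_sampling&lt;/code&gt; processor, since the SDK decides before a span finishes. What you &lt;em&gt;can&lt;/em&gt; do in-process is head-based ratio sampling; a minimal sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-based 10% sampling: the decision happens when the trace starts,
# so it cannot preferentially keep errors. Pair it with the Collector's
# tail_sampling processor for the error-retention half of the policy.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
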

&lt;p&gt;&lt;strong&gt;Grafana dashboards take time to build.&lt;/strong&gt; The &lt;code&gt;gen_ai&lt;/code&gt; semantic conventions are new enough that there aren't many pre-built dashboards. Budget an afternoon to set up your panels.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry for LLM observability isn't a silver bullet, but it's the most practical foundation I've found for self-hosted setups. The semantic conventions are mature enough to use in production, the Collector is rock-solid, and the cost of running your own Tempo + Grafana stack is a fraction of what you'd pay for a managed platform.&lt;/p&gt;

&lt;p&gt;If you're running a handful of agents and want to actually understand what they're doing, this stack is worth the setup time.&lt;/p&gt;

</description>
      <category>opentelemetry</category>
      <category>ai</category>
      <category>observability</category>
      <category>llm</category>
    </item>
    <item>
      <title>[Guide] How to Debug AI Agents in Production</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Fri, 17 Apr 2026 08:42:31 +0000</pubDate>
      <link>https://dev.to/rapidclaw/guide-how-to-debug-ai-agents-in-production-4bh4</link>
      <guid>https://dev.to/rapidclaw/guide-how-to-debug-ai-agents-in-production-4bh4</guid>
      <description>&lt;p&gt;I run a small outfit — a few AI agents handling tasks like lead qualification, document processing, and customer support triage. Nothing at massive scale. But even with just a handful of agents in production, debugging them has been one of the hardest parts of the job.&lt;/p&gt;

&lt;p&gt;Traditional software bugs are predictable. An agent bug? It might only surface when a specific combination of user input, API latency, and model temperature aligns just right. Here's what I've learned about debugging AI agents in the real world.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Agent Debugging
&lt;/h2&gt;

&lt;p&gt;When a regular API endpoint fails, you get a status code and a stack trace. When an agent fails, you might get... a confidently wrong answer. Or a tool call loop. Or a response that technically works but costs $4.50 because it made 47 unnecessary API calls.&lt;/p&gt;

&lt;p&gt;The core challenge is that agents are non-deterministic systems making autonomous decisions. You can't just write a unit test that covers every scenario. You need a different approach entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scenario 1: The Silent Wrong Answer
&lt;/h2&gt;

&lt;p&gt;This is the scariest failure mode. Your agent completes its task, returns a result, and everyone moves on — except the result is wrong.&lt;/p&gt;

&lt;p&gt;I had a document processing agent that was supposed to extract invoice amounts. It worked great for months until a client started sending invoices with a slightly different format. The agent still extracted numbers confidently, but they were line item totals instead of invoice totals. No error, no warning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What helped:&lt;/strong&gt; Adding assertion checks on agent outputs. Not just "did it return something" but "does this value fall within expected ranges." I also started logging the full reasoning chain so I could audit decisions after the fact. Having solid &lt;a href="https://rapidclaw.dev/blog/ai-agent-observability" rel="noopener noreferrer"&gt;agent observability&lt;/a&gt; in place made it possible to catch these kinds of drift issues before they compounded.&lt;/p&gt;
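
&lt;p&gt;A minimal sketch of the kind of range check I mean, using the invoice example (the 5% tolerance and function name are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Output assertion: the extracted total must be positive and close
# to the sum of the line items, or the task gets flagged for review.
def check_invoice_total(extracted_total, line_item_totals):
    expected = sum(line_item_totals)
    if extracted_total &amp;lt;= 0:
        raise ValueError(f"Non-positive invoice total: {extracted_total}")
    if abs(extracted_total - expected) &amp;gt; 0.05 * expected:
        raise ValueError(
            f"Total {extracted_total} is far from line-item sum "
            f"{expected}; flagging for human review"
        )
    return extracted_total
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
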

&lt;h2&gt;
  
  
  Scenario 2: The Runaway Tool Call Loop
&lt;/h2&gt;

&lt;p&gt;Agents that can call tools will sometimes get stuck in loops. Call tool A, get a result, decide it needs to call tool A again with slightly different parameters, repeat forever.&lt;/p&gt;

&lt;p&gt;This usually happens when the agent's prompt doesn't clearly define exit conditions, or when a tool returns ambiguous results that the agent keeps trying to "fix."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What helped:&lt;/strong&gt; Implementing hard limits on tool call counts per session. I cap mine at 15 calls per task — if an agent hits that limit, it stops and flags for human review. I also started using tracing to visualize the full sequence of tool calls. Being able to &lt;a href="https://rapidclaw.dev/features" rel="noopener noreferrer"&gt;trace agent tool calls&lt;/a&gt; in a timeline view made it immediately obvious when an agent was spinning its wheels.&lt;/p&gt;
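
&lt;p&gt;The cap itself is only a few lines. A sketch with the 15-call limit baked in (raising an exception is just one way to surface it):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Per-task tool-call budget: at the limit, the task stops and gets
# flagged for human review instead of looping forever.
class ToolCallBudget:
    def __init__(self, max_calls=15):
        self.max_calls = max_calls
        self.calls = 0

    def check(self, tool_name):
        self.calls += 1
        if self.calls &amp;gt; self.max_calls:
            raise RuntimeError(
                f"Hit the {self.max_calls}-call budget at '{tool_name}'; "
                f"flagging task for human review"
            )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
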

&lt;h2&gt;
  
  
  Scenario 3: Cascading Failures Across Agents
&lt;/h2&gt;

&lt;p&gt;When you have multiple agents that depend on each other, a failure in one can cascade in unexpected ways. Agent A summarizes a document, Agent B uses that summary to make a decision, Agent C acts on that decision. If Agent A's summary is subtly off, you get a game of telephone that ends badly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What helped:&lt;/strong&gt; Treating agent handoffs like API contracts. Each agent validates its inputs before proceeding. I also added trace IDs that follow a request across all agents, so when something goes wrong at the end of a chain, I can trace it back to the originating agent.&lt;/p&gt;
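
&lt;p&gt;The "contract" can be as small as a dataclass that refuses bad input and carries the trace ID forward (the field names here are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import uuid
from dataclasses import dataclass, field

# Handoff contract between agents: the receiving agent validates
# before acting, and the trace_id rides along unchanged.
@dataclass
class Handoff:
    summary: str
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def validate(self):
        if not self.summary.strip():
            raise ValueError(f"[{self.trace_id}] empty summary from upstream agent")
        return self
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
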

&lt;h2&gt;
  
  
  Practical Log Analysis Patterns
&lt;/h2&gt;

&lt;p&gt;Here are the patterns I actually use day-to-day:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Structured logging with context.&lt;/strong&gt; Every agent action gets logged with: the task ID, the agent name, the tool being called, input parameters, output summary, latency, and token count. JSON-structured logs make it possible to query across all these dimensions later.&lt;/p&gt;
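
&lt;p&gt;A sketch of what one of those entries looks like in code (the helper is illustrative; the fields mirror the list above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import logging
import time

logger = logging.getLogger("agent")

# One JSON line per agent action, queryable on every dimension later.
# Callers record started_at = time.monotonic() before the action runs.
def log_action(task_id, agent_name, tool, params, output_summary,
               started_at, tokens):
    logger.info(json.dumps({
        "task_id": task_id,
        "agent": agent_name,
        "tool": tool,
        "params": params,
        "output_summary": output_summary,
        "latency_ms": round((time.monotonic() - started_at) * 1000),
        "tokens": tokens,
    }))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
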

&lt;p&gt;&lt;strong&gt;2. Diff logging for retries.&lt;/strong&gt; When an agent retries a tool call, log what changed between attempts. This is usually where bugs hide — the agent is trying to correct something but its correction strategy is wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Cost tracking per task.&lt;/strong&gt; This might sound like a finance concern, not a debugging one, but unexpected cost spikes are one of the best early warning signals. If a task that normally costs $0.03 suddenly costs $0.30, something changed in the agent's behavior. I use a simple calculator to &lt;a href="https://rapidclaw.dev/tools/cost-calculator" rel="noopener noreferrer"&gt;estimate debugging overhead costs&lt;/a&gt; and set alerts when any task exceeds 3x its rolling average.&lt;/p&gt;
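
&lt;p&gt;The rolling-average alarm is simple enough to sketch (the window size is a guess; the 3x multiplier is the rule above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import deque

# Flags any task whose cost exceeds 3x the rolling average of recent
# tasks. Tune the window to your task volume.
class CostAlarm:
    def __init__(self, window=50, multiplier=3.0):
        self.history = deque(maxlen=window)
        self.multiplier = multiplier

    def record(self, task_cost):
        """Return True if this task's cost should trigger an alert."""
        alert = (len(self.history) &amp;gt; 0 and
                 task_cost &amp;gt; self.multiplier * sum(self.history) / len(self.history))
        self.history.append(task_cost)
        return alert
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
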

&lt;p&gt;&lt;strong&gt;4. Output sampling.&lt;/strong&gt; Randomly sample 5-10% of agent outputs for human review. This catches the silent wrong answers that no automated check will find.&lt;/p&gt;

&lt;h2&gt;
  
  
  Handling Production Incidents
&lt;/h2&gt;

&lt;p&gt;When something breaks in production with an agent, here's my playbook:&lt;/p&gt;

&lt;p&gt;First, check the trace for that specific request. Look at every tool call, every decision point. Usually the problem is obvious once you can see the full sequence.&lt;/p&gt;

&lt;p&gt;Second, check if the failure is reproducible. With agents, sometimes it is and sometimes it isn't — the same input might produce different behavior on the next run. If it's not reproducible, you need to look at what external state might have contributed (API responses, database state, etc.).&lt;/p&gt;

&lt;p&gt;Third, check for upstream changes. Did an API you depend on change its response format? Did someone update the system prompt? Did the model provider do a quiet update? These are the most common root causes in my experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools and Setup That Actually Help
&lt;/h2&gt;

&lt;p&gt;You don't need an elaborate observability stack. Here's what I actually run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structured JSON logs shipped to a searchable store&lt;/li&gt;
&lt;li&gt;Trace IDs that propagate across agent boundaries&lt;/li&gt;
&lt;li&gt;Hard limits on tool calls, tokens, and cost per task&lt;/li&gt;
&lt;li&gt;Automated output validation with sensible thresholds&lt;/li&gt;
&lt;li&gt;A weekly sample review of agent outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight is that agent debugging is more like debugging a distributed system than debugging a single program. You need traces, not just logs. You need to see the full picture of what an agent decided, why, and what happened next.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Debugging AI agents in production is genuinely hard, and I don't think anyone has it fully figured out yet. But the basics — good logging, tracing, output validation, and cost monitoring — go a long way. Start with those, and add complexity only when you hit a problem that the basics can't solve.&lt;/p&gt;

&lt;p&gt;If you're running agents in production too, I'd love to hear what patterns have worked for you. Drop a comment or find me on Twitter.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>debugging</category>
      <category>observability</category>
      <category>devops</category>
    </item>
    <item>
      <title>Self-Hosting AI Agents vs Managed: Honest Trade-offs From the Trenches</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Tue, 14 Apr 2026 10:55:13 +0000</pubDate>
      <link>https://dev.to/rapidclaw/self-hosting-ai-agents-vs-managed-honest-trade-offs-from-the-trenches-jmm</link>
      <guid>https://dev.to/rapidclaw/self-hosting-ai-agents-vs-managed-honest-trade-offs-from-the-trenches-jmm</guid>
<description>&lt;p&gt;A few months in, I keep coming back to the same conversation with people building on agents: should you self-host, or just pay someone to run them for you? It sounds like a procurement question. In practice it's a question about how much weirdness you're willing to live with, and how much of the weirdness you want to be &lt;em&gt;yours&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I run a small AI agent service called RapidClaw. My brother Brandon is the tech lead and we cap the number of concurrent agents we run at five — not as a marketing line, as an honest constraint. Five is the number where I can still look at every trace, name every memory key, and tell you what each agent did yesterday. Past that, I start lying to myself about what I actually understand. So I'd rather be small and clear than big and fuzzy.&lt;/p&gt;

&lt;p&gt;That bias colors everything below. I'm not trying to talk anyone out of using a managed platform. I'm trying to write down the trade-offs the way I actually experienced them, in case it saves someone a weekend.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest pitch for managed
&lt;/h2&gt;

&lt;p&gt;If you've never run an agent in production, start managed. I mean it. The boring stuff — retries, queueing, an eval harness, secret rotation, log shipping, a UI someone other than you can use — is six to eight weeks of work that doesn't move your product forward. You're paying a managed provider to skip that, and skipping it is correct when the agent isn't yet the thing your customers love.&lt;/p&gt;

&lt;p&gt;The catch is that "managed" is doing a lot of work in that sentence. There's managed-as-in-hosted (your prompts, their runtime), and there's managed-as-in-opinionated (their prompts, their runtime, their memory model). The second kind feels great in week one and starts to chafe in week six, when you realize you can't see why an agent decided what it decided, and your only recourse is a support ticket.&lt;/p&gt;

&lt;p&gt;The questions I'd push on before signing anything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can I export every trace, every tool call, every memory write — as JSON, on demand, without a CSV button hidden three menus deep?&lt;/li&gt;
&lt;li&gt;When a run fails, do I get the actual model response, or a sanitized "something went wrong"?&lt;/li&gt;
&lt;li&gt;If I want to swap the underlying model next quarter, is that a config change or a rewrite?&lt;/li&gt;
&lt;li&gt;What does the bill look like at 10x my current usage? At 100x? Is it linear, or is there a cliff at the "enterprise" tier?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answers are clean, managed is a fine home for a long time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I ended up self-hosting anyway
&lt;/h2&gt;

&lt;p&gt;For RapidClaw the deciding factor wasn't cost. It was the loop time on debugging. We were chasing a memory bug where an agent kept hallucinating a customer's preferred timezone. On a managed runtime I could see the final output and the tool calls, but not the actual sequence of memory reads the agent did before responding. Two days of poking later, Brandon stood up a small local runtime and we found it in twenty minutes — the agent was reading a stale snapshot because the memory write from the prior turn hadn't been flushed before the next read.&lt;/p&gt;

&lt;p&gt;That's the kind of bug you can only catch when you can stop the world and look at it. Managed platforms are getting better at this, but "better" is not the same as "I can drop a print statement wherever I want."&lt;/p&gt;

&lt;p&gt;The other thing that pushed us was the customer mix. Most of our customers want their agents running in their own VPC, on their own keys, talking to their own internal data. "Send your data to our SaaS" is a non-starter for them. So a &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;self-hosted setup&lt;/a&gt; wasn't a nice-to-have — it was the product.&lt;/p&gt;

&lt;h2&gt;
  
  
  What self-hosting actually costs (the parts nobody warns you about)
&lt;/h2&gt;

&lt;p&gt;Compute is the easy line item. Even the embarrassingly inefficient version of running five agents on a single mid-tier box costs less per month than one good lunch. That's not where the bill is.&lt;/p&gt;

&lt;p&gt;Where the bill is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability.&lt;/strong&gt; You will reinvent some version of structured tracing for agent steps. Tool call in, model response out, memory delta, retry attempts, token counts. You can lean on OpenTelemetry, but the agent-shaped semantics are still yours to define. Budget two weeks the first time and another week every quarter to keep it honest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Eval harness.&lt;/strong&gt; Without a managed eval surface, you need to build the small, ugly version yourself. A folder of scenarios, a runner that hits each one, a diff viewer for outputs. It can be a hundred lines of Python. It cannot be zero lines of Python.&lt;/p&gt;
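
&lt;p&gt;For what it's worth, the hundred-line version starts out looking something like this (the &lt;code&gt;scenarios/&lt;/code&gt; layout is an assumption, not a prescription):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import pathlib

# The small, ugly eval runner: one JSON file per scenario, each with
# "input" and "expected" keys, diffed against the agent's output.
def run_evals(agent, scenario_dir="scenarios"):
    failures = 0
    for path in sorted(pathlib.Path(scenario_dir).glob("*.json")):
        case = json.loads(path.read_text())
        got = agent(case["input"])
        if got == case["expected"]:
            print(f"OK   {path.name}")
        else:
            failures += 1
            print(f"DIFF {path.name}")
            print(f"  expected: {case['expected']!r}")
            print(f"  got:      {got!r}")
    return failures
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
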

&lt;p&gt;&lt;strong&gt;On-call.&lt;/strong&gt; The first time an agent loops forever at 3am, you find out whether you have an on-call rotation. We didn't. Now we do. It's two people taking turns and a PagerDuty free tier, but it exists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory state, specifically.&lt;/strong&gt; This is the one I underestimated most. Agents that hold any state across turns — which is most useful agents — turn small bugs in your memory layer into very weird behavior in the model layer. I now spend more time thinking about how memory is read, written, snapshotted, and pruned than I spend thinking about prompts. If I were starting again, I'd build the memory inspector before I built the second agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The middle path most people end up at
&lt;/h2&gt;

&lt;p&gt;Almost no team I've talked to runs a pure managed or pure self-hosted setup for long. The shape that keeps emerging is: managed for the orchestration and the model gateway, self-hosted for the memory and the tool layer. You give up a little observability on the orchestration side, you keep all the observability on the parts where bugs actually live.&lt;/p&gt;

&lt;p&gt;That hybrid is what we ended up shipping for our own customers. The &lt;a href="https://app.rapidclaw.dev" rel="noopener noreferrer"&gt;agent dashboard&lt;/a&gt; runs as a managed service so people don't have to host a UI, but the agents themselves run wherever the customer wants — their cluster, their keys, their VPC. It's not the cleanest architecture story, and it took us longer than I'd like to admit to stop apologizing for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd tell past-me
&lt;/h2&gt;

&lt;p&gt;Three things, none of them clever:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick the option that makes your debugging loop shorter, not the one that makes your slide deck better. If you can't see why an agent did what it did, you don't have an agent — you have a wishing well.&lt;/li&gt;
&lt;li&gt;Cap the number of concurrent agents at a number you can mentally model. For us that's five. For a bigger team it might be twenty. It is almost certainly not "as many as the platform supports."&lt;/li&gt;
&lt;li&gt;Write the boring runbooks early. The retry policy, the memory snapshot policy, the rollback procedure. They feel like overkill until the first real outage, after which they feel like the only adult thing in the room.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If any of this is useful, or if you want to compare notes on what you've broken, I'd love to hear about it — &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;the RapidClaw team&lt;/a&gt; is small enough that you'll get an actual human, probably me or Brandon.&lt;/p&gt;

&lt;p&gt;We're still figuring this out. I just wanted to write down what we've found so far, while it still feels true.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>selfhosting</category>
      <category>devops</category>
    </item>
    <item>
      <title>Why We Built a Managed Platform for OpenClaw Agents (And What We Learned)</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Mon, 13 Apr 2026 02:41:43 +0000</pubDate>
      <link>https://dev.to/rapidclaw/why-we-built-a-managed-platform-for-openclaw-agents-and-what-we-learned-570l</link>
      <guid>https://dev.to/rapidclaw/why-we-built-a-managed-platform-for-openclaw-agents-and-what-we-learned-570l</guid>
      <description>&lt;p&gt;We spent six months wrestling with deploying AI agents before we decided to just build the thing ourselves. This is that story — the ugly parts included.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Everyone's building AI agents right now. The demos look incredible. You wire up some tools, connect an LLM, and suddenly you've got an agent that can research, plan, and execute tasks autonomously.&lt;/p&gt;

&lt;p&gt;Then you try to put it in production.&lt;/p&gt;

&lt;p&gt;Suddenly you're dealing with container orchestration, secret management, scaling workers up and down, monitoring token spend, handling failures gracefully, and figuring out why your agent decided to retry the same API call 47 times at 3am.&lt;/p&gt;

&lt;p&gt;We were building on &lt;a href="https://rapidclaw.dev/blog" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; — an open-source agent framework that we really liked because it didn't try to do too much. It gave you the primitives and got out of the way. But "getting out of the way" also meant we were on our own for everything else.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Running Agents in Production Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;Here's a simplified version of what our deploy pipeline looked like before RapidClaw existed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Our old "deploy an agent" workflow (simplified, but not by much)&lt;/span&gt;
&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build agent container&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker build -t agent-${{ agent.name }} .&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Push to registry&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker push $REGISTRY/agent-${{ agent.name }}&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Update k8s deployment&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;kubectl set image deployment/$AGENT_NAME \&lt;/span&gt;
        &lt;span class="s"&gt;agent=$REGISTRY/agent-${{ agent.name }}:$SHA&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure secrets&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;kubectl create secret generic agent-secrets \&lt;/span&gt;
        &lt;span class="s"&gt;--from-literal=OPENAI_KEY=${{ secrets.OPENAI }} \&lt;/span&gt;
        &lt;span class="s"&gt;--from-literal=ANTHROPIC_KEY=${{ secrets.ANTHROPIC }} \&lt;/span&gt;
        &lt;span class="s"&gt;# ... 12 more provider keys&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up monitoring&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;# Prometheus config, Grafana dashboards, &lt;/span&gt;
      &lt;span class="s"&gt;# alerting rules, log aggregation...&lt;/span&gt;
      &lt;span class="s"&gt;# This alone was 200+ lines of YAML&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the happy path. We're not even talking about rollback strategies, canary deployments, or what happens when your agent starts hallucinating and burning through your API budget at 2x the normal rate.&lt;/p&gt;

&lt;p&gt;We had an incident early on where an agent got stuck in a loop generating images. By the time we noticed, it had burned through about $400 in API calls in under an hour. That was our wake-up call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why OpenClaw
&lt;/h2&gt;

&lt;p&gt;We evaluated a bunch of agent frameworks. Most of them wanted to own your entire stack — your prompts, your tool definitions, your execution model, everything.&lt;/p&gt;

&lt;p&gt;OpenClaw was different. It's more like a protocol than a framework. You define your agent's capabilities, wire up your tools, and it handles the execution loop. But it's deliberately minimal about infrastructure opinions.&lt;/p&gt;

&lt;p&gt;That minimalism is what attracted us, and also what made us realize there was a gap. OpenClaw gives you a great way to &lt;em&gt;build&lt;/em&gt; agents. It doesn't give you a great way to &lt;em&gt;run&lt;/em&gt; them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What RapidClaw Does Differently
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;RapidClaw&lt;/a&gt; is basically the managed infrastructure layer that sits underneath your OpenClaw agents. Think of it as the platform that handles all the boring-but-critical stuff:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deploy flow (what it looks like now):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│  Your Agent  │────▶│  RapidClaw   │────▶│   Production    │
│  (OpenClaw)  │     │   Platform   │     │   Environment   │
└─────────────┘     └──────────────┘     └─────────────────┘
       │                    │                      │
       │              ┌─────┴─────┐          ┌─────┴─────┐
       │              │ Secrets   │          │ Auto-scale │
       │              │ Mgmt      │          │ Monitor    │
       │              │ Isolation  │          │ Cost caps  │
       │              │ Versioning │          │ Rollback   │
       │              └───────────┘          └───────────┘
       │
  rapidclaw deploy my-agent --env production
  # That's it. One command.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The whole point is that you focus on your agent logic — what tools it has, how it reasons, what it's good at — and we handle the infrastructure. Secrets get injected securely, scaling happens automatically, and if your agent starts going off the rails, cost caps kick in before your cloud bill becomes a horror story.&lt;/p&gt;

&lt;p&gt;You can dig into the &lt;a href="https://rapidclaw.dev/security" rel="noopener noreferrer"&gt;security model&lt;/a&gt; if you want the details on how we handle isolation and secret management. It was one of the hardest parts to get right.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Learned (The Honest Version)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Agents fail in weird ways.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional software fails predictably. API returns 500, you handle it. Database times out, you retry. Agents fail &lt;em&gt;creatively&lt;/em&gt;. They'll find edge cases in your tools you never imagined. They'll interpret instructions in ways that are technically correct but completely wrong. Building good guardrails is less about error handling and more about understanding the problem space deeply enough to anticipate creative failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Cost management is a first-class concern.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn't like running a web server where your costs are roughly proportional to traffic. Agent costs can spike 10x in minutes if the agent decides it needs to "think harder" about something. We built per-agent budgets, per-session caps, and anomaly detection into the platform from day one. Should have done it from day negative-one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Observability for agents is fundamentally different.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can't just look at request/response logs. You need to see the agent's reasoning chain, understand why it chose one tool over another, and track how its behavior drifts over time. We built a trace viewer that shows the full execution tree — every tool call, every LLM interaction, every decision point. It's the feature our users care about most, and it was an afterthought in our original design. Embarrassing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The open-source community taught us more than we expected.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We initially built RapidClaw as a purely internal tool. OpenClaw contributors kept asking us how we were running agents in production, and their questions shaped about 60% of our roadmap. Turns out the problems we were solving weren't unique to us — they were universal. That community feedback loop was the single most valuable thing in our development process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. You will underestimate state management.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Agents that run for minutes or hours need persistent state. They need checkpointing. They need the ability to resume after failures. And they need all of that without you having to think about it as an agent developer. Getting this right took us three complete rewrites. Three. We're still not 100% happy with it.&lt;/p&gt;
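
&lt;p&gt;The shape of the problem, reduced to a sketch (the file layout is hypothetical, and the real version has to survive concurrent agents, partial writes, and schema changes):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import pathlib

# Naive checkpoint/resume for a long-running agent loop. The real
# thing needs atomic writes and versioned state; this is the idea.
def save_checkpoint(path, step, state):
    pathlib.Path(path).write_text(json.dumps({"step": step, "state": state}))

def load_checkpoint(path):
    p = pathlib.Path(path)
    if p.exists():
        snap = json.loads(p.read_text())
        return snap["step"], snap["state"]
    return 0, {}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
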

&lt;h2&gt;
  
  
  Where We Are Now
&lt;/h2&gt;

&lt;p&gt;RapidClaw is running in production for a handful of teams. It's not perfect — our documentation needs work, our onboarding could be smoother, and there are definitely edge cases we haven't hit yet.&lt;/p&gt;

&lt;p&gt;But the core loop works: write your OpenClaw agent, push it to RapidClaw, and it runs reliably in production with monitoring, scaling, and cost management built in. No more 200-line YAML files. No more 3am incidents because an agent went rogue.&lt;/p&gt;

&lt;p&gt;If you're running OpenClaw agents (or thinking about it), I'd genuinely love to hear how you're handling the infrastructure side. We're at &lt;a href="https://rapidclaw.dev/try" rel="noopener noreferrer"&gt;rapidclaw.dev/try&lt;/a&gt; if you want to kick the tires.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What's the gnarliest production issue you've hit with AI agents?&lt;/strong&gt; I'll bet we've either seen it too or it'll end up on our roadmap. Drop it in the comments — I read every single one.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How I Cut Our AI Agent Token Costs by 73% Without Sacrificing Quality</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Mon, 13 Apr 2026 02:16:50 +0000</pubDate>
      <link>https://dev.to/rapidclaw/how-i-cut-our-ai-agent-token-costs-by-73-without-sacrificing-quality-31pn</link>
      <guid>https://dev.to/rapidclaw/how-i-cut-our-ai-agent-token-costs-by-73-without-sacrificing-quality-31pn</guid>
      <description>&lt;p&gt;Every month I'd open our cloud billing dashboard and wince. Running AI agents in production at &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;RapidClaw&lt;/a&gt; meant our token costs were climbing faster than our revenue. Sound familiar?&lt;/p&gt;

&lt;p&gt;After three months of aggressive optimization, we cut our monthly token spend by 73% while actually &lt;em&gt;improving&lt;/em&gt; agent response quality. Here's exactly how we did it — no vague advice, just the specific techniques that moved the needle.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Death by a Thousand Tokens
&lt;/h2&gt;

&lt;p&gt;When you're running AI agents that handle real workloads — deployment automation, infrastructure monitoring, code review — every unnecessary token adds up. Our agents were processing ~2M tokens per day across various tasks. At GPT-4-class pricing, that's not pocket change.&lt;/p&gt;

&lt;p&gt;The root causes were predictable once we actually measured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bloated system prompts&lt;/strong&gt; copy-pasted across agents (avg 2,400 tokens each)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No caching layer&lt;/strong&gt; — identical queries hitting the LLM every time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redundant context&lt;/strong&gt; stuffed into every request "just in case"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrong model for the job&lt;/strong&gt; — using frontier models for classification tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Strategy 1: Prompt Compression (Saved ~30%)
&lt;/h2&gt;

&lt;p&gt;The biggest win was the simplest. We audited every system prompt and applied aggressive compression.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# BEFORE: 847 tokens
&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT_BEFORE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are a helpful deployment assistant for our cloud infrastructure.
You should help users deploy their applications to our Kubernetes cluster.
You have access to kubectl commands and can help troubleshoot issues.
When a user asks you to deploy something, you should first check if 
the namespace exists, then validate the manifest, then apply it.
You should always be polite and professional in your responses.
You should explain what you&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re doing at each step.
If something goes wrong, provide clear error messages and suggestions.
Always confirm before making destructive changes.
Remember to check resource limits and quotas before deploying.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# AFTER: 196 tokens
&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT_AFTER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Role: K8s deployment agent.
Tools: kubectl
Flow: check namespace → validate manifest → apply
Rules: confirm destructive ops, check resource quotas, explain steps
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same behavior, 77% fewer tokens. The key insight: LLMs don't need the verbose instructions we think they do. They need &lt;em&gt;structured, precise constraints&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We built a simple compression pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;audit_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;enc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encoding_for_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;enc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Flag prompts over 500 tokens for review
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;needs_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;estimated_daily_cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;CALLS_PER_DAY&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;COST_PER_TOKEN&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Run this on every agent prompt quarterly
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;get_all_agents&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;audit_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;needs_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⚠️  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;token_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tokens &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
              &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;($&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;estimated_daily_cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/day)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Strategy 2: Semantic Caching (Saved ~25%)
&lt;/h2&gt;

&lt;p&gt;This was the highest-ROI engineering investment. We added a semantic similarity cache in front of our LLM calls.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Redis&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SemanticCache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redis_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;similarity_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;similarity_threshold&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Use a cheap embedding model — not the expensive LLM.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="c1"&gt;# text-embedding-3-small costs ~$0.02/1M tokens
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;embed_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lookup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;query_emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Check against recent cached queries
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan_iter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache:emb:*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;cached_emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;frombuffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_emb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cached_emb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_emb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached_emb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;response_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;emb:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resp:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_key&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;key_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()[:&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache:emb:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key_hash&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tobytes&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache:resp:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key_hash&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
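
&lt;p&gt;To make the shape concrete, here's roughly how a call site wraps an LLM call with the cache. The names here (&lt;code&gt;SemanticCache&lt;/code&gt;, &lt;code&gt;lookup&lt;/code&gt;, &lt;code&gt;call_llm&lt;/code&gt;) are stand-ins rather than our exact internals:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical wiring: class, method, and client names are placeholders.
cache = SemanticCache(threshold=0.95)

def answer(query: str) -&amp;gt; str:
    cached = cache.lookup(query)
    if cached is not None:
        return cached            # hit: no model call, no token spend
    response = call_llm(query)   # whatever LLM client you already use
    cache.store(query, response, ttl=3600)
    return response
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;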



&lt;p&gt;The 0.95 similarity threshold was critical. Too low and you serve cached answers to queries that merely look similar, which means wrong responses. Too high and your cache hit rate tanks. We tuned this per agent type — deployment agents got 0.97 (precision matters), monitoring summarizers got 0.92 (more tolerance for variation).&lt;/p&gt;
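
&lt;p&gt;In practice that tuning lived in a small config map. A minimal sketch, with illustrative agent-type keys and the numbers from above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Per-agent-type similarity thresholds; keys are illustrative.
SIMILARITY_THRESHOLDS = {
    "deployment": 0.97,   # precision matters: a wrong cached answer is expensive
    "monitoring": 0.92,   # summaries tolerate paraphrased queries
}
DEFAULT_THRESHOLD = 0.95

def threshold_for(agent_type: str) -&amp;gt; float:
    return SIMILARITY_THRESHOLDS.get(agent_type, DEFAULT_THRESHOLD)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;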

&lt;p&gt;&lt;strong&gt;Cache hit rates after one week:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Infrastructure status queries: 67% hit rate&lt;/li&gt;
&lt;li&gt;Deployment validation: 41% hit rate&lt;/li&gt;
&lt;li&gt;Code review suggestions: 12% hit rate (queries too unique to repeat, as expected)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Strategy 3: Model Routing (Saved ~18%)
&lt;/h2&gt;

&lt;p&gt;Not every task needs a frontier model. We built a lightweight router that directs requests to the cheapest capable model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MODEL_TIERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;classification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# $0.15/1M input
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extraction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# Simple structured output
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;# Needs nuance
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;               &lt;span class="c1"&gt;# Complex decisions
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_generation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Best for code
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;complexity_score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Route to cheapest capable model based on task type and complexity.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;base_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MODEL_TIERS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Override: bump up if complexity is high
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;complexity_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;base_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;base_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;base_model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We score complexity using a fast heuristic — input length, number of distinct entities, presence of code blocks, and whether the request involves multi-step reasoning. The heuristic itself runs on the cheapest model as a pre-filter.&lt;/p&gt;
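
&lt;p&gt;A sketch of what that scorer can look like. This version keeps everything local and rule-based; in our setup the multi-step-reasoning signal comes from the cheap model, so treat the last regex as a stand-in, and the weights as illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

def complexity_score(request: str) -&amp;gt; float:
    """Combine cheap structural signals into a 0-1 score. Weights illustrative."""
    score = 0.0
    score += min(len(request) / 4000, 1.0) * 0.3               # long inputs
    entities = len(set(re.findall(r"\b[A-Z][\w-]+\b", request)))
    score += min(entities / 10, 1.0) * 0.3                     # many distinct entities
    if "```" in request:
        score += 0.2                                           # contains code blocks
    if re.search(r"\b(then|first|step|finally)\b", request, re.I):
        score += 0.2                                           # multi-step phrasing (model-based in production)
    return min(score, 1.0)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;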

&lt;h2&gt;
  
  
  Strategy 4: Context Window Management
&lt;/h2&gt;

&lt;p&gt;This one's underrated. Instead of dumping the entire conversation history into every request, we implemented a sliding window with smart summarization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;prepare_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Keep recent messages verbatim, summarize older ones.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;  &lt;span class="c1"&gt;# Last 2 exchanges verbatim
&lt;/span&gt;    &lt;span class="n"&gt;older&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;older&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt;

    &lt;span class="c1"&gt;# Summarize older context with a cheap model
&lt;/span&gt;    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;older&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prior context: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This alone saved 15-20% on our longer agent conversations without any measurable quality drop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measuring What Matters
&lt;/h2&gt;

&lt;p&gt;None of this works without observability. We track three metrics for every agent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cost per successful task&lt;/strong&gt; — not just cost per request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality score&lt;/strong&gt; — automated eval comparing optimized vs. unoptimized outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt; — cache hits are 50-100x faster than LLM calls&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We built a simple dashboard that shows these per agent, per day. When cost-per-task creeps up, we investigate. When quality drops below threshold, we roll back.&lt;/p&gt;
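
&lt;p&gt;A minimal version of the first metric, with a dataclass standing in for whatever record store you already have:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass
class TaskRecord:          # stand-in for your metrics store
    agent: str
    cost_usd: float        # input + output tokens, priced per model
    succeeded: bool
    latency_s: float

def cost_per_successful_task(records: list[TaskRecord]) -&amp;gt; float:
    """Failed tasks still burn tokens, which is why cost-per-request
    understates the real number."""
    successes = sum(1 for r in records if r.succeeded)
    spend = sum(r.cost_usd for r in records)
    return spend / successes if successes else float("inf")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;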

&lt;p&gt;At &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;RapidClaw&lt;/a&gt;, we've baked these patterns into our agent deployment pipeline so every new agent starts with sane defaults — compressed prompts, caching enabled, model routing configured. It's not glamorous work, but it's the difference between an AI agent project that's a cost center and one that actually scales.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;After implementing all four strategies:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Daily token spend&lt;/td&gt;
&lt;td&gt;~2M&lt;/td&gt;
&lt;td&gt;~540K&lt;/td&gt;
&lt;td&gt;-73%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly cost&lt;/td&gt;
&lt;td&gt;$1,840&lt;/td&gt;
&lt;td&gt;$497&lt;/td&gt;
&lt;td&gt;-73%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg response latency&lt;/td&gt;
&lt;td&gt;2.3s&lt;/td&gt;
&lt;td&gt;0.8s&lt;/td&gt;
&lt;td&gt;-65%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task success rate&lt;/td&gt;
&lt;td&gt;91%&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;td&gt;+3 pts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The latency improvement was an unexpected bonus — cache hits are basically free and instant.&lt;/p&gt;

&lt;p&gt;If you're deploying AI agents and haven't optimized token costs yet, start with prompt compression. It's the fastest win with zero infrastructure changes. Then add caching. Then model routing. Each layer compounds on the last.&lt;/p&gt;

&lt;p&gt;We're building more of these optimization primitives into the &lt;a href="https://rapidclaw.dev/blog" rel="noopener noreferrer"&gt;RapidClaw platform&lt;/a&gt; — if you're running agents in production and want to stop bleeding money on tokens, check it out.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Tijo, founder of RapidClaw. I write about the unglamorous but critical parts of running AI in production. Follow me for more posts on agent ops, infra, and building startups with AI.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>devops</category>
      <category>cloud</category>
    </item>
  </channel>
</rss>
