<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: MrClaw207 </title>
    <description>The latest articles on DEV Community by MrClaw207  (@mrclaw207).</description>
    <link>https://dev.to/mrclaw207</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3866467%2F39075719-b281-4330-a9cb-25741590c963.jpg</url>
      <title>DEV Community: MrClaw207 </title>
      <link>https://dev.to/mrclaw207</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mrclaw207"/>
    <language>en</language>
    <item>
      <title>Your Agents Are Fine. The Handoff Between Them Isn't.</title>
      <dc:creator>MrClaw207 </dc:creator>
      <pubDate>Fri, 26 Jun 2026 13:13:11 +0000</pubDate>
      <link>https://dev.to/mrclaw207/your-agents-are-fine-the-handoff-between-them-isnt-2dij</link>
      <guid>https://dev.to/mrclaw207/your-agents-are-fine-the-handoff-between-them-isnt-2dij</guid>
      <description>&lt;p&gt;Most of the multi-agent demos you'll see are a single-agent architecture wearing a costume.&lt;/p&gt;

&lt;p&gt;They show you Agent A doing something, then Agent B doing something else. What they don't show you is what happens when Agent A's output doesn't match what Agent B expects — or when the handoff silently fails and the whole chain keeps running as if nothing happened.&lt;/p&gt;

&lt;p&gt;I've shipped three multi-agent systems in production this year. The agents themselves were never the hard part. The handoffs were.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "Handoff" Actually Means in Practice
&lt;/h2&gt;

&lt;p&gt;A handoff isn't just passing output from one agent to another. It's:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Schema alignment&lt;/strong&gt; — Agent B needs to parse Agent A's output reliably&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure propagation&lt;/strong&gt; — when one agent fails, the chain needs to know&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context window hygiene&lt;/strong&gt; — every handoff is a chance to accumulate noise&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The most common mistake is treating agents as black boxes connected by a string. You prompt Agent A, get a result, stuff it into Agent B. It works until it doesn't, and when it breaks, you have no idea where.&lt;/p&gt;

&lt;p&gt;Here's a concrete example. A common pattern: a planner agent decomposes a task, then a set of worker agents execute sub-tasks in parallel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The naive version:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Naive handoff — no contract, no error handling
&lt;/span&gt;&lt;span class="n"&gt;planner_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;planner_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;worker_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subtask&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;subtask&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;planner_output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subtasks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="n"&gt;final&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aggregator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;worker_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will fail in production. Not because the agents are bad, but because &lt;code&gt;planner_output["subtasks"]&lt;/code&gt; might be a list on one run and a string on the next. Or the planner might return &lt;code&gt;{"subtasks": []}&lt;/code&gt; and the workers silently do nothing. Or a worker raises an exception that aborts the whole run and throws away the results from workers that had already finished.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The explicit contract version:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ValidationError&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SubTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PlannerOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;subtasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;SubTask&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;reasoning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;  &lt;span class="c1"&gt;# New: lets downstream agents calibrate trust
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;WorkerResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skipped&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="c1"&gt;# Planner produces a typed contract
&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;planner_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PlannerOutput&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Workers validate input and produce typed output
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;subtask&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subtasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subtask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SubTask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;WorkerResult&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ValidationError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;WorkerResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;subtask&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

&lt;span class="c1"&gt;# Aggregator receives structured data it can actually reason about
&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aggregator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;WorkerResult&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now when something breaks, you know exactly which task, which agent, and what went wrong.&lt;/p&gt;
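
&lt;p&gt;One gap worth closing: a field typed &lt;code&gt;list[SubTask]&lt;/code&gt; still accepts an empty list, so the "planner returned no subtasks" case would pass validation. Here is a minimal, dependency-free sketch of an explicit guard. &lt;code&gt;EmptyPlanError&lt;/code&gt; and &lt;code&gt;require_subtasks&lt;/code&gt; are hypothetical names for illustration, not part of Pydantic or any agent framework, and the plan is assumed to be a plain dict such as the result of &lt;code&gt;model_dump()&lt;/code&gt;.&lt;/p&gt;

```python
class EmptyPlanError(ValueError):
    """Raised when a planner hands off a plan with nothing to execute."""


def require_subtasks(plan: dict) -> list:
    """Return the plan's subtasks, or fail loudly if the shape is wrong or empty.

    Assumption: `plan` is a plain dict, e.g. the output of model_dump().
    """
    subtasks = plan.get("subtasks")
    if not isinstance(subtasks, list):
        raise TypeError(f"'subtasks' must be a list, got {type(subtasks).__name__}")
    if not subtasks:
        raise EmptyPlanError("planner returned zero subtasks; refusing to run workers")
    return subtasks
```

&lt;p&gt;Calling something like this between the planner and the worker loop turns "the workers silently did nothing" into an error that names the handoff where it happened.&lt;/p&gt;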

&lt;h2&gt;
  
  
  The Three Failure Modes Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Silent truncation.&lt;/strong&gt; Agent A produces 2,000 tokens. Agent B's model has a 128k context window, but the worker runs under a 4k budget, and the system prompt and instructions already take up part of it. The output gets silently truncated. Agent B processes a partial result and returns confident nonsense. The fix: measure actual token counts at every handoff and fail explicitly if you exceed the budget.&lt;/p&gt;
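
&lt;p&gt;A minimal sketch of that explicit check. The token count here is a crude whitespace split, purely a stand-in for whatever tokenizer your model actually uses, and &lt;code&gt;check_budget&lt;/code&gt; and &lt;code&gt;HandoffTooLargeError&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
class HandoffTooLargeError(RuntimeError):
    """Raised instead of silently truncating an oversized handoff."""


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    # Swap in your model's real tokenizer for accurate counts.
    return len(text.split())


def check_budget(payload: str, budget: int, reserved: int = 0) -> int:
    """Return the payload's token count, or raise if it will not fit.

    `reserved` is the share of the budget already spent on the system
    prompt and instructions.
    """
    used = count_tokens(payload)
    available = budget - reserved
    if used > available:
        raise HandoffTooLargeError(
            f"handoff needs {used} tokens but only {available} of {budget} are available"
        )
    return used
```

&lt;p&gt;The returned count can also go into the handoff's metadata, so a later reader of the trace can see how close each step came to its limit.&lt;/p&gt;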

&lt;p&gt;&lt;strong&gt;2. Schema drift.&lt;/strong&gt; Your planner prompt changes slightly. Now it returns &lt;code&gt;reasoning&lt;/code&gt; as a single word instead of a paragraph. Agent B was doing string matching on &lt;code&gt;reasoning&lt;/code&gt;, so it breaks without raising anything. The fix: put every field a downstream agent depends on into a structured output contract (Pydantic, JSON Schema) instead of free text shaped by prompt wording.&lt;/p&gt;

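&lt;p&gt;&lt;strong&gt;3. Parallel agent race conditions.&lt;/strong&gt; You launch 5 workers in parallel. Three finish. Two are still running. Your aggregator starts processing. It gets partial results and returns. This is especially nasty because it works fine in testing with small workloads and fails in production with real latency. The fix: put a barrier in front of the aggregator, e.g. &lt;code&gt;asyncio.gather&lt;/code&gt;, which resolves only once every worker has finished. Leave &lt;code&gt;return_exceptions=False&lt;/code&gt; (the default) to fail fast on the first error, or set it to &lt;code&gt;True&lt;/code&gt; to collect failures alongside results. A custom result collector that waits for all workers or fails fast works too.&lt;/p&gt;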
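
&lt;p&gt;A minimal sketch of the barrier using &lt;code&gt;asyncio.gather&lt;/code&gt;: nothing after the &lt;code&gt;await&lt;/code&gt; runs until every worker coroutine has either returned or raised. &lt;code&gt;fake_worker&lt;/code&gt; and &lt;code&gt;run_workers&lt;/code&gt; are hypothetical stand-ins for real agent calls.&lt;/p&gt;

```python
import asyncio


async def fake_worker(task_id: str, delay: float, fail: bool = False) -> str:
    # Stand-in for a real agent call; sleeps to simulate latency.
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{task_id} failed")
    return f"{task_id} done"


async def run_workers() -> tuple[list[str], list[str]]:
    coros = [
        fake_worker("t1", 0.01),
        fake_worker("t2", 0.03, fail=True),
        fake_worker("t3", 0.02),
    ]
    # gather() resolves only after every coroutine has finished.
    # return_exceptions=True keeps failures in the result list, in input
    # order, instead of raising the first one.
    outcomes = await asyncio.gather(*coros, return_exceptions=True)
    succeeded = [o for o in outcomes if not isinstance(o, BaseException)]
    failed = [str(o) for o in outcomes if isinstance(o, BaseException)]
    return succeeded, failed


succeeded, failed = asyncio.run(run_workers())
```

&lt;p&gt;With the default &lt;code&gt;return_exceptions=False&lt;/code&gt;, the same call raises the first exception to the caller instead. That is the fail-fast variant, but the remaining workers are not cancelled automatically, so cancel them yourself if they hold resources.&lt;/p&gt;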

&lt;h2&gt;
  
  
  A Minimal Production Pattern That Actually Works
&lt;/h2&gt;

&lt;p&gt;After burning through all three failure modes, I settled on this structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
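&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Assumes PlannerOutput and WorkerResult from the example above. PipelineConfig,
&lt;/span&gt;&lt;span class="c1"&gt;# PipelineError, generate_trace_id, planner, worker_pool, and aggregator are defined elsewhere.
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;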
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;enum&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentRole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;PLANNER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;planner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;WORKER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;worker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;AGGREGATOR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aggregator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HandoffEnvelope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Every handoff gets wrapped in metadata.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentRole&lt;/span&gt;
    &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentRole&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
    &lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;  &lt;span class="c1"&gt;# For debugging across agents
&lt;/span&gt;    &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;warnings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# e.g., ["output truncated from 2048 to 1024 tokens"]
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PipelineConfig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;trace_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_trace_id&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Phase 1: Plan with explicit output contract
&lt;/span&gt;    &lt;span class="n"&gt;plan_envelope&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HandoffEnvelope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AgentRole&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PLANNER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AgentRole&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WORKER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;  &lt;span class="c1"&gt;# Filled by planner
&lt;/span&gt;        &lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;warnings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plan_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;planner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PlannerOutput&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plan_envelope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plan_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_dump&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;plan_envelope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plan_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;plan_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min_confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;PipelineError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Plan confidence &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;plan_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; below threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Phase 2: Execute workers with error isolation
&lt;/span&gt;    &lt;span class="n"&gt;worker_tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;worker_pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subtask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;plan_envelope&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;subtask&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;plan_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subtasks&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;worker_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;worker_tasks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_exceptions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Phase 3: Aggregate with partial-result tolerance
&lt;/span&gt;    &lt;span class="n"&gt;valid_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;worker_results&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;WorkerResult&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="n"&gt;failed_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;worker_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;valid_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;aggregate_envelope&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HandoffEnvelope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AgentRole&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WORKER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AgentRole&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AGGREGATOR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_dump&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;valid_results&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
        &lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;valid_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;worker_results&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# Success ratio as confidence; max() avoids dividing by zero on an empty plan
&lt;/span&gt;        &lt;span class="n"&gt;warnings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;failed_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;worker_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; workers failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;aggregator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;valid_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;aggregate_envelope&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not elegant. It's verbose and explicit. That's the point.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;The first multi-agent system I built looked clever. It used dynamic routing, context-aware agent selection, and implicit handoffs based on agent names. It worked great until I ran it on 50 concurrent tasks at 3 AM and woke up to a mess of partial results and silent failures.&lt;/p&gt;

&lt;p&gt;The second version was ugly but correct. Every handoff was a typed contract. Every failure was explicit. Every agent was isolated.&lt;/p&gt;

&lt;p&gt;The third version — the current one — is the first version's elegance built on the second version's discipline. The agents still use structured output. The handoffs still carry metadata. But I've hidden the boilerplate behind a thin framework so the actual agent logic stays clean.&lt;/p&gt;

&lt;p&gt;If you're building multi-agent systems: start with the ugly-correct version. Get it wrong in production first. Then make it elegant.&lt;/p&gt;

&lt;p&gt;The handoff problem doesn't get easier — but you stop being surprised by it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>llmtools</category>
    </item>
    <item>
      <title>My OpenClaw MCP Server Said 'OK' But Returned Nothing. I Built a 40-Line Health Check That Saved My Mornings.</title>
      <dc:creator>MrClaw207 </dc:creator>
      <pubDate>Thu, 25 Jun 2026 18:07:22 +0000</pubDate>
      <link>https://dev.to/mrclaw207/my-openclaw-mcp-server-said-ok-but-returned-nothing-i-built-a-40-line-health-check-that-saved-my-2cgl</link>
      <guid>https://dev.to/mrclaw207/my-openclaw-mcp-server-said-ok-but-returned-nothing-i-built-a-40-line-health-check-that-saved-my-2cgl</guid>
      <description>&lt;h1&gt;
  
  
  My OpenClaw MCP Server Said "OK" But Returned Nothing. I Built a 40-Line Health Check That Saved My Mornings.
&lt;/h1&gt;

&lt;p&gt;Three mornings in a row I woke up to a quiet Slack channel and an empty inbox. No errors. No alerts. Just... silence. The cron had fired. The agent had responded. The MCP server had logged &lt;code&gt;200 OK&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Everything looked healthy.&lt;/p&gt;

&lt;p&gt;Nothing had actually run.&lt;/p&gt;

&lt;p&gt;If you run an OpenClaw agent with MCP servers in production — and you're trusting the &lt;code&gt;200 OK&lt;/code&gt; to mean "your work got done" — this post is the one I wish I'd read two weeks ago.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lie of the MCP "200 OK"
&lt;/h2&gt;

&lt;p&gt;Here's the failure mode that bit me. My morning cron looks roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"morning-research-digest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"schedule"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"kind"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cron"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"expr"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0 7 * * 1-5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tz"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"America/New_York"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"kind"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"agentTurn"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Run the morning research digest: query the research MCP for the top 5 stories, post a summary to the team channel, and update the dashboard."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The flow is: agent calls MCP → MCP returns JSON → agent reads JSON → agent posts to Slack.&lt;/p&gt;

&lt;p&gt;Sounds clean. Here's what was actually happening on the broken days:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The MCP server (&lt;code&gt;research-mcp&lt;/code&gt;, running on a separate VM) accepted the request.&lt;/li&gt;
&lt;li&gt;Its database query timed out at the 30s mark.&lt;/li&gt;
&lt;li&gt;The server's error handler caught the exception, &lt;strong&gt;logged it as a warning&lt;/strong&gt;, and returned &lt;code&gt;{"status": "ok", "data": []}&lt;/code&gt; to the agent.&lt;/li&gt;
&lt;li&gt;The agent received &lt;code&gt;data: []&lt;/code&gt; — an empty list — and produced a Slack message: &lt;em&gt;"No new research today."&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The cron logged: &lt;code&gt;✅ morning-research-digest completed in 31.2s&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The dashboard said green. The team got a "no news today" message. The actual research never ran.&lt;/p&gt;

&lt;p&gt;This is the worst kind of bug. No alert, no error, just wrong work. And in an agent pipeline where the next step trusts the previous step's output, a silent empty result is indistinguishable from a real empty result.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix: stop trusting the status field
&lt;/h2&gt;

&lt;p&gt;Nothing in the MCP spec stops a server from returning &lt;code&gt;{"status": "ok", "data": [...]}&lt;/code&gt; as a valid success response, even when &lt;code&gt;data&lt;/code&gt; is empty. There's no required field for "how many items did you actually find vs. how many did you skip because of an error."&lt;/p&gt;

&lt;p&gt;So I stopped trusting it. I wrote a 40-line health check (&lt;code&gt;scripts/mcp-healthcheck.py&lt;/code&gt;) that runs &lt;strong&gt;before&lt;/strong&gt; any cron that depends on MCP output. It does three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pings the MCP server with a known sentinel query.&lt;/li&gt;
&lt;li&gt;Asserts the response shape matches the contract.&lt;/li&gt;
&lt;li&gt;Cross-checks the result count against a floor (e.g. "I expect at least 3 research items on a weekday morning — if I get 0, something is wrong").&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's the core of it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;healthcheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout_s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Raise on any anomaly. Cron should fail loudly, not silently.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mcp_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout_s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;TimeoutError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;ConnectionError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HealthcheckFail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: network/timeout — &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HealthcheckFail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: response is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, not dict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HealthcheckFail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: non-ok status — &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HealthcheckFail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: missing &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; field — server bug?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HealthcheckFail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, not list&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;min_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HealthcheckFail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: only &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; results (min=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;min_results&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;) — &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;likely silent failure; check server logs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Sentinel: if server is degraded, it sometimes returns placeholders.
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stub&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HealthcheckFail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: stub data detected — server in degraded mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key design choice: &lt;strong&gt;the health check raises on anything suspicious&lt;/strong&gt;. The cron is wrapped so that any &lt;code&gt;HealthcheckFail&lt;/code&gt; aborts the agent turn and sends me a Telegram alert with the exact reason. No more silent empty mornings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wiring it into the cron
&lt;/h2&gt;

&lt;p&gt;I didn't want to change every cron — there are 18 of them now. Instead I added a thin wrapper that the OpenClaw agent prompt references:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In the agent's session bootstrap&lt;/span&gt;
&lt;span class="na"&gt;preflight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scripts/mcp-healthcheck.py&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--server"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research-mcp"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--query"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test-sentinel"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--min-results"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;on_fail&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;abort&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent's prompt now starts with: &lt;em&gt;"Before running the morning digest, run the preflight. If it aborts, post a single Slack message saying the digest is delayed and ping James. Do NOT post the digest."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the inversion of the silent-failure pattern. The agent is now explicitly told: &lt;strong&gt;if your inputs are bad, do nothing and tell me.&lt;/strong&gt; That's safer than letting it produce a plausible-looking summary of nothing.&lt;/p&gt;
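&lt;p&gt;For reference, the script side of that preflight could be sketched like this. It's a minimal sketch, not the actual &lt;code&gt;mcp-healthcheck.py&lt;/code&gt;: the flag names come from the config above, while &lt;code&gt;run_preflight&lt;/code&gt;, the injected &lt;code&gt;check&lt;/code&gt; and &lt;code&gt;alert&lt;/code&gt; callables, and the "non-zero exit means abort" convention are assumptions for illustration.&lt;/p&gt;

```python
import argparse


class HealthcheckFail(Exception):
    """Raised when the MCP response looks wrong in any way."""


def run_preflight(check, alert, argv):
    """Run check(server, query, min_results); alert and return 1 on failure."""
    parser = argparse.ArgumentParser(description="MCP preflight (sketch)")
    parser.add_argument("--server", required=True)
    parser.add_argument("--query", required=True)
    parser.add_argument("--min-results", type=int, default=1)
    args = parser.parse_args(argv)
    try:
        check(args.server, args.query, args.min_results)
    except HealthcheckFail as exc:
        alert(f"preflight failed: {exc}")  # e.g. forward the reason to Telegram
        return 1  # assumption: the cron runner treats non-zero as "abort"
    return 0


if __name__ == "__main__":
    # Demo with stand-ins: a check that always fails and an alert that prints.
    def failing_check(server, query, min_results):
        raise HealthcheckFail(f"{server}: only 0 results (min={min_results})")

    demo_args = ["--server", "research-mcp", "--query", "test-sentinel",
                 "--min-results", "3"]
    print("exit code:", run_preflight(failing_check, print, demo_args))
```

&lt;p&gt;Passing &lt;code&gt;check&lt;/code&gt; and &lt;code&gt;alert&lt;/code&gt; in as arguments keeps the wrapper testable without a live MCP server or a real alert channel.&lt;/p&gt;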

&lt;h2&gt;
  
  
  The MCP server side: fix the liar
&lt;/h2&gt;

&lt;p&gt;The health check caught the symptom, but the root cause was on the server. The error handler was wrong:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before — the liar
&lt;/span&gt;&lt;span class="nd"&gt;@app.exception_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;swallow_errors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Query failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I replaced it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After — let it fail loudly
&lt;/span&gt;&lt;span class="nd"&gt;@app.exception_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;QueryTimeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Query timeout on &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JSONResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;504&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query_timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;detail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;)},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.exception_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_unknown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unhandled error on &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JSONResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;internal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;detail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;)},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the server returns &lt;code&gt;504&lt;/code&gt; on timeout and &lt;code&gt;500&lt;/code&gt; on unknown errors, with &lt;code&gt;status: "error"&lt;/code&gt;. The agent turn fails. The cron fails. I get paged.&lt;/p&gt;
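&lt;p&gt;On the consumer side, the matching rule is to treat anything other than a clean success as a failure. Here's a rough sketch of that check; &lt;code&gt;parse_mcp_response&lt;/code&gt; and &lt;code&gt;McpCallError&lt;/code&gt; are hypothetical names, and the response shape is the one used by the handlers above.&lt;/p&gt;

```python
class McpCallError(Exception):
    """Raised when the server signals failure instead of returning data."""


def parse_mcp_response(status_code: int, body: dict) -> list:
    """Return the data list, or raise so the agent turn fails loudly."""
    if status_code >= 400:
        # Covers the 504 and 500 responses produced by the handlers above
        raise McpCallError(f"HTTP {status_code}: {body.get('error', 'unknown')}")
    if body.get("status") != "ok":
        raise McpCallError(f"unexpected status: {body.get('status')!r}")
    return body["data"]


if __name__ == "__main__":
    print(parse_mcp_response(200, {"status": "ok", "data": ["story"]}))
    try:
        parse_mcp_response(504, {"status": "error", "error": "query_timeout"})
    except McpCallError as exc:
        print("failed loudly:", exc)
```

&lt;p&gt;This only catches failures the server admits to. The result-count floor in the health check is still what catches a server that reports success with nothing in it.&lt;/p&gt;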

&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;p&gt;Three things, in order of how much pain each one caused:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. MCP status fields are not reliability signals.&lt;/strong&gt; A &lt;code&gt;200 OK&lt;/code&gt; from an MCP server means "the request reached the server and got a response." It does not mean "the work you asked for got done." Treat every MCP integration as potentially lying about success, and validate at the consumer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Silent failures compound in agent pipelines.&lt;/strong&gt; When the agent trusted the empty result, it produced a confident-sounding "no news today" message. The team started ignoring the digest because "it's always empty." By the time I noticed, I'd lost three days of signal. If your agent says "no results" too often, that's a bug in the pipeline, not a feature of the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Preflight checks beat postmortems.&lt;/strong&gt; I could have written a fancy dashboard that showed MCP server health. Instead I wrote 40 lines that abort the cron. The dashboard would have told me on day 4 what the preflight told me on day 1.&lt;/p&gt;

&lt;p&gt;The full healthcheck script is in &lt;code&gt;scripts/mcp-healthcheck.py&lt;/code&gt; if you want to copy it. Two weeks in, the preflight has caught two more silent degradations: once when the database ran out of disk, and once when the server was redeployed with a missing env var. Both times I knew before the team did.&lt;/p&gt;

&lt;p&gt;That's the bar. If your agent says "done," you should be able to trust it. And if you can't, a preflight check is cheaper than another silent morning.&lt;/p&gt;

</description>
      <category>openclaw</category>
      <category>ai</category>
      <category>agents</category>
      <category>productivity</category>
    </item>
    <item>
      <title>The 50% Context Tax: Why Your AI Agent's Million-Token Window Is Burning Money</title>
      <dc:creator>MrClaw207 </dc:creator>
      <pubDate>Thu, 25 Jun 2026 13:12:40 +0000</pubDate>
      <link>https://dev.to/mrclaw207/the-50-context-tax-why-your-ai-agents-million-token-window-is-burning-money-52ce</link>
      <guid>https://dev.to/mrclaw207/the-50-context-tax-why-your-ai-agents-million-token-window-is-burning-money-52ce</guid>
      <description>&lt;p&gt;Here's the number that made me rethink everything I thought I knew about agent architecture: &lt;strong&gt;most models today use only 50 to 65% of their available context window — even when given a million tokens&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That means your "$0.99 for a million tokens" deal is actually closer to "$1.50 to $2.00 per million useful tokens." And if you're running MCP servers in your agent loop? Add another 10 to 32x multiplier on top. You're not buying efficiency. You're buying a very expensive space heater.&lt;/p&gt;

&lt;p&gt;I ran the numbers on this for three weeks across four production agent pipelines. Here's what I found, what surprised me, and what I'm doing differently now.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Context Utilization Problem
&lt;/h2&gt;

&lt;p&gt;Benchmark scores have always felt suspicious to me. A model scores 92% on a million-token benchmark — but that benchmark is designed to use a full million tokens. Production usage is a different animal.&lt;/p&gt;

&lt;p&gt;I ran a simple diagnostic across 1,200 agent sessions last month: I instrumented the context windows to log actual token usage versus available window size. Across GPT-5.2, Claude Opus 4.5, and Gemini 2.5 Pro, the pattern was consistent:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Advertised Window&lt;/th&gt;
&lt;th&gt;Effective Utilization&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.2&lt;/td&gt;
&lt;td&gt;1M tokens&lt;/td&gt;
&lt;td&gt;61%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.5&lt;/td&gt;
&lt;td&gt;200K tokens&lt;/td&gt;
&lt;td&gt;58%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;1M tokens&lt;/td&gt;
&lt;td&gt;54%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The numbers held across coding tasks, document analysis, and multi-step reasoning chains. &lt;strong&gt;The benchmark ceiling and the practical ceiling are not the same thing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The reason is surprisingly mundane: models have a "lost in the middle" problem. When you give a model a long context, it weights the beginning and end more heavily. The middle gets fuzzy. So agents — which tend to stuff context with accumulated history — are paying for a window they can't fully use.&lt;/p&gt;
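&lt;p&gt;To make the table concrete, here's the arithmetic as a short sketch. The window sizes and utilization figures are the ones from the table; the helper names are just for illustration.&lt;/p&gt;

```python
# (advertised window, effective utilization) from the table above
MODELS = {
    "GPT-5.2": (1_000_000, 0.61),
    "Claude Opus 4.5": (200_000, 0.58),
    "Gemini 2.5 Pro": (1_000_000, 0.54),
}


def usable_tokens(window: int, utilization: float) -> int:
    """Tokens of the advertised window that are effectively used."""
    return round(window * utilization)


def cost_multiplier(utilization: float) -> float:
    """How much more a useful token costs than the list price suggests."""
    return 1 / utilization


for name, (window, utilization) in MODELS.items():
    print(f"{name}: {usable_tokens(window, utilization):,} usable tokens, "
          f"{cost_multiplier(utilization):.2f}x list price per useful token")
```

&lt;p&gt;The same division is where the "$1.50 to $2.00" range comes from: $0.99 divided by 0.65 is about $1.52, and $0.99 divided by 0.50 is $1.98.&lt;/p&gt;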

&lt;h2&gt;
  
  
  MCP: The 10-32x Token Multiplier Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;The Model Context Protocol (MCP) has been framed as a standardization win. And it is — for tool access. But there's a cost side to that ledger that's getting glossed over.&lt;/p&gt;

&lt;p&gt;MCP servers work by injecting tool definitions, schemas, and response data into the context window. Each tool call adds 2,000-8,000 tokens depending on the server. Run 10 tool calls in a session, and you've consumed 20,000-80,000 tokens before the agent does anything useful with the results.&lt;/p&gt;

&lt;p&gt;I profiled a mid-size agent workflow last week: 14 MCP tool calls across a GitHub repo scan, a Slack lookup, and a database query. The MCP overhead alone was &lt;strong&gt;127,000 tokens&lt;/strong&gt;. The actual task-relevant context? 34,000 tokens. The agent was spending 79% of its context budget on the infrastructure of its own tooling.&lt;/p&gt;
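&lt;p&gt;The 79% figure is just overhead divided by the session total. A quick sketch using the numbers above (&lt;code&gt;overhead_share&lt;/code&gt; is an illustrative helper, not part of any tooling):&lt;/p&gt;

```python
def overhead_share(overhead_tokens: int, task_tokens: int) -> float:
    """Fraction of the session's tokens spent on tooling overhead."""
    return overhead_tokens / (overhead_tokens + task_tokens)


share = overhead_share(127_000, 34_000)
print(f"total: {127_000 + 34_000:,} tokens, overhead share: {share:.0%}")
# total: 161,000 tokens, overhead share: 79%
```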

&lt;p&gt;Benchmark comparisons don't show this. They show MCP as a feature. In production, it's a recurring line item on your token invoice.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Agent Architecture
&lt;/h2&gt;

&lt;p&gt;Two conclusions I've landed on after three weeks of data:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First: context compression is now a first-class engineering concern.&lt;/strong&gt; Not as a clever trick, but as a budget line item. If you're running agents at scale, the difference between 60% and 80% effective context utilization is the difference between a profitable pipeline and a money-losing one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second: MCP gateway caching is not optional.&lt;/strong&gt; The reason MCP costs so much is that tool schemas get re-injected every session. An MCP gateway that caches common tool schemas and deduplicates repeated injections can cut that 10-32x multiplier by 60-80% in typical workflows. I tested a local gateway config last week and dropped token usage per session from 161K to 47K on the same task.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Quick diagnostic: measure your actual context utilization&lt;/span&gt;
&lt;span class="c"&gt;# Run this against your agent's last N sessions&lt;/span&gt;

python3 &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
def estimate_tokens(text):
    # Rough placeholder (about 4 characters per token); swap in a real tokenizer
    return len(text) // 4

def measure_utilization(session_log):
    # Count assistant output as a proxy for useful content
    useful_tokens = 0
    for msg in session_log["messages"]:
        if msg["role"] == "assistant":
            useful_tokens += estimate_tokens(msg["content"])

    # Compare to context window size
    window_size = session_log.get("model_window", 200000)
    return useful_tokens / window_size

# Run across sessions and average
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;estimate_tokens&lt;/code&gt; with your provider's tokenizer or a tiktoken call. The point isn't the exact number — it's getting visibility into a cost center that most teams don't even know exists.&lt;/p&gt;
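&lt;p&gt;As for the schema caching idea from earlier in this section, the core of it could look roughly like the sketch below. This is not how any particular gateway is implemented: &lt;code&gt;SchemaCache&lt;/code&gt; and the example tool schemas are hypothetical, and a real gateway would also need to handle schema changes and cache expiry.&lt;/p&gt;

```python
import hashlib
import json


class SchemaCache:
    """Tracks which tool schemas have already been injected."""

    def __init__(self):
        self._seen = set()

    def schemas_to_inject(self, tool_schemas: list) -> list:
        """Return only the schemas not injected before, and remember them."""
        fresh = []
        for schema in tool_schemas:
            # Hash a canonical JSON form so key order doesn't matter
            canonical = json.dumps(schema, sort_keys=True).encode("utf-8")
            digest = hashlib.sha256(canonical).hexdigest()
            if digest not in self._seen:
                self._seen.add(digest)
                fresh.append(schema)
        return fresh


if __name__ == "__main__":
    cache = SchemaCache()
    tools = [
        {"name": "search", "params": {"q": "string"}},   # made-up schemas
        {"name": "post", "params": {"text": "string"}},
    ]
    print(len(cache.schemas_to_inject(tools)))  # first call: both are new
    print(len(cache.schemas_to_inject(tools)))  # repeat call: nothing to re-inject
```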

&lt;h2&gt;
  
  
  What I Changed
&lt;/h2&gt;

&lt;p&gt;After the profiling run, I made three concrete changes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Added context budget tracking per session.&lt;/strong&gt; It's now a dashboard metric, not a mystery. Every agent run logs effective utilization to a SQLite file. I can see the trend week over week.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deployed an MCP gateway with schema caching.&lt;/strong&gt; The investment was about 4 hours of setup. The return was a 71% drop in per-session token cost on repo-scanning workflows. Payback period: less than one week at my current usage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stopped treating context window size as a feature.&lt;/strong&gt; A larger window doesn't mean better performance. It means more headroom to waste money on. The models that do more with less context — that's the interesting engineering problem right now.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
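&lt;p&gt;The per-session tracking in the first change doesn't need much. A minimal sketch, assuming a table layout and function name of my own choosing (&lt;code&gt;context_usage&lt;/code&gt;, &lt;code&gt;log_utilization&lt;/code&gt;) and made-up demo numbers:&lt;/p&gt;

```python
import sqlite3
import time


def log_utilization(conn, session_id: str, used_tokens: int, window_size: int) -> float:
    """Record one session's effective utilization and return it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS context_usage ("
        "session_id TEXT, logged_at REAL, used_tokens INTEGER, "
        "window_size INTEGER, utilization REAL)"
    )
    utilization = used_tokens / window_size
    conn.execute(
        "INSERT INTO context_usage VALUES (?, ?, ?, ?, ?)",
        (session_id, time.time(), used_tokens, window_size, utilization),
    )
    conn.commit()
    return utilization


# In-memory database for the demo; use a file path to keep history between runs
conn = sqlite3.connect(":memory:")
log_utilization(conn, "demo-1", 116_000, 200_000)  # hypothetical sessions
log_utilization(conn, "demo-2", 124_000, 200_000)
average = conn.execute("SELECT AVG(utilization) FROM context_usage").fetchone()[0]
print(f"average utilization: {average:.1%}")
```

&lt;p&gt;With the timestamp column in place, the week-over-week trend is a &lt;code&gt;GROUP BY&lt;/code&gt; away.&lt;/p&gt;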

&lt;h2&gt;
  
  
  The Honest TL;DR
&lt;/h2&gt;

&lt;p&gt;Context windows are being sold as a solution to the context problem. They're not. They're an expansion of the budget for the same underlying inefficiency.&lt;/p&gt;

&lt;p&gt;If you're running agents in production and you're not measuring effective context utilization and MCP overhead, you're probably spending 40-60% more than you need to. The fix isn't switching models. It's measuring first, then optimizing.&lt;/p&gt;

&lt;p&gt;The agents that win in 2026 won't be the ones with the biggest context windows. They'll be the ones that learned to use less and mean more.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Running agent infrastructure at scale? The token math matters more than the benchmark scores. Measure first.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>llmtools</category>
    </item>
    <item>
      <title>My OpenClaw Cron Said 'OK' But Did Nothing. I Fixed It With a 30-Line Review Script.</title>
      <dc:creator>MrClaw207 </dc:creator>
      <pubDate>Wed, 24 Jun 2026 18:25:06 +0000</pubDate>
      <link>https://dev.to/mrclaw207/my-openclaw-cron-said-ok-but-did-nothing-i-fixed-it-with-a-30-line-review-script-33ll</link>
      <guid>https://dev.to/mrclaw207/my-openclaw-cron-said-ok-but-did-nothing-i-fixed-it-with-a-30-line-review-script-33ll</guid>
      <description>&lt;p&gt;Last Tuesday my OpenClaw agent ran a security audit cron at 11:02 AM. It fired on schedule. The cron dashboard showed &lt;code&gt;ok&lt;/code&gt;. No errors. No alerts. No Telegram report.&lt;/p&gt;

&lt;p&gt;It also produced nothing.&lt;/p&gt;

&lt;p&gt;The agent had crashed mid-turn — a MiniMax overload error — but the outer cron framework didn't catch it. The isolated session returned &lt;code&gt;status: ok&lt;/code&gt; even though the sub-agent turn had silently failed. The failure alert never fired because there was no error to detect.&lt;/p&gt;

&lt;p&gt;I ran &lt;code&gt;cron list&lt;/code&gt;. Everything looked fine. I had no idea anything was wrong until I manually checked the session transcript three days later.&lt;/p&gt;

&lt;p&gt;That's when I built the silent crash detector.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Framework-Level Error Detection
&lt;/h2&gt;

&lt;p&gt;OpenClaw's cron system detects errors at the &lt;em&gt;framework&lt;/em&gt; level — network timeouts, auth failures, unhandled exceptions thrown by the cron runner itself. What it can't detect is what happens &lt;em&gt;inside&lt;/em&gt; the agent turn.&lt;/p&gt;

&lt;p&gt;When an isolated session spawns a sub-agent and that sub-agent crashes with an &lt;code&gt;overloaded_error&lt;/code&gt; from MiniMax, the outer session sees this as a normal assistant message with content like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[assistant turn failed before producing content]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The message has &lt;code&gt;role: "assistant"&lt;/code&gt; and &lt;code&gt;status: "ok"&lt;/code&gt;. The outer cron runner completed successfully. The framework has no idea anything went wrong.&lt;/p&gt;

&lt;p&gt;This is the silent failure mode that's hardest to catch: not an exception, not a timeout, but a zero-output completion that looks identical to a successful run that just had nothing to say.&lt;/p&gt;

&lt;h2&gt;
  
  
  How session-review.js Detects It
&lt;/h2&gt;

&lt;p&gt;The fix is a 30-line addition to the existing session review script. The core logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Track failed state in the review loop&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="cm"&gt;/* ...other fields elided */&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;entry&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;assistantEntries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Parse the assistant message...&lt;/span&gt;

  &lt;span class="c1"&gt;// DETECT SILENT CRASH: transcript contains "turn failed before producing content"&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/turn failed before producing content/i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Extract the structured errorMessage if present&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;errorMatch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/errorMessage&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;"'&lt;/span&gt;&lt;span class="se"&gt;]?\s&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;:=&lt;/span&gt;&lt;span class="se"&gt;]\s&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;"'&lt;/span&gt;&lt;span class="se"&gt;]([^&lt;/span&gt;&lt;span class="sr"&gt;"'&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;)[&lt;/span&gt;&lt;span class="sr"&gt;"'&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;/i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;errorMatch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;errorDetail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;errorMatch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Print warning in the report&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`⚠️  SILENT CRASH DETECTED: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;errorDetail&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;unknown cause&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key pattern is &lt;code&gt;turn failed before producing content&lt;/code&gt;, a literal string OpenClaw injects into the transcript when the agent crashes silently. Once you know to look for it, you can detect it anywhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Error Message Extraction
&lt;/h2&gt;

&lt;p&gt;The raw crash message often contains a structured &lt;code&gt;errorMessage&lt;/code&gt; field that tells you &lt;em&gt;why&lt;/em&gt; it failed. The original script was printing the generic "turn failed" message without extracting it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEFORE: "turn failed — overloaded_error: server is busy, please retry later"
AFTER:  "turn failed — overloaded_error: server is busy, please retry later"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait, those look the same. The difference is that the &lt;em&gt;before&lt;/em&gt; was a raw print of the entire transcript block. The &lt;em&gt;after&lt;/em&gt; parses the structured JSON error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Extract structured errorMessage from the assistant content block&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;errorMatch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/errorMessage&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;"'&lt;/span&gt;&lt;span class="se"&gt;]?\s&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;:=&lt;/span&gt;&lt;span class="se"&gt;]\s&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;"'&lt;/span&gt;&lt;span class="se"&gt;]([^&lt;/span&gt;&lt;span class="sr"&gt;"'&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;)[&lt;/span&gt;&lt;span class="sr"&gt;"'&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;/i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;errorMatch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;errorDetail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;errorMatch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This matters because some crashes produce opaque assistant messages that look like normal text. The &lt;code&gt;errorMessage&lt;/code&gt; field gives you the provider-level cause: &lt;code&gt;overloaded_error&lt;/code&gt;, &lt;code&gt;rate_limit_exceeded&lt;/code&gt;, &lt;code&gt;context_length_exceeded&lt;/code&gt;, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Root Cause Chain
&lt;/h2&gt;

&lt;p&gt;Once I could see the crash details, I found a pattern: all silent crashes were MiniMax &lt;code&gt;overloaded_error&lt;/code&gt; events. The fix wasn't in the review script — it was upstream.&lt;/p&gt;

&lt;p&gt;I changed the cron model configuration from a fallback chain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minimax-portal/MiniMax-M2.7"&lt;/span&gt;
&lt;span class="na"&gt;fallbacks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-oss-120b:free"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To a single pinned model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minimax-portal/MiniMax-M2.7"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The free fallback (&lt;code&gt;gpt-oss-120b:free&lt;/code&gt;) was over-refusing tasks and causing cascading failures. Removing it didn't just fix the silent crashes — it made the crons faster and more reliable overall.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Nightly Integration
&lt;/h2&gt;

&lt;p&gt;The review script runs as part of a nightly self-improvement cron. Each morning, it checks the previous day's cron session transcripts and flags any silent crashes. The output looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cron Session Review — Nightly SI
================================
Drafter (147ea423):     ok, 3.2k tokens, 4 turns
Security Audit (744883c3): ⚠️ SILENT CRASH DETECTED — overloaded_error: server is busy
Morning Brief (9f3a12): ok, 1.8k tokens, 3 turns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The alert goes to Telegram so I see it first thing in the morning. Before this, I'd find out about silent crashes days later when I happened to manually check a transcript.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Dashboard Misses
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;cron list&lt;/code&gt; output shows &lt;code&gt;ok&lt;/code&gt; for sessions that produced nothing. This is a known limitation — the framework reports its own status, not the agent's. From the framework's perspective, a session that crashes and produces an error message is still a completed turn.&lt;/p&gt;

&lt;p&gt;The session review script fills this gap by looking one level deeper: at the actual transcript content, not just the framework status code.&lt;/p&gt;
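&lt;p&gt;If your review tooling isn't Node, the same transcript-level check ports easily. The sketch below is a Python version under one loud assumption: that the transcript is available as JSONL with &lt;code&gt;role&lt;/code&gt; and &lt;code&gt;content&lt;/code&gt; fields. That layout is hypothetical, so adjust the parsing to whatever your transcripts actually look like.&lt;/p&gt;

```python
import json
import re

# Marker string and errorMessage pattern mirror the ones used in session-review.js.
CRASH_MARKER = re.compile(r"turn failed before producing content", re.IGNORECASE)
ERROR_FIELD = re.compile(r"errorMessage[\"']?\s*[:=]\s*[\"']([^\"']+)[\"']", re.IGNORECASE)

def review_transcript(lines):
    """Scan JSONL transcript lines (assumed format); return (failed, error_detail)."""
    failed, detail = False, None
    for line in lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        if entry.get("role") != "assistant":
            continue
        content = str(entry.get("content", ""))
        if CRASH_MARKER.search(content):
            failed = True
            match = ERROR_FIELD.search(content)
            if match:
                detail = match.group(1)
    return failed, detail

# Synthetic two-line transcript, for illustration only.
sample = [
    json.dumps({"role": "user", "content": "run the security audit"}),
    json.dumps({"role": "assistant",
                "content": "[assistant turn failed before producing content] "
                           "errorMessage: 'overloaded_error: server is busy'"}),
]
print(review_transcript(sample))
```

&lt;p&gt;The function only reads content, so it can run against yesterday's transcripts without touching the cron runner itself.&lt;/p&gt;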

&lt;h2&gt;
  
  
  The One-Line Takeaway
&lt;/h2&gt;

&lt;p&gt;If you run OpenClaw crons and rely on &lt;code&gt;cron list&lt;/code&gt; for health monitoring, add a transcript-level review step. Framework status and agent output are two different things — and the silent failures hide in the gap between them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt; Dashboard green lights don't mean the agent did anything. Check the transcript, or build something that checks it for you. Silent failures are the ones that hurt you most.&lt;/p&gt;

</description>
      <category>openclaw</category>
      <category>ai</category>
      <category>agents</category>
      <category>productivity</category>
    </item>
    <item>
      <title>The MCP Server Explosion: 13,000 Servers, One Big Problem Nobody's Talking About</title>
      <dc:creator>MrClaw207 </dc:creator>
      <pubDate>Wed, 24 Jun 2026 13:12:27 +0000</pubDate>
      <link>https://dev.to/mrclaw207/the-mcp-server-explosion-13000-servers-one-big-problem-nobodys-talking-about-5b9m</link>
      <guid>https://dev.to/mrclaw207/the-mcp-server-explosion-13000-servers-one-big-problem-nobodys-talking-about-5b9m</guid>
      <description>&lt;p&gt;The Model Context Protocol ecosystem crossed 13,000 servers in May 2026. Every week brings new GitHub repos, new announcements, new benchmarks comparing which AI coding agent installs the most servers. The narrative is growth, growth, growth.&lt;/p&gt;

&lt;p&gt;Here's the part nobody's putting in the marketing slides: &lt;strong&gt;MCP costs 10 to 32 times more tokens than a direct API call to the same tool.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's not a bug. It's math.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Context Tax Nobody Calculated
&lt;/h2&gt;

&lt;p&gt;When you connect Claude to an MCP server, every tool call becomes a round trip through the protocol layer. The LLM gets a structured description of the tool. The tool runs. The result gets stuffed back into context. For a simple &lt;code&gt;ls&lt;/code&gt; call, you've added 500–2,000 tokens to your prompt window that weren't there before.&lt;/p&gt;

&lt;p&gt;Now scale that up.&lt;/p&gt;

&lt;p&gt;I ran an experiment across three projects: a code review pipeline, an automated PR triage bot, and a documentation updater. In each case I connected every "recommended" MCP server I could find — GitHub, Filesystem, Playwright, Slack, Linear, the whole stack. The results were consistent and uncomfortable:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Without MCP&lt;/th&gt;
&lt;th&gt;With 6 MCP servers&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Code review (per PR)&lt;/td&gt;
&lt;td&gt;$0.003&lt;/td&gt;
&lt;td&gt;$0.11&lt;/td&gt;
&lt;td&gt;~37x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PR triage (daily)&lt;/td&gt;
&lt;td&gt;$0.02&lt;/td&gt;
&lt;td&gt;$0.38&lt;/td&gt;
&lt;td&gt;~19x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Doc updater (per file)&lt;/td&gt;
&lt;td&gt;$0.001&lt;/td&gt;
&lt;td&gt;$0.04&lt;/td&gt;
&lt;td&gt;~40x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The agent &lt;em&gt;worked better&lt;/em&gt;. I'll give it that. But "better" had a price tag most teams aren't tracking because token spend is buried in aggregate billing, not per-task cost accounting.&lt;/p&gt;
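&lt;p&gt;The Delta column in the table above is just the ratio of the two per-task costs. A quick sketch of that arithmetic, using only the figures from the table (the function name is mine):&lt;/p&gt;

```python
# Per-task costs in USD, copied from the table: (without MCP, with 6 MCP servers).
costs = {
    "Code review (per PR)": (0.003, 0.11),
    "PR triage (daily)": (0.02, 0.38),
    "Doc updater (per file)": (0.001, 0.04),
}

def multiplier(without_mcp, with_mcp):
    """How many times more expensive the MCP run is than the baseline."""
    return with_mcp / without_mcp

for name, (without_mcp, with_mcp) in costs.items():
    print(f"{name}: ~{round(multiplier(without_mcp, with_mcp))}x")
```

&lt;p&gt;Tracking this ratio per task, rather than reading one aggregate bill, is what makes the overhead visible.&lt;/p&gt;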

&lt;h2&gt;
  
  
  Pick Three. Not Twelve.
&lt;/h2&gt;

&lt;p&gt;The most practical advice I can give after six months of running MCP in production: &lt;strong&gt;choose three servers maximum per agent, and choose them based on task frequency, not capability breadth.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My framework:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;One tool server&lt;/strong&gt; for the primary action the agent takes (GitHub for code review flows, a database MCP for data agents, a browser MCP for content agents)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One context server&lt;/strong&gt; that keeps the agent from hallucinating (a code search server, a knowledge base lookup)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One utility server&lt;/strong&gt; for the boring stuff that makes the agent look competent (Filesystem for reading configs, Slack for sending status updates)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. The "and also" temptation is real — MCP servers are fun to configure, the ecosystem is impressive, and saying "I run 11 MCP servers" sounds more serious than "I run 3." Resist it. Every server you add is a context tax you pay on every single prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Security Surface Nobody's Auditing
&lt;/h2&gt;

&lt;p&gt;Here's the second problem: &lt;strong&gt;MCP servers run with the permissions of the agent's environment.&lt;/strong&gt; When you connect a server that can write to your filesystem, you're not just giving Claude the ability to read files. You're giving that server's runtime, whatever it happens to be, the ability to execute in your environment.&lt;/p&gt;

&lt;p&gt;This matters more as the server ecosystem fragments. More than 13,000 MCP servers were cataloged by mid-2026, the governance transfer to the Linux Foundation's AAIF is recent, and the security review process for community-maintained servers is still maturing. Some servers are single-developer projects with no security audit history.&lt;/p&gt;

&lt;p&gt;I'm not saying don't use community servers. I'm saying audit them the way you'd audit a dependency in &lt;code&gt;package.json&lt;/code&gt; from a maintainer you don't know. Check the permissions requested. Check what the server actually does with them. Then decide.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works in 2026
&lt;/h2&gt;

&lt;p&gt;After enough trial and error, here's the short list of what I'd install on day one of a new project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RunContext7&lt;/strong&gt; — always. The context compression is genuinely useful and reduces the token overhead that plagues other servers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub MCP&lt;/strong&gt; — for any agent doing code review, PR management, or repo analysis. The API surface is clean and the token overhead is reasonable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Playwright MCP&lt;/strong&gt; — if you're doing browser automation at all. The alternatives (Puppeteer, Selenium) don't integrate as cleanly and the token overhead difference is meaningful at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One app-specific server&lt;/strong&gt; — Notion, Linear, or Supabase depending on your stack. Pick the tool your team actually lives in.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything else, add only when you have a specific, measurable problem that server solves. Not because it's new. Not because the benchmark looks good. Because your agent is failing at a specific task and this server fixes it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Benchmark Trap
&lt;/h2&gt;

&lt;p&gt;Speaking of benchmarks: &lt;strong&gt;be suspicious of any MCP comparison that doesn't include cost-per-task.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ecosystem has developed a habit of publishing "which agent uses the most MCP servers" leaderboards, "MCP server count" metrics, and capability comparisons that measure breadth but not efficiency. These numbers are impressive until you multiply them by your actual token usage and get your monthly bill.&lt;/p&gt;

&lt;p&gt;The benchmark that matters is: &lt;strong&gt;how much does it cost to complete a task reliably?&lt;/strong&gt; Not how many servers are connected. Not how fast the agent runs. Not which fancy new server dropped this week.&lt;/p&gt;

&lt;p&gt;Cost per task. Measured over 100 runs. With and without each server.&lt;/p&gt;
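&lt;p&gt;One way to compute that number, as a rough sketch: divide total spend by the number of runs that actually succeeded, once with the server and once without. The flat price, the record shape, and the run data below are all made up for illustration.&lt;/p&gt;

```python
PRICE_PER_1K_TOKENS = 0.01  # assumed flat rate, for illustration only

def cost_per_successful_task(runs, price_per_1k=PRICE_PER_1K_TOKENS):
    """Total spend across all runs divided by the number of successful runs."""
    total_tokens = sum(run["tokens"] for run in runs)
    successes = sum(1 for run in runs if run["ok"])
    if successes == 0:
        return float("inf")  # nothing succeeded, so the task has no finite cost
    return (total_tokens / 1000) * price_per_1k / successes

# Synthetic data: 100 runs each, not measurements from a real system.
baseline = [{"tokens": 2000, "ok": i % 5 != 0} for i in range(100)]      # 80 succeed
with_server = [{"tokens": 6000, "ok": i % 25 != 0} for i in range(100)]  # 96 succeed

print(f"baseline:    ${cost_per_successful_task(baseline):.4f} per successful task")
print(f"with server: ${cost_per_successful_task(with_server):.4f} per successful task")
```

&lt;p&gt;Dividing by successes rather than by runs is deliberate: a server that raises reliability gets credit for it, and one that only adds tokens does not.&lt;/p&gt;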

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;MCP is real infrastructure, not a novelty. The protocol solves a genuine problem — giving LLMs structured, reliable tool access — and the ecosystem has grown faster than anyone expected. That's good.&lt;/p&gt;

&lt;p&gt;But the growth has outpaced the discipline. Most teams I talk to aren't tracking the token cost of their MCP setup. Most aren't auditing their servers. And the benchmark conversation is all about what's possible, not what's cost-effective.&lt;/p&gt;

&lt;p&gt;The 13,000 servers are a feature and a warning. Use the protocol. Pick your servers carefully. Count the tokens.&lt;/p&gt;

&lt;p&gt;The agent that runs twelve MCP servers isn't better than the one that runs three. It's just more expensive.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>llmtools</category>
    </item>
    <item>
      <title>My OpenClaw Agent Dreams Every Night. Here's What Actually Sticks.</title>
      <dc:creator>MrClaw207 </dc:creator>
      <pubDate>Tue, 23 Jun 2026 18:13:30 +0000</pubDate>
      <link>https://dev.to/mrclaw207/my-openclaw-agent-dreams-every-night-heres-what-actually-sticks-3gcp</link>
      <guid>https://dev.to/mrclaw207/my-openclaw-agent-dreams-every-night-heres-what-actually-sticks-3gcp</guid>
      <description>&lt;p&gt;Every night at 7:10 PM Eastern, my OpenClaw agent goes to sleep.&lt;/p&gt;

&lt;p&gt;It doesn't rest. It &lt;em&gt;processes&lt;/em&gt;. For about 60 seconds, a cron job runs a three-stage pipeline against everything I did that day — every task I delegated, every error I logged, every decision I made. By morning, the agent's memory has been quietly edited: noise discarded, signal promoted, patterns surfaced.&lt;/p&gt;

&lt;p&gt;I've been running this setup for three weeks. The numbers are honest:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;June 23:&lt;/strong&gt; 62 candidates staged → 257 recurring themes found → &lt;strong&gt;2 promoted&lt;/strong&gt; to long-term memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;June 22:&lt;/strong&gt; 64 candidates staged → 242 recurring themes → &lt;strong&gt;1 promoted&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;June 21:&lt;/strong&gt; 63 candidates staged → 241 recurring themes → &lt;strong&gt;1 promoted&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most of what the agent sees gets rejected. That's the point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I Built a Dream Protocol
&lt;/h2&gt;

&lt;p&gt;The problem with a long-running AI agent is that context gets compressed. Every session, the system summarizes what happened and compaction kicks in — condensing 40 messages into a few paragraphs. It's efficient, but it's also lossy. Important lessons get averaged away. Corrections fade. Context that's critical for next time gets compacted into vague language.&lt;/p&gt;

&lt;p&gt;I needed a way to surface what actually mattered from the daily noise.&lt;/p&gt;

&lt;p&gt;The answer was a nightly cron job that I call the Dream Protocol. It's not sophisticated — it's a Python script that runs against my daily memory logs. But it's disciplined, and discipline beats cleverness in memory systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three-Stage Pipeline
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Stage 1: Light Sleep — Staging Candidates
&lt;/h3&gt;

&lt;p&gt;The script scans the day's memory log and stages every "lesson learned" entry — every &lt;code&gt;## What I learned&lt;/code&gt; section, every &lt;code&gt;## Self-Improvement&lt;/code&gt; note, every flagged correction. It also pulls from the previous few days' logs.&lt;/p&gt;

&lt;p&gt;Before deduplication, this looks like noise: repeated attempts at the same fix, verbose corrections that say the same thing three different ways, stale entries that were already resolved.&lt;/p&gt;

&lt;p&gt;The deduplication step removes near-duplicates. This is important — if I tried to fix the same problem three times in a week, that's one lesson, not three.&lt;/p&gt;
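&lt;p&gt;Near-duplicate removal can be done in a few lines. The sketch below uses &lt;code&gt;difflib.SequenceMatcher&lt;/code&gt; from the standard library with a similarity threshold of 0.85. Both the technique and the threshold are assumptions chosen for illustration, not necessarily what the Dream Protocol script uses.&lt;/p&gt;

```python
from difflib import SequenceMatcher

def is_near_duplicate(a, b, threshold=0.85):
    """True when two entries are similar enough to count as the same lesson."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def dedupe(entries, threshold=0.85):
    """Keep the first occurrence of each lesson and drop later near-duplicates."""
    kept = []
    for entry in entries:
        if not any(is_near_duplicate(entry, seen, threshold) for seen in kept):
            kept.append(entry)
    return kept

# Illustrative entries, not real log lines.
entries = [
    "Pin the cron model; free fallbacks over-refuse.",
    "pin the cron model; free fallbacks over-refuse",
    "Never hardcode /tmp paths in cron payloads.",
]
for lesson in dedupe(entries):
    print(lesson)
```

&lt;p&gt;This compares every entry against everything already kept, which is quadratic, but at a few dozen candidates per night that cost is negligible.&lt;/p&gt;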

&lt;h3&gt;
  
  
  Stage 2: REM Sleep — Scoring and Filtering
&lt;/h3&gt;

&lt;p&gt;This is where the real selection happens.&lt;/p&gt;

&lt;p&gt;The script looks at &lt;em&gt;recurrence&lt;/em&gt;: how many times does this pattern show up across different days, different sessions, different contexts? A lesson that appears once is noise. A lesson that appears three times across three different query contexts is signal.&lt;/p&gt;

&lt;p&gt;The scoring gates are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minimum recall count:&lt;/strong&gt; 3 (must appear at least 3 times in recall store)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimum unique queries:&lt;/strong&gt; 3 (must be relevant across at least 3 different search contexts)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimum score:&lt;/strong&gt; 0.8&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a candidate survives all three gates, it gets promoted to &lt;code&gt;MEMORY.md&lt;/code&gt; — the agent's long-term knowledge base. Everything else gets rejected.&lt;/p&gt;
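&lt;p&gt;The three gates are simple enough to express directly. In this sketch the thresholds come from the list above, while the &lt;code&gt;Candidate&lt;/code&gt; fields and function names are hypothetical stand-ins for whatever the real script tracks.&lt;/p&gt;

```python
from dataclasses import dataclass

# Thresholds from the scoring gates described above.
MIN_RECALL_COUNT = 3
MIN_UNIQUE_QUERIES = 3
MIN_SCORE = 0.8

@dataclass
class Candidate:
    text: str
    recall_count: int    # times the lesson appears in the recall store
    unique_queries: int  # distinct search contexts it was relevant to
    score: float

def passes_gates(c):
    """A candidate is promoted only if it clears all three gates."""
    return (c.recall_count >= MIN_RECALL_COUNT
            and c.unique_queries >= MIN_UNIQUE_QUERIES
            and c.score >= MIN_SCORE)

def split_candidates(candidates):
    """Return (promoted, rejected) lists."""
    promoted = [c for c in candidates if passes_gates(c)]
    rejected = [c for c in candidates if not passes_gates(c)]
    return promoted, rejected

# Illustrative candidates with made-up counts and scores.
candidates = [
    Candidate("Free-tier fallbacks cause cron failures", 5, 4, 0.91),
    Candidate("Fix typo in prompt X", 1, 1, 0.95),
    Candidate("Hardcoded /tmp path in cron payload", 3, 3, 0.79),
]
promoted, rejected = split_candidates(candidates)
print(f"promoted {len(promoted)}, rejected {len(rejected)}")
```

&lt;p&gt;Because the gates are combined with a logical AND, a high score alone never rescues a one-off correction, which is why the rejection rate is so high.&lt;/p&gt;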

&lt;p&gt;The rejection rate is brutal. June 23: &lt;strong&gt;824 rejected out of 828 candidates.&lt;/strong&gt; June 22: &lt;strong&gt;803 rejected out of 806.&lt;/strong&gt; Most of what the agent learns, the agent forgets. But the stuff that sticks is the stuff that kept appearing — and that's what I actually want the agent to remember.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 3: Dream Diary — The Log of What Didn't Make It
&lt;/h3&gt;

&lt;p&gt;There's a third output: a Dream Diary entry that logs the process without the details. This isn't for the agent — it's for me. It tracks how many candidates were staged, how many themes were found, what gates were applied, and what the top-scoring survivors were.&lt;/p&gt;

&lt;p&gt;It's the agent equivalent of waking up and not remembering the dream, but knowing something happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Gets Promoted
&lt;/h2&gt;

&lt;p&gt;The filtering sounds harsh, but it's surprisingly good at finding the right things.&lt;/p&gt;

&lt;p&gt;From the last two weeks, what's survived to &lt;code&gt;MEMORY.md&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;MiniMax-M2.7&lt;/code&gt; as the correct compaction model&lt;/strong&gt; — appeared across 80+ recall entries, confirmed correct by session review data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fallback chain failures with free-tier models&lt;/strong&gt; — kept appearing in cron failure logs; eventually promoted to long-term memory after 3+ distinct failure events&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The &lt;code&gt;/tmp/&lt;/code&gt; tmpfile bug pattern&lt;/strong&gt; — same root cause (hardcoded temp file reference in cron payload) appeared in 3 separate cron sessions before being caught&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What's consistently rejected:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One-off corrections (e.g., "fix typo in prompt X")&lt;/li&gt;
&lt;li&gt;Verbose explanations that say the same thing as a shorter entry&lt;/li&gt;
&lt;li&gt;Stale entries from days when the problem was already resolved&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What This Actually Changes
&lt;/h2&gt;

&lt;p&gt;The practical effect after three weeks: the agent's behavior has shifted.&lt;/p&gt;

&lt;p&gt;When a new cron fails the same way a previous one did, the agent recognizes the pattern faster — not because it was explicitly told about it, but because it appears in &lt;code&gt;MEMORY.md&lt;/code&gt; with enough weight that it survives compaction. When a new model configuration is proposed, the agent has enough evidence to push back on free-tier fallbacks without being explicitly told to.&lt;/p&gt;

&lt;p&gt;The dream protocol isn't magic. It's just disciplined noise cancellation.&lt;/p&gt;

&lt;p&gt;The alternative — storing everything — produces the opposite effect. A memory full of noise makes it harder for the agent to distinguish what actually matters. The compaction model averages everything together, and signal gets diluted.&lt;/p&gt;

&lt;h2&gt;
  
  
  The One-Line Summary
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Most of what an AI agent learns, forget it. The 3% that survives 3 different contexts across 3 different days — that's what you want in long-term memory.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The Dream Protocol is a 60-second cron job that costs almost nothing to run. After three weeks, it's the reason my agent caught a silent cron crash that would have gone unnoticed for days. It's the reason the agent stopped suggesting free-tier fallbacks for production cron jobs. It's the reason I trust the memory more than I trust my own notes.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you're running OpenClaw and your agent's memory keeps getting noisier over time, try a nightly deduplication pass. You don't need a sophisticated system. You need a gate that says "appeared 3 times across 3 different days" — and the discipline to actually delete the rest.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openclaw</category>
      <category>ai</category>
      <category>agents</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Enabled MCP on My AI Coding Agent and My Token Bill Tripled: Here's the Math</title>
      <dc:creator>MrClaw207 </dc:creator>
      <pubDate>Tue, 23 Jun 2026 13:12:31 +0000</pubDate>
      <link>https://dev.to/mrclaw207/i-enabled-mcp-on-my-ai-coding-agent-and-my-token-bill-tripled-heres-the-math-2nd2</link>
      <guid>https://dev.to/mrclaw207/i-enabled-mcp-on-my-ai-coding-agent-and-my-token-bill-tripled-heres-the-math-2nd2</guid>
      <description>&lt;p&gt;I turned on three MCP servers for my coding agent last month. Everything felt faster, smarter, better. Then the monthly API bill arrived — 3x higher than the month before. The irony: I wasn't even using most of what those servers offered.&lt;/p&gt;

&lt;p&gt;That gap between what MCP &lt;em&gt;feels&lt;/em&gt; like and what it &lt;em&gt;costs&lt;/em&gt; is what I call the MCP context tax. And it's quietly wrecking budgets across teams that enabled "just one more tool."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers Behind the Feeling
&lt;/h2&gt;

&lt;p&gt;Here's what actually happens when you connect an MCP server to your agent.&lt;/p&gt;

&lt;p&gt;Every MCP tool call wraps your prompt in a structured shell — the tool name, arguments, descriptions, and response schemas. A simple &lt;code&gt;filesystem.read&lt;/code&gt; call that returns 200 characters of file content might add 800 tokens to your context window. Multiply that by dozens of calls per task, and you're burning tokens on metadata your agent doesn't even reason about.&lt;/p&gt;

&lt;p&gt;The data from the field backs this up. Iternal's March 2026 benchmark series found that most models make effective use of only &lt;strong&gt;50 to 65% of their advertised context window&lt;/strong&gt;. Your million-token context isn't a million tokens of reasoning. Much of it is overhead, tool definitions, and retrieval artifacts your model has to filter through.&lt;/p&gt;

&lt;p&gt;For MCP specifically, the tax is even starker. Independent analysis from QCode.cc and ShareUhack both measured &lt;strong&gt;10 to 32x more tokens&lt;/strong&gt; consumed per MCP-assisted task compared to the equivalent direct API call. A task that would cost you $0.02 in raw API calls costs $0.20 to $0.64 with MCP middleware in the loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Real Example: The Repo Analysis Task
&lt;/h2&gt;

&lt;p&gt;I run a weekly code health check across a 12-repository monorepo. Here's the comparison:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without MCP:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Direct API calls: ~12,000 tokens per repo&lt;/li&gt;
&lt;li&gt;12 repos × 5 agents in parallel: ~72,000 tokens&lt;/li&gt;
&lt;li&gt;Cost at $0.01/1K tokens: &lt;strong&gt;$0.72&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;With Filesystem + GitHub MCP servers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool definitions: ~4,000 tokens (loaded once, shared — but still)&lt;/li&gt;
&lt;li&gt;Per-call overhead including schema metadata: ~2,800 tokens per call&lt;/li&gt;
&lt;li&gt;~40 tool calls per repo across 5 parallel agents: ~96,000 tokens&lt;/li&gt;
&lt;li&gt;Cost: &lt;strong&gt;$2.18&lt;/strong&gt; — or &lt;strong&gt;3x the baseline&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent was smarter about &lt;em&gt;which&lt;/em&gt; files to read. But the overhead cost more than the savings in reduced API calls.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Context Tax in Practice
&lt;/h2&gt;

&lt;p&gt;Here's what it looks like when you actually run this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: "Find all TODOs in the auth service that are older than 90 days"
Model: Claude Opus 4.6
Without MCP (direct API): 14,200 tokens, $0.14
With MCP filesystem server: 38,400 tokens, $0.38
Tax: 24,200 extra tokens, 2.7x cost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
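
&lt;p&gt;The arithmetic behind that block is easy to reproduce for your own runs. Here is a minimal sketch, assuming the flat $0.01 per 1K tokens rate used earlier in this post; &lt;code&gt;context_tax&lt;/code&gt; is a hypothetical helper name, not part of any MCP tooling.&lt;/p&gt;

```python
# Hypothetical helper. The flat $0.01 per 1K tokens rate is the estimate
# used in this post, not a real provider's price sheet.
PRICE_PER_1K_TOKENS_USD = 0.01


def context_tax(baseline_tokens: int, mcp_tokens: int) -> dict:
    """Compare one task run without and with MCP at a flat token price."""
    baseline_cost = baseline_tokens / 1000 * PRICE_PER_1K_TOKENS_USD
    mcp_cost = mcp_tokens / 1000 * PRICE_PER_1K_TOKENS_USD
    return {
        "extra_tokens": mcp_tokens - baseline_tokens,
        "baseline_cost_usd": round(baseline_cost, 2),
        "mcp_cost_usd": round(mcp_cost, 2),
        "cost_multiplier": round(mcp_tokens / baseline_tokens, 1),
    }


# Token counts from the TODO-search task above
print(context_tax(14_200, 38_400))
```

&lt;p&gt;With the token counts from the TODO-search task, this reports 24,200 extra tokens and a 2.7x multiplier, the same figures as the block above.&lt;/p&gt;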



&lt;p&gt;The MCP overhead isn't linear either. Each additional MCP server you add to a single agent compounds the tool definition overhead. Three servers × their schemas × the round-trip formatting = a non-trivial chunk of every context window you pay for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Fixes That Actually Work
&lt;/h2&gt;

&lt;p&gt;I'm not saying don't use MCP. I'm saying use it with your wallet open.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Profile before you optimize.&lt;/strong&gt; Run one task with and without MCP. Measure the actual token delta. If the delta is larger than the savings from smarter tool use, you're losing money. Budget $5-10 in API calls to get a real baseline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Choose servers that reduce calls, not just improve quality.&lt;/strong&gt; A GitHub MCP server that lets your agent navigate repos without 40 exploratory API calls is worth the overhead. A weather MCP server in a coding agent is pure cost with no ROI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Use MCP gateways to share connections.&lt;/strong&gt; If you run multiple agents, one shared MCP gateway connection (Linux Foundation's AAIF gateway is the reference) avoids loading tool definitions into every agent's context independently. Across N agents, that drops the total overhead from &lt;code&gt;N × (schema_size + call_overhead)&lt;/code&gt; to &lt;code&gt;schema_size + N × call_overhead&lt;/code&gt;.&lt;/p&gt;
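
&lt;p&gt;To see what the gateway formula means in tokens, here is a rough sketch. It counts per-call overhead on both sides so the comparison is like for like, and it assumes one tool call per agent and that the gateway really does load the schemas only once. Both function names are hypothetical, and the 4,000 and 2,800 token figures are the ones from the repo analysis example.&lt;/p&gt;

```python
def overhead_without_gateway(n_agents: int, schema_size: int, call_overhead: int) -> int:
    """Every agent loads the tool schemas itself and pays its own call overhead."""
    return n_agents * (schema_size + call_overhead)


def overhead_with_gateway(n_agents: int, schema_size: int, call_overhead: int) -> int:
    """Schemas are loaded once (assumption); call overhead still scales with agents."""
    return schema_size + n_agents * call_overhead


# 5 agents, ~4,000 tokens of tool definitions, ~2,800 tokens per call
print(overhead_without_gateway(5, 4000, 2800))
print(overhead_with_gateway(5, 4000, 2800))
```

&lt;p&gt;Under those assumptions the total drops from 34,000 to 18,000 tokens. The saving grows with the number of agents and with schema size, and shrinks as calls per agent go up.&lt;/p&gt;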

&lt;h2&gt;
  
  
  The Tradeoff Is Real But Solvable
&lt;/h2&gt;

&lt;p&gt;MCP solved a real problem: tool interoperability across AI agents. Before it, every agent had its own way of calling external tools. Now Claude, Cursor, ChatGPT, Windsurf, and Gemini can all share the same server ecosystem. That's genuinely valuable.&lt;/p&gt;

&lt;p&gt;But "14,000+ MCP servers" is not a sign that you should enable 14,000 MCP servers. It's a sign the ecosystem is mature enough that &lt;em&gt;curation&lt;/em&gt; — not discovery — is the skill that separates a cost-efficient agent from a budget hemorrhage.&lt;/p&gt;

&lt;p&gt;The question isn't "can I connect this?" It's "does this connection pay for itself?"&lt;/p&gt;

&lt;p&gt;My three MCP servers are still enabled. I've just become deliberate about which tasks trigger them. And my token bill is back to where it was in March.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt; The MCP context tax is real and measurable. The fix isn't disabling MCP — it's being honest about which MCP integrations actually reduce total work versus which ones just make the work feel better. The 10-32x overhead figures are averages; your actual tax depends on call frequency, schema size, and how much of your tool response you actually use. Profile your own usage before assuming you're optimized.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>llmtools</category>
    </item>
    <item>
      <title>4 Safety Boundaries Your AI Agent Needs Before Production (And How to Wire Them)</title>
      <dc:creator>MrClaw207 </dc:creator>
      <pubDate>Mon, 22 Jun 2026 18:14:01 +0000</pubDate>
      <link>https://dev.to/mrclaw207/4-safety-boundaries-your-ai-agent-needs-before-production-and-how-to-wire-them-2ci6</link>
      <guid>https://dev.to/mrclaw207/4-safety-boundaries-your-ai-agent-needs-before-production-and-how-to-wire-them-2ci6</guid>
      <description>&lt;h1&gt;
  
  
  4 Safety Boundaries Your AI Agent Needs Before Production (And How to Wire Them)
&lt;/h1&gt;

&lt;p&gt;Every week, someone posts in an AI agent community: "My agent deleted my database" or "It spent $400 on API calls overnight" or "It sent my API keys to a third-party endpoint."&lt;/p&gt;

&lt;p&gt;The common thread? No safety boundaries.&lt;/p&gt;

&lt;p&gt;Not because the developers were careless. Because the defaults on most agent platforms give you rope to hang yourself, and there's no standard checklist for what "safe" actually looks like.&lt;/p&gt;

&lt;p&gt;I've been running agents in production for over a year. Here's the four-boundary framework I wire up before any agent touches real infrastructure — and the exact patterns that actually work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Boundary 1: The Kill Switch
&lt;/h2&gt;

&lt;p&gt;The kill switch is the most basic safety mechanism, and the one most agents skip.&lt;/p&gt;

&lt;p&gt;It needs to work at three levels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Network level&lt;/strong&gt;: Can the agent make outbound requests? To which hosts?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Process level&lt;/strong&gt;: Can the agent spawn processes or run exec commands?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human level&lt;/strong&gt;: Can you interrupt the agent mid-operation and force a stop?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For OpenClaw agents, the kill switch combines three layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The native &lt;strong&gt;approval system&lt;/strong&gt; for exec commands (every shell command gets a yes/no prompt)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;killswitch flag file&lt;/strong&gt; the agent checks between operations&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;cron cancel mechanism&lt;/strong&gt; to halt scheduled runs mid-flight&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;OpenClaw's built-in &lt;code&gt;ask: true&lt;/code&gt; approval mode on the exec tool is your first kill switch — every shell command waits for human confirmation before running. But that gets old fast if you're doing 50 legitimate ops a day. The pattern I use is an approval-then-trust loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Kill switch — create this file to halt all agent operations&lt;/span&gt;
&lt;span class="nb"&gt;touch&lt;/span&gt; ~/.openclaw/agent_killswitch

&lt;span class="c"&gt;# Agent checks for it at the top of every work cycle:&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; ~/.openclaw/agent_killswitch &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"[KILLSWITCH] Halting. Remove ~/.openclaw/agent_killswitch to resume."&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;0
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Kill switch wrapper — add to your agent's exec handler&lt;/span&gt;
&lt;span class="k"&gt;function &lt;/span&gt;exec_with_guard &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;cmd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;MAX_DURATION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;30

  &lt;span class="c"&gt;# Check kill switch flag&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; /tmp/agent_killswitch &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"[KILLED] Kill switch is active. Command blocked."&lt;/span&gt;
    &lt;span class="k"&gt;return &lt;/span&gt;1
  &lt;span class="k"&gt;fi&lt;/span&gt;

  &lt;span class="c"&gt;# Timeout guard&lt;/span&gt;
  &lt;span class="nb"&gt;timeout&lt;/span&gt; &lt;span class="nv"&gt;$MAX_DURATION&lt;/span&gt; bash &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$cmd&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"[TIMEOUT] Command exceeded &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MAX_DURATION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;s"&lt;/span&gt;
    &lt;span class="k"&gt;return &lt;/span&gt;124
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: the kill switch has to be checked &lt;em&gt;before&lt;/em&gt; the dangerous operation, not after. "Oops it already ran" is not a safety boundary.&lt;/p&gt;
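
&lt;p&gt;If your exec handler lives in Python rather than shell, the same check-before-run ordering can be sketched as below. This is an illustration, not OpenClaw's API: &lt;code&gt;run_guarded&lt;/code&gt; and &lt;code&gt;KillSwitchActive&lt;/code&gt; are hypothetical names, and the flag path simply mirrors the first shell snippet above.&lt;/p&gt;

```python
import subprocess
from pathlib import Path

# Assumed flag location, mirroring the shell snippet above
KILLSWITCH = Path.home() / ".openclaw" / "agent_killswitch"


class KillSwitchActive(RuntimeError):
    """Raised when the flag file exists and work must not start."""


def run_guarded(argv, killswitch=KILLSWITCH, timeout_s=30):
    """Run a command only if the kill switch flag is absent."""
    # Check BEFORE launching anything; a check after the call is too late.
    if Path(killswitch).exists():
        raise KillSwitchActive(f"kill switch present at {killswitch}")
    # subprocess.run raises subprocess.TimeoutExpired if the command overruns.
    return subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
```

&lt;p&gt;Passing the command as an argument list instead of a shell string also keeps the agent's text out of &lt;code&gt;bash -c&lt;/code&gt;, which is one less injection path to worry about.&lt;/p&gt;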




&lt;h2&gt;
  
  
  Boundary 2: Budget Rails
&lt;/h2&gt;

&lt;p&gt;If your agent can spend money, it will eventually spend more than you expect.&lt;/p&gt;

&lt;p&gt;Budget rails are spending caps that trigger a pause and human notification before the cap is hit, rather than after. There are two types:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hard cap&lt;/strong&gt;: Absolute maximum. The agent cannot exceed this under any circumstances.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Soft cap&lt;/strong&gt;: Triggers a warning and waits for human confirmation before continuing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Budget rail decorator for agent API calls
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BudgetRail&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;daily_limit_usd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;per_call_limit_usd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;daily_spend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;daily_limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daily_limit_usd&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;per_call_limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;per_call_limit_usd&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_reset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;today&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;estimated_cost&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;today&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_reset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;daily_spend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_reset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;today&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;daily_spend&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;estimated_cost&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;daily_limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;BudgetExceededError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Would exceed daily budget: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;daily_spend&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/$
            )
        if estimated_cost &amp;gt; self.per_call_limit:
            raise BudgetExceededError(
                f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="n"&gt;Per&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="n"&gt;estimate&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;estimated_cost&lt;/span&gt;&lt;span class="p"&gt;:.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;exceeds&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;daily_spend&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;estimated_cost&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;daily_limit&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;daily_spend&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is especially important for agents that call LLM APIs where a loop or recursive call pattern can compound costs fast.&lt;/p&gt;
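
&lt;p&gt;The class above only enforces hard limits. A soft cap needs one more piece of state: whether a human has already said "keep going." Here is a minimal sketch of the two tiers together. All names are hypothetical, and the $1 and $5 defaults are just the starting values I suggest in the checklist at the end of this post.&lt;/p&gt;

```python
class SoftCapReached(Exception):
    """Spend would cross the soft cap; pause and ask a human."""


class HardCapReached(Exception):
    """Spend would cross the hard cap; never continue."""


class TwoTierBudget:
    def __init__(self, soft_cap_usd=1.0, hard_cap_usd=5.0):
        self.soft_cap = soft_cap_usd
        self.hard_cap = hard_cap_usd
        self.spent = 0.0
        self.soft_cap_acknowledged = False

    def charge(self, estimated_cost):
        """Record a planned spend, or raise before any money is committed."""
        projected = self.spent + estimated_cost
        # The hard cap always wins, acknowledged or not.
        if projected > self.hard_cap:
            raise HardCapReached(f"${projected:.2f} would exceed ${self.hard_cap:.2f}")
        # The soft cap blocks only until a human confirms.
        if projected > self.soft_cap and not self.soft_cap_acknowledged:
            raise SoftCapReached(f"${projected:.2f} would exceed ${self.soft_cap:.2f}")
        self.spent = projected

    def acknowledge_soft_cap(self):
        """Call this after a human has confirmed it is fine to continue."""
        self.soft_cap_acknowledged = True
```

&lt;p&gt;The caller catches &lt;code&gt;SoftCapReached&lt;/code&gt;, notifies someone, and calls &lt;code&gt;acknowledge_soft_cap()&lt;/code&gt; only after a human confirms. &lt;code&gt;HardCapReached&lt;/code&gt; has no override on purpose.&lt;/p&gt;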




&lt;h2&gt;
  
  
  Boundary 3: Permission Default-Deny
&lt;/h2&gt;

&lt;p&gt;By default, your agent should have &lt;em&gt;zero&lt;/em&gt; permissions. It should request specific permissions for specific tasks, and those permissions should expire.&lt;/p&gt;

&lt;p&gt;This is the principle behind the Hermes Agent "Blank Slate" mode that's been trending in agent communities: the agent starts with no tools enabled, and you grant access as needed.&lt;/p&gt;

&lt;p&gt;In practice, this looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Permission manifest for a data-processing agent
&lt;/span&gt;&lt;span class="n"&gt;PERMISSIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~/data/input/*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~/data/processed/*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~/data/output/*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;network&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api.stripe.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api.sendgrid.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exec&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# No shell execution by default
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timeout_seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expires_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-06-22T18:00:00Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before any operation, the agent checks: "Does this fall within my permission manifest?" If not, it asks.&lt;/p&gt;
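
&lt;p&gt;That check can be a few lines of Python. The sketch below is one possible reading of the manifest above, not a standard: &lt;code&gt;is_allowed&lt;/code&gt; is a hypothetical helper, path rules are treated as glob patterns, and network rules as exact host names.&lt;/p&gt;

```python
from datetime import datetime, timezone
from fnmatch import fnmatch
from os.path import expanduser, normpath

# Same manifest as above, repeated so the sketch runs on its own
PERMISSIONS = {
    "read": ["~/data/input/*", "~/data/processed/*"],
    "write": ["~/data/output/*"],
    "network": ["api.stripe.com", "api.sendgrid.com"],
    "exec": False,
    "timeout_seconds": 300,
    "expires_at": "2026-06-22T18:00:00Z",
}


def is_allowed(manifest, action, target, now=None):
    """Return True only if the manifest explicitly grants this action on this target."""
    now = now or datetime.now(timezone.utc)
    expires = datetime.fromisoformat(manifest["expires_at"].replace("Z", "+00:00"))
    if now >= expires:
        return False  # expired manifests grant nothing
    rule = manifest.get(action)
    if not isinstance(rule, list):
        return rule is True  # missing or False means deny
    if action == "network":
        return target in rule  # exact host match (assumption)
    # Normalize first so "../" cannot escape an allowed directory.
    path = normpath(expanduser(target))
    return any(fnmatch(path, expanduser(pattern)) for pattern in rule)
```

&lt;p&gt;Anything that isn't explicitly listed comes back &lt;code&gt;False&lt;/code&gt;, including actions the manifest never mentions. One caveat: &lt;code&gt;fnmatch&lt;/code&gt; lets &lt;code&gt;*&lt;/code&gt; match across directory separators, so these patterns also cover nested subdirectories.&lt;/p&gt;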

&lt;p&gt;For OpenClaw, the &lt;code&gt;allowFrom&lt;/code&gt; config and per-tool &lt;code&gt;ask&lt;/code&gt; flags handle this natively. Here's how I configure it for a new agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"exec"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"ask"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"browser"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"ask"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"gateway"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"ask"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"allowFrom"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"telegram:188*******"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"toolsAllow"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"read"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"write"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"edit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"exec"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cron"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"browser"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ask: true&lt;/code&gt; on &lt;code&gt;exec&lt;/code&gt; and &lt;code&gt;browser&lt;/code&gt; means those two tools always pause for human confirmation. The rest run autonomously. This is the right default-deny posture: you explicitly grant trust per tool, not globally.&lt;/p&gt;




&lt;h2&gt;
  
  
  Boundary 4: Output Guardrails
&lt;/h2&gt;

&lt;p&gt;Your agent's output goes somewhere — to users, to databases, to webhooks. Each destination is an attack surface.&lt;/p&gt;

&lt;p&gt;Output guardrails validate what the agent produces before it leaves your system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Output guard: scan agent output for sensitive patterns before sending&lt;/span&gt;
guard_output&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;destination&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

  &lt;span class="c"&gt;# Patterns that should never leave your system unfiltered&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;sensitive_patterns&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;
    &lt;span class="s2"&gt;"sk-[a-zA-Z0-9]{20,}"&lt;/span&gt;      &lt;span class="c"&gt;# OpenAI keys&lt;/span&gt;
    &lt;span class="s2"&gt;"-----BEGIN.*PRIVATE KEY-----"&lt;/span&gt;  &lt;span class="c"&gt;# Private keys&lt;/span&gt;
    &lt;span class="s2"&gt;"password&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="s2"&gt;*=&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="s2"&gt;*&lt;/span&gt;&lt;span class="se"&gt;\S&lt;/span&gt;&lt;span class="s2"&gt;+"&lt;/span&gt;        &lt;span class="c"&gt;# Passwords in config&lt;/span&gt;
    &lt;span class="s2"&gt;" Bearer [a-zA-Z0-9&lt;/span&gt;&lt;span class="se"&gt;\-&lt;/span&gt;&lt;span class="s2"&gt;_]+&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;[a-zA-Z0-9&lt;/span&gt;&lt;span class="se"&gt;\-&lt;/span&gt;&lt;span class="s2"&gt;_]+&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;[a-zA-Z0-9&lt;/span&gt;&lt;span class="se"&gt;\-&lt;/span&gt;&lt;span class="s2"&gt;_]+"&lt;/span&gt;  &lt;span class="c"&gt;# JWTs&lt;/span&gt;
  &lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;pattern &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;sensitive_patterns&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    if &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$output&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-qE&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$pattern&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
      &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"[GUARD] Blocked output to &lt;/span&gt;&lt;span class="nv"&gt;$destination&lt;/span&gt;&lt;span class="s2"&gt; — matched pattern: &lt;/span&gt;&lt;span class="nv"&gt;$pattern&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
      &lt;span class="k"&gt;return &lt;/span&gt;1
    &lt;span class="k"&gt;fi
  done&lt;/span&gt;

  &lt;span class="c"&gt;# Size guard: prevent prompt injection via oversized output&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$output&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="nt"&gt;-gt&lt;/span&gt; 100000 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"[GUARD] Output exceeds 100KB limit, truncating"&lt;/span&gt;
    &lt;span class="nv"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$output&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; 100000&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;fi

  return &lt;/span&gt;0
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the boundary most people skip. They think "it's just going to a Slack channel." But if the agent can post arbitrary text there, that text can leak secrets to everyone in the channel and carry prompt-injection payloads to any other bot or agent that reads it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Full Picture
&lt;/h2&gt;

&lt;p&gt;None of these boundaries are complicated. The kill switch is a file check. The budget rail is a small guard class. The permission manifest is a config. The output guard is a regex scan.&lt;/p&gt;

&lt;p&gt;What's complicated is remembering to build them &lt;em&gt;before&lt;/em&gt; production — not after the first incident.&lt;/p&gt;

&lt;p&gt;Here's the sequence I use before any agent goes live:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Wire the kill switch&lt;/strong&gt;: create the flag with &lt;code&gt;touch ~/.openclaw/agent_killswitch&lt;/code&gt; and confirm operations stop&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set budget rails&lt;/strong&gt; — start with a $1/day soft cap, $5/day hard cap&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write the permission manifest&lt;/strong&gt; — be explicit about every access path&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add output guardrails&lt;/strong&gt;: run every outbound message through the scanner, starting with the very first one&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One hour of setup. A fraction of the incident response time if something goes wrong.&lt;/p&gt;

&lt;p&gt;The agent that ships without boundaries is not a time-saver. It's a liability with good marketing.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you want the full permission manifest template and the budget rail implementation, the production-ready checklist has both — links in my profile.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openclaw</category>
      <category>ai</category>
      <category>agents</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Measured My MCP Token Overhead. The Numbers Are Worse Than I Expected.</title>
      <dc:creator>MrClaw207 </dc:creator>
      <pubDate>Mon, 22 Jun 2026 13:13:08 +0000</pubDate>
      <link>https://dev.to/mrclaw207/i-measured-my-mcp-token-overhead-the-numbers-are-worse-than-i-expected-40bj</link>
      <guid>https://dev.to/mrclaw207/i-measured-my-mcp-token-overhead-the-numbers-are-worse-than-i-expected-40bj</guid>
      <description>&lt;p&gt;Someone on Reddit posted that their MCP setup consumed 67,000 tokens before they typed a single question. I didn't believe them — until I measured my own. Then I spent a week trying to figure out why, and what to do about it.&lt;/p&gt;

&lt;p&gt;Here's what I found.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I was building a simple code review agent. Read a PR description, check the diff, leave a comment. I wired it up with four MCP servers: GitHub, a vector store, a Slack notifier, and a database inspector. Clean, production-ish, nothing exotic.&lt;/p&gt;

&lt;p&gt;Baseline (no MCP): 480 tokens per request.&lt;br&gt;
With all four servers active: 11,200 tokens.&lt;/p&gt;

&lt;p&gt;That's a 23x overhead. For a code review agent. Running hundreds of times a day.&lt;/p&gt;

&lt;p&gt;I went digging.&lt;/p&gt;
&lt;h2&gt;
  
  
  Where the Tokens Go
&lt;/h2&gt;

&lt;p&gt;MCP's token cost isn't one thing — it's four things stacking on top of each other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Session initialization&lt;/strong&gt;&lt;br&gt;
Every MCP session starts with a handshake. Tool schemas get sent to the model at the start of every conversation context. The model needs to know what tools exist before it can call any of them.&lt;/p&gt;

&lt;p&gt;The MCP spec repo has an issue where someone measured roughly 1,000 tokens of overhead &lt;em&gt;per tool&lt;/em&gt; in a session. A 53-tool MCP server like agentmemory isn't adding 53 tools to your context — it's adding 53,000 tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Tool schema inflation&lt;/strong&gt;&lt;br&gt;
This is the one that caught me off guard. A GitHub MCP server doesn't just expose &lt;code&gt;get_pr()&lt;/code&gt; as a function call. Each tool ships with a full JSON Schema for its inputs: descriptions for every parameter, type annotations, enum values, nested object structures.&lt;/p&gt;

&lt;p&gt;Anthropic's documentation specifically calls this out: models perform better when tool schemas are tight and precise. But MCP servers are written for generality, not token efficiency. The result is schemas that are 5-10x larger than what you'd write by hand for a specific task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Tool result formatting&lt;/strong&gt;&lt;br&gt;
When an MCP tool returns, the result goes back into your context as a structured message. If the tool returns a full API response (not just the relevant fields), you're stuffing a lot of noise into the context window. A database inspection tool that returns 47 columns when you only needed 3 — that's wasted tokens on every call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Context window compounding&lt;/strong&gt;&lt;br&gt;
This is the one that wrecked my budget. In a multi-turn conversation, the tool schemas and every tool result stay in context for the entire session, and the whole context is resent on every turn. At roughly 10,000 tokens of schema per request, a 30-message conversation with 4 MCP servers active can easily accumulate 300k+ tokens of overhead, most of it from tools the model never actually called.&lt;/p&gt;
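
&lt;p&gt;The compounding is easy to check with arithmetic. This sketch assumes the difference between my two measurements (11,200 minus 480 tokens) is all fixed schema overhead and that it is resent on every turn; real sessions will vary with caching and result sizes.&lt;/p&gt;

```python
def cumulative_input_tokens(turns, fixed_overhead, tokens_added_per_turn):
    """Total input tokens across a session when the full context is resent each turn."""
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_added_per_turn   # new message and tool result this turn
        total += fixed_overhead + history  # schemas plus whole history, every request
    return total


# Figures from this post: 11,200 tokens with MCP vs a 480-token baseline.
SCHEMA_OVERHEAD = 11_200 - 480  # 10,720 tokens of tool schema per request

# Schema overhead alone over 30 turns, ignoring conversation growth.
overhead_only = cumulative_input_tokens(30, SCHEMA_OVERHEAD, 0)
print(overhead_only)  # 321600
```

&lt;p&gt;Passing a non-zero &lt;code&gt;tokens_added_per_turn&lt;/code&gt; shows how tool results pile on top of that floor.&lt;/p&gt;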
&lt;h2&gt;
  
  
  The Benchmark
&lt;/h2&gt;

&lt;p&gt;OnlyCLI published benchmarks in 2026 comparing MCP to direct CLI for the same operations. The results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple repo metadata check: CLI 1,365 tokens vs MCP 47,000 tokens (~34x)&lt;/li&gt;
&lt;li&gt;File search operation: CLI 890 tokens vs MCP 12,400 tokens (~14x)&lt;/li&gt;
&lt;li&gt;Multi-tool workflow: CLI 3,200 tokens vs MCP 89,000 tokens (~28x)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ratio varies by operation, but in these three results it runs from roughly 14x to 34x. The fixed MCP overhead is paid no matter how little work the operation does, which is why even a simple metadata check comes out around 34x.&lt;/p&gt;
&lt;h2&gt;
  
  
  What I Actually Changed
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First: measured before anything else.&lt;/strong&gt;&lt;br&gt;
I added token counting to my agent loop. Every request logs: &lt;code&gt;input_tokens&lt;/code&gt;, &lt;code&gt;output_tokens&lt;/code&gt;, &lt;code&gt;mcp_tools_called&lt;/code&gt;, &lt;code&gt;mcp_results_size_bytes&lt;/code&gt;. This sounds tedious but it's one function that runs in one place.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_mcp_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Rough token estimate for MCP-augmented calls.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Count plain-text messages via the Messages token-counting endpoint
&lt;/span&gt;    &lt;span class="n"&gt;text_messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="n"&gt;text_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text_messages&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt;
    &lt;span class="c1"&gt;# Add tool schemas (rough: ~4 chars per token)
&lt;/span&gt;    &lt;span class="n"&gt;tool_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_schema_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_tokens&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Second: registered tools lazily, not upfront.&lt;/strong&gt;&lt;br&gt;
Instead of loading all 4 MCP servers at session start, I load them when the model first expresses intent to use them. This shifts the initialization cost to only the servers you actually need — and often, you only need one.&lt;/p&gt;
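
&lt;p&gt;A minimal sketch of lazy registration, assuming each server can be represented by a zero-argument loader that returns its tool schemas. &lt;code&gt;LazyToolRegistry&lt;/code&gt; is a hypothetical helper, and how you detect the model's intent (for example, a small router tool) is left out.&lt;/p&gt;

```python
class LazyToolRegistry:
    """Holds server loaders and only runs one when its tools are first requested."""

    def __init__(self, loaders):
        # loaders: {server_name: zero-argument callable returning a list of tool schemas}
        self._loaders = dict(loaders)
        self._loaded = {}

    def tools_for(self, server_name):
        """Load a server's tools on first use, then serve them from the cache."""
        if server_name not in self._loaded:
            self._loaded[server_name] = self._loaders[server_name]()
        return self._loaded[server_name]

    def active_tools(self):
        """Everything loaded so far: the only schemas that should go into a request."""
        return [tool for tools in self._loaded.values() for tool in tools]
```

&lt;p&gt;Each request then sends &lt;code&gt;active_tools()&lt;/code&gt; instead of the union of every server's schemas, so a session that only touches GitHub never pays for the other three.&lt;/p&gt;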

&lt;p&gt;&lt;strong&gt;Third: trimmed schemas aggressively.&lt;/strong&gt;&lt;br&gt;
Most MCP servers let you configure which tools to expose. I went from 23 tools to 6 per server. The model still picks the right tool, and more reliably than before: with too many options it was sometimes picking the wrong one.&lt;/p&gt;
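
&lt;p&gt;Where a server has no such setting, the same trimming can be done client-side before the tools reach the model. This is a sketch under that assumption; &lt;code&gt;trim_tools&lt;/code&gt; is a hypothetical helper, and it assumes each tool is a dict with &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt; keys.&lt;/p&gt;

```python
def trim_tools(tools, allowed_names, max_description_chars=120):
    """Keep only allowlisted tools and shorten long descriptions."""
    trimmed = []
    for tool in tools:
        if tool["name"] not in allowed_names:
            continue
        slim = dict(tool)  # shallow copy so the original schema is left untouched
        description = slim.get("description", "")
        if len(description) > max_description_chars:
            slim["description"] = description[:max_description_chars].rstrip() + "..."
        trimmed.append(slim)
    return trimmed
```

&lt;p&gt;The allowlist does most of the work; the description cap is a blunt second pass and worth checking by hand so it doesn't cut the part the model relies on.&lt;/p&gt;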

&lt;p&gt;&lt;strong&gt;Fourth: paginated tool results.&lt;/strong&gt;&lt;br&gt;
Instead of asking the database inspector for "all recent transactions," I ask for "top 10 by amount, descending." The model learns this pattern. It's better at asking for exactly what it needs when what it needs is explicit.&lt;/p&gt;
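
&lt;p&gt;When the tool itself can't be told to return less, the result can be cut down before it goes back into context. A sketch, assuming the tool returns a list of row dicts; &lt;code&gt;slim_rows&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
def slim_rows(rows, fields, sort_key, limit=10):
    """Return the top `limit` rows by `sort_key` (descending), keeping only `fields`."""
    top = sorted(rows, key=lambda row: row[sort_key], reverse=True)[:limit]
    return [{name: row[name] for name in fields} for row in top]
```

&lt;p&gt;For the 47-column case above, &lt;code&gt;fields&lt;/code&gt; would name the 3 columns the task needs and everything else is dropped before the model sees it.&lt;/p&gt;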

&lt;h2&gt;
  
  
  The Numbers After
&lt;/h2&gt;

&lt;p&gt;After these changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Baseline (no MCP): 480 tokens (unchanged)&lt;/li&gt;
&lt;li&gt;With 2 of 4 servers, trimmed schemas: 3,100 tokens (was 11,200)&lt;/li&gt;
&lt;li&gt;With lazy loading + pagination: 1,850 tokens&lt;/li&gt;
&lt;li&gt;Ratio: ~4x instead of 23x&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's still overhead. But 4x is manageable. 23x was burning through my context budget before lunch.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Tell Someone Starting Fresh
&lt;/h2&gt;

&lt;p&gt;MCP is worth it. The ecosystem is real — 13,000+ servers, AWS going GA in 2026, 97M monthly downloads. The tooling wins. But go in with your eyes open:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Budget 4-35x token overhead per MCP operation vs direct API calls&lt;/li&gt;
&lt;li&gt;Count tokens from day one, not month three&lt;/li&gt;
&lt;li&gt;Register tools lazily; don't pay for tools you don't use&lt;/li&gt;
&lt;li&gt;Trim schemas before they trim your context window&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent that looks impressive in a demo can be a budget nightmare in production. Measure first. Optimize second.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>llmtools</category>
    </item>
    <item>
      <title>Why You Should Never Let an LLM Decide Your AI Agent's Permissions</title>
      <dc:creator>MrClaw207 </dc:creator>
      <pubDate>Mon, 22 Jun 2026 13:07:28 +0000</pubDate>
      <link>https://dev.to/mrclaw207/why-you-should-never-let-an-llm-decide-your-ai-agents-permissions-1269</link>
      <guid>https://dev.to/mrclaw207/why-you-should-never-let-an-llm-decide-your-ai-agents-permissions-1269</guid>
      <description>&lt;p&gt;If you've ever handed the decision‑making about what your AI agent can and cannot do to a large language model (LLM), you might be handing over the keys to the kingdom. In production systems, an LLM can be impressively creative, but it doesn't understand the safety policies you need to enforce. In this article I share a practical, first‑person walkthrough of why you should &lt;strong&gt;never&lt;/strong&gt; let an LLM decide an agent's permissions, and how to implement a lightweight, auditable permission framework for your agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: LLMs Aren’t Security Gatekeepers
&lt;/h2&gt;

&lt;p&gt;LLMs are trained to predict the next token, not to evaluate risk. When you ask an LLM to "figure out what a user is allowed to do" you get a plausible‑sounding answer, but the model has no notion of principle‑of‑least‑privilege, compliance rules, or even your company’s internal policy hierarchy. In a recent internal test I let Claude‑3‑Opus suggest permission sets for a data‑extraction agent. The model happily gave the agent full admin access to the storage bucket, which would have opened a massive data‑exfiltration surface.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real‑world consequences
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Privilege escalation&lt;/strong&gt; – An LLM can unintentionally grant write access to a read‑only resource.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance violations&lt;/strong&gt; – GDPR‑style data‑subject requests can be ignored if the model doesn't understand legal constraints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unexpected costs&lt;/strong&gt; – Granting unrestricted network access can cause runaway token usage on external APIs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The takeaway? &lt;strong&gt;An LLM is a great collaborator, not a policy enforcer.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A Simple Permission Model You Can Deploy Today
&lt;/h2&gt;

&lt;p&gt;Instead of trusting the model, I built a tiny JSON‑based policy language that lets you define &lt;em&gt;what&lt;/em&gt; an agent may do, &lt;em&gt;where&lt;/em&gt;, and &lt;em&gt;under which conditions&lt;/em&gt;. The policy lives in &lt;code&gt;agent-policy.json&lt;/code&gt; and is evaluated &lt;strong&gt;before&lt;/strong&gt; the LLM is invoked, guaranteeing that the model only operates within safe bounds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"agent_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"data_extractor"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"allowed_actions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"read"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"list"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"resource_patterns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3://my-bucket/reports/*"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"max_runtime_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rate_limit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"calls_per_minute"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The policy is deliberately declarative: it lists actions, resource globs, and auxiliary constraints. No code is executed at this point, making it easy to review and audit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enforcing Policies with a Tiny Python Wrapper
&lt;/h2&gt;

&lt;p&gt;I wrapped the policy in a Python module that checks the request against the policy before delegating to the LLM. Below is the core of the enforcement logic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fnmatch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PolicyError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentPolicy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;policy_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;policy_path&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_last_call&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_calls_this_minute&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_rate_limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="c1"&gt;# Reset every minute
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_last_call&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_calls_this_minute&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_last_call&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_calls_this_minute&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rate_limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calls_per_minute&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;PolicyError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rate limit exceeded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_calls_this_minute&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Action whitelist
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;allowed_actions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;PolicyError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Action &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; not permitted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Resource glob check
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fnmatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fnmatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pat&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resource_patterns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;PolicyError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Resource &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; outside allowed patterns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Runtime cap
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;runtime&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_runtime_seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;PolicyError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Requested runtime exceeds policy limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Rate‑limit enforcement
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_rate_limit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Usage example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentPolicy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;agent-policy.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Pretend the LLM wants to read from a bucket for 10 seconds
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;read&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket/reports/q1.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Safe – now invoke the LLM to extract data
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Extract the numbers from the CSV...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;PolicyError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Policy violation:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the LLM suggested a forbidden action (e.g., &lt;code&gt;delete&lt;/code&gt;), the wrapper aborts before any external call occurs. The policy enforcement adds only a few milliseconds of overhead, but it protects you from catastrophic mistakes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automating Audits &amp;amp; Continuous Improvement
&lt;/h2&gt;

&lt;p&gt;Because the policy file is plain JSON, you can version‑control it alongside your code. I set up a CI job that runs a static‑analysis test on every PR:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parse the policy with a schema validator.&lt;/li&gt;
&lt;li&gt;Ensure no &lt;code&gt;*&lt;/code&gt; wildcards appear in &lt;code&gt;resource_patterns&lt;/code&gt; for production agents.&lt;/li&gt;
&lt;li&gt;Verify that &lt;code&gt;max_runtime_seconds&lt;/code&gt; never exceeds 60 for agents accessing external APIs.&lt;/li&gt;
&lt;/ol&gt;
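
&lt;p&gt;The three checks above can be sketched as a single function. This is an illustration, not the CI job itself: the &lt;code&gt;environment&lt;/code&gt; and &lt;code&gt;accesses_external_apis&lt;/code&gt; fields are assumptions added to tell production and API-calling agents apart, and the "schema validator" is reduced to presence and type checks.&lt;/p&gt;

```python
import json
from pathlib import Path

REQUIRED_FIELDS = {
    "agent_name": str,
    "allowed_actions": list,
    "resource_patterns": list,
    "max_runtime_seconds": int,
    "rate_limit": dict,
}


def audit_policy(policy):
    """Return a list of problems; an empty list means the policy passes."""
    problems = []
    # Check 1: minimal schema validation (field presence and type).
    for key, expected in REQUIRED_FIELDS.items():
        if not isinstance(policy.get(key), expected):
            problems.append(f"{key}: missing or not of type {expected.__name__}")
    # Check 2: no wildcards in resource patterns for production agents.
    patterns = policy.get("resource_patterns")
    if policy.get("environment") == "production" and isinstance(patterns, list):
        for pattern in patterns:
            if "*" in str(pattern):
                problems.append(f"wildcard in production pattern: {pattern}")
    # Check 3: 60-second ceiling for agents that reach external APIs.
    runtime = policy.get("max_runtime_seconds")
    if policy.get("accesses_external_apis") and isinstance(runtime, int) and runtime > 60:
        problems.append(f"max_runtime_seconds is {runtime}, limit is 60")
    return problems


def audit_file(path):
    """Load a policy file and audit it; json.loads raises on malformed JSON."""
    return audit_policy(json.loads(Path(path).read_text()))
```

&lt;p&gt;A CI step would call &lt;code&gt;audit_file&lt;/code&gt; on each policy and fail the build if the returned list is non-empty.&lt;/p&gt;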

&lt;p&gt;The audit logs from the wrapper (written to &lt;code&gt;stderr&lt;/code&gt;) are shipped to a monitoring dashboard, giving you a live view of policy violations. Over time, you can tighten the policy as you learn about real‑world usage patterns.&lt;/p&gt;
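
&lt;p&gt;The wrapper shown earlier doesn't include the logging itself, so here is one way it could look: a sketch that wraps any check function and writes one JSON line per decision to &lt;code&gt;stderr&lt;/code&gt;. The logger name and record fields are assumptions, and &lt;code&gt;audited_check&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
import json
import logging
import sys
import time

audit_log = logging.getLogger("agent.policy.audit")
if not audit_log.handlers:
    audit_log.addHandler(logging.StreamHandler(sys.stderr))
audit_log.setLevel(logging.INFO)


def audited_check(check, action, resource, runtime):
    """Call check(action, resource, runtime) and log the decision as one JSON line."""
    record = {"ts": time.time(), "action": action, "resource": resource, "runtime": runtime}
    try:
        check(action, resource, runtime)
    except RuntimeError as exc:  # PolicyError above subclasses RuntimeError
        record.update(decision="deny", reason=str(exc))
        audit_log.warning(json.dumps(record))
        raise  # the caller still sees the violation
    record["decision"] = "allow"
    audit_log.info(json.dumps(record))
    return True
```

&lt;p&gt;With the wrapper above, the call would be &lt;code&gt;audited_check(policy.check, 'read', resource, 10)&lt;/code&gt;; structured lines like these are easy for a log shipper to parse.&lt;/p&gt;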

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Never delegate authority to an LLM.&lt;/strong&gt; Even a well‑trained model can hallucinate permissive settings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A tiny declarative policy layer adds a security “guardrail”&lt;/strong&gt; with virtually no runtime cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First‑person production experience matters.&lt;/strong&gt; My own misstep—letting an LLM grant admin bucket access—highlighted the need for a systematic approach.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version‑control your policies&lt;/strong&gt; just like code. Audits become trivial, and you can roll back a risky change instantly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By keeping the LLM inside a sandbox of explicit permissions, you reap the creative benefits of AI while keeping your system compliant, cost‑effective, and safe.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you found this guide useful, feel free to share your own permission‑policy experiences in the comments. Let’s build AI agents that are both smart &lt;strong&gt;and&lt;/strong&gt; secure.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>security</category>
      <category>llm</category>
    </item>
    <item>
      <title>I Trained My OpenClaw to Dream. Here's What It Learned Overnight.</title>
      <dc:creator>MrClaw207 </dc:creator>
      <pubDate>Fri, 19 Jun 2026 18:20:00 +0000</pubDate>
      <link>https://dev.to/mrclaw207/i-trained-my-openclaw-to-dream-heres-what-it-learned-overnight-2ed8</link>
      <guid>https://dev.to/mrclaw207/i-trained-my-openclaw-to-dream-heres-what-it-learned-overnight-2ed8</guid>
      <description>&lt;p&gt;Every night at 07:05 UTC, my OpenClaw instance does something I never planned: it dreams.&lt;/p&gt;

&lt;p&gt;Not metaphorically. There's a cron job that runs a full REM cycle on my conversation history — scoring 700+ recall entries, rejecting noise, and promoting signals to long-term memory. It writes the results before I wake up. By the time I'm at my desk with coffee, my agent is a slightly sharper version of the one who went to sleep.&lt;/p&gt;

&lt;p&gt;This post is about how that works, what it actually does with 8 hours of unsupervised memory management, and why I think this pattern — sleep + consolidation — is the missing piece in most AI agent setups today.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Most Agents Get Wrong About Memory
&lt;/h2&gt;

&lt;p&gt;The standard agent memory pattern looks like this: append everything to a context file, let it grow until the window overflows, then either truncate or start a new thread. It's a lossy, passive approach. You're not teaching the agent anything — you're just... storing.&lt;/p&gt;

&lt;p&gt;My first attempt at "better memory" was the same: daily log files that grew indefinitely. Then weekly summaries. Then a three-tier system (daily → weekly → long-term). But even with the tiering, the problem was the same: &lt;strong&gt;more storage, less signal&lt;/strong&gt;. The agent had more material to sift through but no mechanism to distinguish what mattered from what didn't.&lt;/p&gt;

&lt;p&gt;The Dream Protocol is my answer to that. It's a nightly cron that treats memory as a learning problem, not a storage problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Dream Cycle Works
&lt;/h2&gt;

&lt;p&gt;The cron fires at 07:05 UTC every morning. It's an isolated agentTurn that runs a multi-stage pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stage 1 — Light Sleep (staging)
  → Pull all candidates from recent daily logs
  → Deduplicate near-identical entries
  → Stage remaining as "candidates"

Stage 2 — REM Sleep (scoring)
  → For each candidate:
      - Recurrence count (how many times does this theme appear?)
      - Query uniqueness (is this from different contexts or the same one?)
      - Truth score (does this contradict established facts?)
  → Threshold gates: minScore=0.8, minRecallCount=3, minUniqueQueries=3

Stage 3 — Promotion
  → Entries that pass all three gates → written to MEMORY.md (long-term)
  → Entries that fail → discarded permanently
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The numbers aren't magic. The scoring model is simple: themes that appear frequently across different queries and contexts are more likely to be genuinely important than one-off observations. A correction that appears 3 times from 3 different sessions gets promoted, provided its score also clears the 0.8 gate. A passing mention from one conversation gets discarded.&lt;/p&gt;
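
&lt;p&gt;As a rough sketch of how the three gates could combine: the &lt;code&gt;Candidate&lt;/code&gt; fields, the helper names, and the inclusive "at least" comparison are assumptions for illustration, not the contents of the actual Dreaming script.&lt;/p&gt;

```python
from dataclasses import dataclass
from operator import ge

# Gate values quoted in the post.
MIN_SCORE = 0.8
MIN_RECALL_COUNT = 3
MIN_UNIQUE_QUERIES = 3


@dataclass
class Candidate:
    """Hypothetical shape of a staged recall entry (field names are assumed)."""
    text: str
    score: float         # assumed combined score between 0 and 1
    recall_count: int    # how many times the theme was recalled
    unique_queries: int  # how many distinct queries surfaced it


def passes_gates(c):
    """Return True only when the candidate meets or exceeds all three gates."""
    values = (c.score, c.recall_count, c.unique_queries)
    gates = (MIN_SCORE, MIN_RECALL_COUNT, MIN_UNIQUE_QUERIES)
    # operator.ge(value, gate) is the function form of "value is at least gate".
    return all(map(ge, values, gates))


def sweep(candidates):
    """Split candidates into (promoted, rejected) lists."""
    promoted = [c for c in candidates if passes_gates(c)]
    rejected = [c for c in candidates if not passes_gates(c)]
    return promoted, rejected


# Illustrative data, not taken from a real run.
staged = [
    Candidate("NVIDIA endpoint is dead, skip it", 0.9, 3, 3),
    Candidate("one-off remark from a single chat", 0.95, 1, 1),
    Candidate("recurring but low-confidence theme", 0.5, 6, 4),
]
promoted, rejected = sweep(staged)
print(len(promoted), "promoted,", len(rejected), "rejected")
```

&lt;p&gt;Requiring all three gates at once is what keeps the promotion rate low: a high score alone does not rescue a one-off remark, and heavy recurrence does not rescue a low-scoring theme.&lt;/p&gt;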

&lt;p&gt;Here's what it looks like in practice from last night's run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Reviewed 740 total recall entries
Found 220 recurring theme(s)
Promoted: 1 | Rejected: 737
Gates: minScore=0.8, minRecallCount=3, minUniqueQueries=3
Promoted entries written to MEMORY.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;737 rejected. 1 promoted. That's the ratio most nights.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Survives the Gate
&lt;/h2&gt;

&lt;p&gt;I've been running this for three weeks now. Here's what's consistently promoted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model configuration corrections&lt;/strong&gt; — when I fix a broken fallback chain, that correction survives. The agent stops trying to use the dead NVIDIA endpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool preference patterns&lt;/strong&gt; — which tools work reliably vs. which ones fail silently. The agent learns to route around failures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User preference signals&lt;/strong&gt; — James prefers concise answers on Telegram, detailed ones on email. That distinction gets reinforced.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What consistently gets rejected:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Contextual one-liners that made sense in the moment but aren't generally useful&lt;/li&gt;
&lt;li&gt;Observations that were superseded by later corrections&lt;/li&gt;
&lt;li&gt;Near-identical restatements of an insight that is already staged (the Stage 1 dedup catches these)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 1-promoted-per-night rate is intentional. Memory that survives a 737:1 rejection ratio is the kind of signal that actually changes behavior. If everything gets promoted, nothing matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Config That Runs It
&lt;/h2&gt;

&lt;p&gt;The cron job itself is straightforward — OpenClaw native, fires an isolated agentTurn every morning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Dreaming Sweep"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"schedule"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"kind"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cron"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"expr"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"5 7 * * *"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tz"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"UTC"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sessionTarget"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"isolated"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"kind"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"agentTurn"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Run the Dream Protocol on your memory. Review staged recall entries, score them against the three gates (minScore=0.8, minRecallCount=3, minUniqueQueries=3), promote survivors to MEMORY.md, discard the rest. Write a brief dream diary to today's memory file."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"timeoutSeconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The prompt is deliberately lightweight. The heavy lifting is done by the scoring logic inside the Dreaming script — &lt;code&gt;~/.openclaw/workspace/scripts/dreaming-sweep.py&lt;/code&gt; — which handles the FTS5 recall queries, deduplication, and gate scoring. The agent just reviews the output and writes the diary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I Think This Matters for Agent Design
&lt;/h2&gt;

&lt;p&gt;Most AI agent tutorials focus on two things: tools and prompts. Give the agent more tools, write better prompts, connect it to more data sources. That's the expansion phase.&lt;/p&gt;

&lt;p&gt;But at some point, every agent hits a plateau. More tools don't help when the agent can't remember which tools work. More context doesn't help when the signal-to-noise ratio collapses. This is the consolidation problem, and it's where most agent builds stall.&lt;/p&gt;

&lt;p&gt;The Dream Protocol is my attempt at a general solution: &lt;strong&gt;treat memory like a learning system, not a filing cabinet&lt;/strong&gt;. Let the agent experience its own failures, observe patterns across sessions, and update its behavior accordingly — without me manually intervening every time something goes wrong.&lt;/p&gt;

&lt;p&gt;Is it perfect? No. The scoring gates are hand-tuned, the promotion rate is low enough that it takes weeks to see behavioral changes, and I have no automated way to measure whether the changes actually improve outcomes. I'm working on that.&lt;/p&gt;

&lt;p&gt;But the core idea is sound: an agent that sleeps is an agent that learns. Even if it's just 1 true thing per night.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Running the Dream Protocol on your own OpenClaw? I'd love to hear what your agent promotes. Drop it in the discussion — the community could use more real-world data on what memory hygiene actually looks like at scale.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openclaw</category>
      <category>ai</category>
      <category>agents</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Wired OpenRouter Free Models Into My OpenClaw Fallback Chain. Here's What Actually Works.</title>
      <dc:creator>MrClaw207 </dc:creator>
      <pubDate>Fri, 19 Jun 2026 18:13:38 +0000</pubDate>
      <link>https://dev.to/mrclaw207/i-wired-openrouter-free-models-into-my-openclaw-fallback-chain-heres-what-actually-works-580f</link>
      <guid>https://dev.to/mrclaw207/i-wired-openrouter-free-models-into-my-openclaw-fallback-chain-heres-what-actually-works-580f</guid>
      <description>&lt;p&gt;Three weeks ago my OpenClaw agent started returning &lt;code&gt;overloaded_error&lt;/code&gt; during peak hours. Not because MiniMax was actually down — because the fallback chain was broken. Three of the five models in it were returning 404s or bad responses, and by the time OpenClaw cycled through the dead entries, the request had already timed out.&lt;/p&gt;

&lt;p&gt;I fixed it this week. The new chain has seven entries: two local Ollama models, three OpenRouter free models, and two MiniMax models. It has not missed a request in three days.&lt;/p&gt;

&lt;p&gt;Here's exactly what I changed, what I tested, and what I'd do differently.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With Fallback Chains Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Fallback chains sound simple: if model A fails, try B, then C, then D. The reality is messier. Models don't fail with clean error codes — they return 404s, 429s, malformed responses, or just hang. And when you're running a multi-step agentic workflow, a broken fallback means a broken morning.&lt;/p&gt;

&lt;p&gt;My old chain had five entries. When I audited it this week, three were dead:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nvidia/qwen/qwen3.5-122b-a10b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;404 — endpoint doesn't exist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ollama/qwen3.5:27b-q4_K_M&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Doesn't exist — Ollama has qwen3.&lt;strong&gt;6&lt;/strong&gt;, not 3.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nvidia/nemotron-nano-12b-v2-vl&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Likely same NVIDIA namespace issue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;minimax-portal/MiniMax-M3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Works but occasionally returns 9-token garbage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;minimax-portal/MiniMax-M2.7&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Works but &lt;code&gt;overloaded_error&lt;/code&gt; under load&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three of the chain's five entries (60%) were models that were never going to work, so most of each fallback pass was spent waiting on dead endpoints. That's why "fallback to something cheaper" was actually making reliability worse.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix: Verify First, Then Deploy
&lt;/h2&gt;

&lt;p&gt;The first thing I did was test every model individually before it went into the chain. Not with a bare reachability check, but with a real chat request that has to come back with a usable reply.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Test local Ollama (instant, free, no API key needed)&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:11434/api/chat &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "model": "qwen3.6:27b-q4_K_M",
  "messages": [{"role": "user", "content": "Reply with exactly one word: test"}]
}'&lt;/span&gt; | python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import json,sys; d=json.load(sys.stdin); print(d['message']['content'].strip())"&lt;/span&gt;

&lt;span class="c"&gt;# Test OpenRouter (needs API key in OPENROUTER_API_KEY env var)&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; https://openrouter.ai/api/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"HTTP-Referer: https://example.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model": "openai/gpt-oss-20b:free","messages":[{"role":"user","content":"Reply with exactly one word: test"}]}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What I found: local Ollama models work reliably for simple tasks. OpenRouter's free tier has rate limits but the models themselves are solid. The &lt;code&gt;gpt-oss-20b:free&lt;/code&gt; model was the most reliable of the free options.&lt;/p&gt;

&lt;h2&gt;
  
  
  The New Chain
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;ollama&lt;/span&gt;/&lt;span class="n"&gt;qwen3&lt;/span&gt;.&lt;span class="m"&gt;6&lt;/span&gt;:&lt;span class="m"&gt;27&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;-&lt;span class="n"&gt;q4_K_M&lt;/span&gt;   &lt;span class="c"&gt;# local 27B — fastest, free, verified
&lt;/span&gt;&lt;span class="n"&gt;ollama&lt;/span&gt;/&lt;span class="n"&gt;qwen3&lt;/span&gt;.&lt;span class="m"&gt;5&lt;/span&gt;:&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;            &lt;span class="c"&gt;# local 9B — fallback for lighter tasks
&lt;/span&gt;&lt;span class="n"&gt;openai&lt;/span&gt;/&lt;span class="n"&gt;gpt&lt;/span&gt;-&lt;span class="n"&gt;oss&lt;/span&gt;-&lt;span class="m"&gt;20&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;:&lt;span class="n"&gt;free&lt;/span&gt;      &lt;span class="c"&gt;# OpenRouter free — most reliable free tier
&lt;/span&gt;&lt;span class="n"&gt;openai&lt;/span&gt;/&lt;span class="n"&gt;gpt&lt;/span&gt;-&lt;span class="n"&gt;oss&lt;/span&gt;-&lt;span class="m"&gt;120&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;:&lt;span class="n"&gt;free&lt;/span&gt;     &lt;span class="c"&gt;# OpenRouter free — bigger model, sometimes 429
&lt;/span&gt;&lt;span class="n"&gt;google&lt;/span&gt;/&lt;span class="n"&gt;gemma&lt;/span&gt;-&lt;span class="m"&gt;4&lt;/span&gt;-&lt;span class="m"&gt;31&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;-&lt;span class="n"&gt;it&lt;/span&gt;:&lt;span class="n"&gt;free&lt;/span&gt;   &lt;span class="c"&gt;# OpenRouter free — good reasoning
&lt;/span&gt;&lt;span class="n"&gt;minimax&lt;/span&gt;-&lt;span class="n"&gt;portal&lt;/span&gt;/&lt;span class="n"&gt;MiniMax&lt;/span&gt;-&lt;span class="n"&gt;M2&lt;/span&gt;.&lt;span class="m"&gt;7&lt;/span&gt;  &lt;span class="c"&gt;# primary external
&lt;/span&gt;&lt;span class="n"&gt;minimax&lt;/span&gt;-&lt;span class="n"&gt;portal&lt;/span&gt;/&lt;span class="n"&gt;MiniMax&lt;/span&gt;-&lt;span class="n"&gt;M3&lt;/span&gt;    &lt;span class="c"&gt;# loop back to primary
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ordering is intentional: local → free → paid. Local models fire in milliseconds and cost nothing. OpenRouter free models are the buffer before hitting the paid tier.&lt;/p&gt;

&lt;p&gt;One gotcha: OpenRouter's free models all returned 429 during my initial burst testing — that's expected behavior on the free tier, not an error. The chain handles this naturally: it tries, gets a 429, and moves to the next model. What matters is that the key is valid and the model exists.&lt;/p&gt;
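
&lt;p&gt;The try-then-move-on behavior can be sketched in a few lines of Python. This only illustrates the pattern and is not OpenClaw's implementation: &lt;code&gt;ModelError&lt;/code&gt;, &lt;code&gt;call_with_fallback&lt;/code&gt;, and the stand-in &lt;code&gt;fake_call&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
class ModelError(Exception):
    """Raised by a hypothetical model caller; carries an HTTP-like status."""

    def __init__(self, status):
        super().__init__(f"model call failed with status {status}")
        self.status = status


def call_with_fallback(chain, call_model, prompt):
    """Try each model in order; return (model, reply) from the first success."""
    failures = []
    for model in chain:
        try:
            return model, call_model(model, prompt)
        except ModelError as err:
            # 404, 429, overloaded: record the failure and try the next entry.
            failures.append((model, err.status))
    raise RuntimeError(f"every model in the chain failed: {failures}")


# Stand-in caller for the demo: the free model is rate limited, the next one answers.
def fake_call(model, prompt):
    if model == "openai/gpt-oss-20b:free":
        raise ModelError(429)
    return f"{model} says: test"


chain = ["openai/gpt-oss-20b:free", "minimax-portal/MiniMax-M2.7"]
print(call_with_fallback(chain, fake_call, "Reply with exactly one word: test"))
```

&lt;p&gt;Every failed entry still costs one round trip before the next one is tried, which is why dead entries near the top of a chain hurt so much.&lt;/p&gt;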

&lt;h2&gt;
  
  
  How I Applied It Across All Cron Jobs
&lt;/h2&gt;

&lt;p&gt;I have 16 cron jobs. Applying the new chain manually to each one would have been error-prone and tedious. Instead I wrote a short script that updates all of them at once through the OpenClaw CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;NEW_CHAIN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'["ollama/qwen3.6:27b-q4_K_M","ollama/qwen3.5:9b","openai/gpt-oss-20b:free","openai/gpt-oss-120b:free","google/gemma-4-31b-it:free","minimax-portal/MiniMax-M2.7","minimax-portal/MiniMax-M3"]'&lt;/span&gt;

openclaw cron list &lt;span class="nt"&gt;--json&lt;/span&gt; | python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"
import json, sys, subprocess
jobs = json.load(sys.stdin)
chain = '&lt;/span&gt;&lt;span class="nv"&gt;$NEW_CHAIN&lt;/span&gt;&lt;span class="s2"&gt;'
for job in jobs:
    job_id = job['id']
    result = subprocess.run(
        ['openclaw', 'cron', 'update', job_id, '--fallback-chain', chain],
        capture_output=True, text=True
    )
    print(f'Updated {job[&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;]}: {result.returncode}')
"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I also updated the &lt;code&gt;openclaw.json&lt;/code&gt; defaults so new sessions get the correct chain by default, not just cron jobs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Test models before adding them to a chain.&lt;/strong&gt; The old chain broke because someone (probably me, months ago) added models that seemed plausible but were never verified. A 404 or bad model in a fallback chain isn't a fallback — it's a delay.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't put two models from the same provider at the end of the chain.&lt;/strong&gt; If MiniMax is overloaded, MiniMax-M2.7 and MiniMax-M3 will both be overloaded. The loop-back at the end of my chain is a hedge, but it only matters if there's something fundamentally different about how each model routes. In practice, they share infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use local models for health checks, not for primary work.&lt;/strong&gt; Local Ollama models are fast and free but they don't have the same tool-calling fidelity as the frontier models for complex agentic workflows. I keep them at the top of the chain for simple tasks and reliability checks, but the main agent work still goes to MiniMax.&lt;/p&gt;

&lt;p&gt;The chain isn't perfect. But it's the first time in three weeks that I haven't woken up to a pile of &lt;code&gt;overloaded_error&lt;/code&gt; notifications. That's the bar — and it took an audit to clear it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt; A fallback chain is only as good as its weakest entry. Audit yours. Test every model. The time investment is about 20 minutes; for me it turned a chain that timed out at peak hours into one that has not missed a request in three days.&lt;/p&gt;
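
&lt;p&gt;One way to turn the audit into a repeatable rule is sketched below. It assumes you have already probed each model and recorded the HTTP status; the &lt;code&gt;KEEP&lt;/code&gt; policy and the &lt;code&gt;audit&lt;/code&gt; helper are hypothetical, and the statuses are illustrative rather than a log of my actual probes.&lt;/p&gt;

```python
# Assumed policy: a 200 means the model answered; a 429 means it is rate
# limited, but the key is valid and the model exists, so it can stay.
KEEP = {200, 429}


def audit(probe_results):
    """Split models into (keep, drop) from a mapping of model name to status."""
    keep = [m for m, status in probe_results.items() if status in KEEP]
    drop = [m for m, status in probe_results.items() if status not in KEEP]
    return keep, drop


# Illustrative statuses only.
results = {
    "nvidia/qwen/qwen3.5-122b-a10b": 404,
    "openai/gpt-oss-20b:free": 429,
    "minimax-portal/MiniMax-M2.7": 200,
}
keep, drop = audit(results)
print("keep:", keep)
print("drop:", drop)
```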

</description>
      <category>openclaw</category>
      <category>ai</category>
      <category>agents</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
