<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: wei wu</title>
    <description>The latest articles on DEV Community by wei wu (@bisdom).</description>
    <link>https://dev.to/bisdom</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3862447%2F1177ba41-4c7f-40e2-a76e-63ddd8a68832.jpg</url>
      <title>DEV Community: wei wu</title>
      <link>https://dev.to/bisdom</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bisdom"/>
    <language>en</language>
    <item>
      <title>Why Agent Systems Need a Control Plane</title>
      <dc:creator>wei wu</dc:creator>
      <pubDate>Sun, 05 Apr 2026 15:26:52 +0000</pubDate>
      <link>https://dev.to/bisdom/why-agent-systems-need-a-control-plane-48id</link>
      <guid>https://dev.to/bisdom/why-agent-systems-need-a-control-plane-48id</guid>
      <description>&lt;h1&gt;
  
  
  Why Agent Systems Need a Control Plane
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;From Model Bridge to Runtime Governance — Lessons from Building an Agent Runtime with 7 Providers, 610 Tests, and 36 Versions&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Everyone is building agent systems. Few are governing them.&lt;/p&gt;

&lt;p&gt;The typical agent architecture looks clean on a whiteboard: User → LLM → Tools → Response. But in production, you quickly discover that the hard problems aren't about making the LLM smarter — they're about making the system &lt;strong&gt;controllable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Consider what happens when you deploy an agent that connects to external LLM providers and executes tools on behalf of users:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Provider A goes down.&lt;/strong&gt; Does your system fail? Retry forever? Switch to Provider B? How fast?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The LLM hallucinates a tool call&lt;/strong&gt; with wrong parameter names. Does the tool crash? Does the user see an error?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The conversation grows to 300KB.&lt;/strong&gt; Does the request time out? Does it consume your entire context window?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your cron job hasn't fired in 6 hours.&lt;/strong&gt; Do you notice? Does anyone get alerted?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two memory layers return contradictory information.&lt;/strong&gt; Which one does the LLM trust?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not capability problems. They are &lt;strong&gt;governance problems&lt;/strong&gt;. And they require a different kind of architecture: a control plane.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is an Agent Control Plane?
&lt;/h2&gt;

&lt;p&gt;The term comes from networking and Kubernetes: a control plane is the layer that &lt;strong&gt;manages how the system operates&lt;/strong&gt;, separate from the data plane that &lt;strong&gt;does the actual work&lt;/strong&gt;. In the agent runtime described here, the data plane is split into a capability plane and a memory plane:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────┐
│                Control Plane                     │
│  Policy │ Routing │ Observability │ Recovery     │
└──────────────────────┬──────────────────────────┘
                       │ governs
┌──────────────────────▼──────────────────────────┐
│                Capability Plane                  │
│  LLM Calls │ Tool Execution │ Smart Routing     │
└──────────────────────┬──────────────────────────┘
                       │ remembers
┌──────────────────────▼──────────────────────────┐
│                Memory Plane                      │
│  KB Search │ Multimodal │ Preferences │ Status   │
└─────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For agent systems, the control plane handles:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;Without It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Provider Routing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Select the right model for each request&lt;/td&gt;
&lt;td&gt;Hardcoded to one provider, no fallback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool Governance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Whitelist tools, fix malformed args, enforce limits&lt;/td&gt;
&lt;td&gt;LLM calls arbitrary tools with broken params&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Request Shaping&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Truncate oversized messages, manage context budget&lt;/td&gt;
&lt;td&gt;Context overflow, timeouts, OOM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Circuit Breaking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Detect failures, route to fallback, auto-recover&lt;/td&gt;
&lt;td&gt;Cascading failures, stuck requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Track latency/success/degradation with historical trends&lt;/td&gt;
&lt;td&gt;Flying blind in production&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Log state changes with tamper-evident chain hashing&lt;/td&gt;
&lt;td&gt;No accountability, no debugging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory Governance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deduplicate cross-layer results, resolve conflicts&lt;/td&gt;
&lt;td&gt;LLM gets contradictory context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
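
&lt;p&gt;The Audit row mentions tamper-evident chain hashing. As a minimal sketch of that general technique (not the project's actual audit code), each log entry can store a SHA-256 hash that covers the previous entry's hash, so editing or reordering any entry invalidates everything after it. The names &lt;code&gt;append_entry&lt;/code&gt;, &lt;code&gt;verify_chain&lt;/code&gt;, and &lt;code&gt;GENESIS&lt;/code&gt; are hypothetical, as is the entry layout.&lt;/p&gt;

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical anchor hash used before the first entry


def append_entry(log: list, event: dict) -> dict:
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    payload = json.dumps(event, sort_keys=True)
    digest = hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()
    entry = {"event": event, "prev": prev_hash, "hash": digest}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Recompute every hash; an edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        payload = json.dumps(entry["event"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True
```

&lt;p&gt;A chain like this only detects tampering; it does not prevent it. Anyone who can rewrite the whole log can also recompute the hashes, so the latest hash would need to be stored somewhere the writer cannot modify.&lt;/p&gt;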

&lt;h2&gt;
  
  
  The Key Insight: Governance Must Lead
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"The stronger capabilities get, the harder the system is to control — governance must lead, not follow."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is counterintuitive. When building an agent, the natural instinct is to focus on capabilities first: add more tools, connect more models, support more modalities. Governance feels like something you bolt on later.&lt;/p&gt;

&lt;p&gt;But in practice, every capability you add without governance creates &lt;strong&gt;uncontrolled blast radius&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adding a new LLM provider without fallback routing? One DNS change takes down your system.&lt;/li&gt;
&lt;li&gt;Letting the LLM call any tool? One hallucinated parameter corrupts your data.&lt;/li&gt;
&lt;li&gt;Growing the context window without truncation policy? One long conversation consumes 10x your token budget.&lt;/li&gt;
&lt;li&gt;Adding a memory layer without deduplication? The LLM sees the same paper three times from three sources.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern we discovered after 36 versions: &lt;strong&gt;build the control plane first, then add capabilities inside it.&lt;/strong&gt; Not the other way around.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: Three Planes in Practice
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Control Plane — The Governor
&lt;/h3&gt;

&lt;p&gt;The control plane is the thickest layer. It touches every request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Circuit Breaker&lt;/strong&gt; — zero-delay failover across 7 LLM providers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CircuitBreaker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;consecutive_failures&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;              &lt;span class="c1"&gt;# closed: try primary
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;open_since&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;reset_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;              &lt;span class="c1"&gt;# half-open: allow probe
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;                   &lt;span class="c1"&gt;# open: skip to fallback
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
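
&lt;p&gt;The excerpt above shows only the state check. A fuller sketch, under stated assumptions, might look like the following: &lt;code&gt;threshold&lt;/code&gt; and &lt;code&gt;reset_seconds&lt;/code&gt; become instance attributes, with defaults of 3 failures and 300 seconds taken from figures quoted elsewhere in this article. The methods &lt;code&gt;record_failure&lt;/code&gt; and &lt;code&gt;record_success&lt;/code&gt;, the injectable &lt;code&gt;clock&lt;/code&gt;, and the helper &lt;code&gt;call_with_fallback&lt;/code&gt; are hypothetical names added for illustration.&lt;/p&gt;

```python
import time


class CircuitBreaker:
    def __init__(self, threshold=3, reset_seconds=300.0, clock=time.time):
        self.threshold = threshold
        self.reset_seconds = reset_seconds
        self.clock = clock  # injectable so tests can control time
        self.consecutive_failures = 0
        self.open_since = 0.0

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold:
            # Start (or restart) the cool-down; a failed probe re-opens the circuit.
            self.open_since = self.clock()

    def record_success(self):
        self.consecutive_failures = 0

    def is_open(self):
        if self.threshold > self.consecutive_failures:
            return False  # closed: try primary
        if self.clock() - self.open_since >= self.reset_seconds:
            return False  # half-open: allow one probe
        return True  # open: skip to fallback


def call_with_fallback(breaker, primary, fallback, request):
    """Try the primary provider unless the circuit is open."""
    if breaker.is_open():
        return fallback(request)
    try:
        result = primary(request)
    except Exception:
        breaker.record_failure()
        return fallback(request)
    breaker.record_success()
    return result
```

&lt;p&gt;While the circuit is open, the primary is never called, so a dead provider adds no waiting time to each request. That is one plausible reading of the "zero-delay" claim.&lt;/p&gt;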



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Provider Compatibility Layer&lt;/strong&gt;: 7 providers (Qwen3, GPT-4o, Gemini, Claude, Kimi, MiniMax, GLM) with standardized auth, capability declarations, and a compatibility matrix&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool whitelist&lt;/strong&gt;: 14 allowed tools + 2 custom (search_kb, data_clean), schema simplification, auto-repair for 7 classes of malformed arguments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Request shaping&lt;/strong&gt;: Dynamic truncation based on context usage (&amp;gt;85% → aggressive 50KB, &amp;gt;70% → moderate 100KB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLO Dashboard&lt;/strong&gt;: 5 metrics with historical tracking, sparkline trends, hourly snapshots, threshold alerting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security boundary&lt;/strong&gt;: All services bind localhost, API keys via env vars only, automated leak scanning, 93/100 security score&lt;/li&gt;
&lt;/ul&gt;
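
&lt;p&gt;To make the tool whitelist and argument auto-repair concrete, here is a rough sketch of one way such a filter could work. It rejects tools that are not on the whitelist, parses arguments that arrive as a JSON string, renames known aliases, and strips unknown fields. The parameter names and the alias table are invented for illustration; only the tool names &lt;code&gt;search_kb&lt;/code&gt; and &lt;code&gt;data_clean&lt;/code&gt; come from the article, and the project's real repair rules are not shown here.&lt;/p&gt;

```python
import json

# Hypothetical whitelist: tool name -> allowed parameter names.
ALLOWED_TOOLS = {
    "search_kb": {"query", "top_k"},
    "data_clean": {"path", "mode"},
}
# Hypothetical aliases an LLM might emit instead of the real parameter name.
PARAM_ALIASES = {"q": "query", "search_query": "query", "limit": "top_k", "file": "path"}


def repair_tool_call(name, raw_args):
    """Return cleaned arguments for a whitelisted tool, or None to reject the call."""
    if name not in ALLOWED_TOOLS:
        return None
    if isinstance(raw_args, str):
        try:
            raw_args = json.loads(raw_args)
        except json.JSONDecodeError:
            return None
    if not isinstance(raw_args, dict):
        return None
    allowed = ALLOWED_TOOLS[name]
    cleaned = {}
    for key, value in raw_args.items():
        canonical = PARAM_ALIASES.get(key, key)
        # Drop unknown fields; the first value seen for a parameter wins.
        if canonical in allowed and canonical not in cleaned:
            cleaned[canonical] = value
    return cleaned
```

&lt;p&gt;For example, &lt;code&gt;repair_tool_call("search_kb", '{"q": "Qwen3", "limit": 5, "extra": 1}')&lt;/code&gt; returns &lt;code&gt;{"query": "Qwen3", "top_k": 5}&lt;/code&gt;.&lt;/p&gt;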

&lt;h3&gt;
  
  
  Capability Plane — The Worker
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Multi-provider LLM routing (Qwen3-235B primary → Gemini fallback, 0ms switchover)&lt;/li&gt;
&lt;li&gt;Multimodal: text → Qwen3, images → Qwen2.5-VL (auto-detected from message content)&lt;/li&gt;
&lt;li&gt;Custom tool injection: data_clean and search_kb intercepted by proxy, executed locally&lt;/li&gt;
&lt;li&gt;Smart routing: simple queries → fast model, complex → full model&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Memory Plane — The Rememberer
&lt;/h3&gt;

&lt;p&gt;This is where v2 of our architecture added the most value. Five scattered scripts became a unified memory system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# One query searches all memory layers
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memory_plane&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen3 performance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# → KB semantic results + multimodal matches + relevant preferences + active priorities
# → Cross-layer deduplication removes duplicates
# → Confidence scoring ranks KB (1.0) &amp;gt; multimodal (0.85) &amp;gt; status (0.7) &amp;gt; preferences (0.6)
# → Conflict resolver flags contradictions between layers
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;4 layers&lt;/strong&gt;: KB semantic search (local embeddings), multimodal memory (Gemini embeddings), user preferences (auto-learned), operational status&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-layer dedup&lt;/strong&gt;: Same filename or similar text across layers → merge, keep highest score&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence scoring&lt;/strong&gt;: Layer-based weights + freshness decay (&amp;gt;72h KB results get penalty)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conflict resolution&lt;/strong&gt;: When preferences contradict active priorities → annotate, penalize, let LLM decide&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graceful degradation&lt;/strong&gt;: Any layer can be unavailable without affecting others&lt;/li&gt;
&lt;/ul&gt;
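
&lt;p&gt;The scoring and deduplication rules above can be sketched roughly as follows. The layer weights and the 72-hour threshold are the ones quoted in this section; the size of the staleness penalty (&lt;code&gt;STALE_PENALTY&lt;/code&gt;), the hit fields (&lt;code&gt;layer&lt;/code&gt;, &lt;code&gt;filename&lt;/code&gt;, &lt;code&gt;similarity&lt;/code&gt;, &lt;code&gt;age_hours&lt;/code&gt;), and the function names are assumptions. This version deduplicates by filename only and leaves out similar-text matching.&lt;/p&gt;

```python
LAYER_WEIGHTS = {"kb": 1.0, "multimodal": 0.85, "status": 0.7, "preferences": 0.6}
STALE_AFTER_HOURS = 72
STALE_PENALTY = 0.8  # assumed multiplier; the article does not state the value


def score(hit):
    """Weight raw similarity by layer, then penalize stale KB results."""
    weighted = hit["similarity"] * LAYER_WEIGHTS[hit["layer"]]
    if hit["layer"] == "kb" and hit["age_hours"] > STALE_AFTER_HOURS:
        weighted *= STALE_PENALTY
    return weighted


def merge_hits(hits):
    """Keep the highest-scoring hit per filename, sorted best first."""
    best = {}
    for hit in hits:
        scored = dict(hit, score=score(hit))
        key = hit["filename"]
        if key not in best or scored["score"] > best[key]["score"]:
            best[key] = scored
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)
```

&lt;p&gt;With these weights, a KB hit and a multimodal hit for the same file with equal similarity collapse into a single result that carries the KB score.&lt;/p&gt;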

&lt;h2&gt;
  
  
  Evidence: 7 Fault Injection Experiments
&lt;/h2&gt;

&lt;p&gt;We built a reliability bench that simulates 7 production failure modes. The bench is mock-based, runs in under 3 seconds, and is integrated into CI:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Injection&lt;/th&gt;
&lt;th&gt;Control Plane Response&lt;/th&gt;
&lt;th&gt;Checks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Provider down&lt;/td&gt;
&lt;td&gt;3 consecutive failures&lt;/td&gt;
&lt;td&gt;Circuit opens → fallback → auto-heal&lt;/td&gt;
&lt;td&gt;10/10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Backend timeout&lt;/td&gt;
&lt;td&gt;Server hangs indefinitely&lt;/td&gt;
&lt;td&gt;Timeout at 1s, no thread leak&lt;/td&gt;
&lt;td&gt;2/2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Malformed args&lt;/td&gt;
&lt;td&gt;Wrong params, extra fields, bad JSON&lt;/td&gt;
&lt;td&gt;Auto-repair: 7 alias mappings + stripping&lt;/td&gt;
&lt;td&gt;7/7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Oversized request&lt;/td&gt;
&lt;td&gt;407KB message history&lt;/td&gt;
&lt;td&gt;Truncation to 197KB, system + recent kept&lt;/td&gt;
&lt;td&gt;6/6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;KB miss&lt;/td&gt;
&lt;td&gt;Nonexistent topic&lt;/td&gt;
&lt;td&gt;Graceful empty response&lt;/td&gt;
&lt;td&gt;9/9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Cron drift&lt;/td&gt;
&lt;td&gt;2-hour stale heartbeat&lt;/td&gt;
&lt;td&gt;Detected, 34 registry entries validated&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;State corruption&lt;/td&gt;
&lt;td&gt;Invalid/truncated/empty JSON&lt;/td&gt;
&lt;td&gt;Detected, atomic writes prevent corruption&lt;/td&gt;
&lt;td&gt;8/8&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

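&lt;p&gt;&lt;strong&gt;Result: 7/7 PASS, 47/47 checks.&lt;/strong&gt; Without the control plane, scenarios 1-4 cause user-visible failures. With it, they are contained: failover, argument repair, and truncation happen without user action, and a hung backend fails fast instead of blocking the request.&lt;/p&gt;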
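
&lt;p&gt;Scenario 4 keeps the system prompt and the most recent turns while dropping older history. A minimal sketch of that behavior, combined with the usage-based budgets quoted earlier (50KB above 85% context usage, 100KB above 70%), could look like this. The 200KB default budget, the message shape, and all function names are assumptions made for the example.&lt;/p&gt;

```python
import json


def pick_budget_bytes(context_usage):
    """Map context usage (0.0 to 1.0) to a history size budget in bytes."""
    if context_usage > 0.85:
        return 50 * 1024  # aggressive
    if context_usage > 0.70:
        return 100 * 1024  # moderate
    return 200 * 1024  # assumed default; not stated in the article


def message_size(message):
    """Approximate wire size of one message as UTF-8 encoded JSON."""
    return len(json.dumps(message, ensure_ascii=False).encode("utf-8"))


def truncate_history(messages, budget_bytes):
    """Keep all system messages, then as many of the newest others as fit."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    remaining = budget_bytes - sum(message_size(m) for m in system)
    kept = []
    for message in reversed(others):  # walk from newest to oldest
        size = message_size(message)
        if size > remaining:
            break
        kept.append(message)
        remaining -= size
    return system + list(reversed(kept))
```

&lt;p&gt;This sketch moves system messages to the front and drops whole messages rather than cutting inside one, which keeps the result valid at the cost of some unused budget.&lt;/p&gt;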

&lt;h3&gt;
  
  
  Production SLO Results
&lt;/h3&gt;

&lt;p&gt;From real production data:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SLO&lt;/th&gt;
&lt;th&gt;Target&lt;/th&gt;
&lt;th&gt;Actual&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Latency p95&lt;/td&gt;
&lt;td&gt;≤ 30s&lt;/td&gt;
&lt;td&gt;459ms&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Timeout rate&lt;/td&gt;
&lt;td&gt;≤ 3%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool success rate&lt;/td&gt;
&lt;td&gt;≥ 95%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Degradation rate&lt;/td&gt;
&lt;td&gt;≤ 5%&lt;/td&gt;
&lt;td&gt;1%&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto-recovery rate&lt;/td&gt;
&lt;td&gt;≥ 90%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
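
&lt;p&gt;A check like the one behind this table can be approximated in a few lines. The targets below are copied from the table; the metric key names, the nearest-rank percentile method, and the functions &lt;code&gt;p95&lt;/code&gt; and &lt;code&gt;evaluate&lt;/code&gt; are assumptions, since the article does not say how the dashboard computes its numbers.&lt;/p&gt;

```python
# Targets from the table above. "max": actual must not exceed the target;
# "min": actual must reach it. Rates are fractions, latency is in milliseconds.
SLO_TARGETS = {
    "latency_p95_ms": ("max", 30_000),
    "timeout_rate": ("max", 0.03),
    "tool_success_rate": ("min", 0.95),
    "degradation_rate": ("max", 0.05),
    "auto_recovery_rate": ("min", 0.90),
}


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def evaluate(actuals):
    """Return a PASS/FAIL verdict for every SLO in SLO_TARGETS."""
    verdicts = {}
    for name, (kind, target) in SLO_TARGETS.items():
        value = actuals[name]
        ok = target >= value if kind == "max" else value >= target
        verdicts[name] = "PASS" if ok else "FAIL"
    return verdicts
```

&lt;p&gt;Feeding in the values from the Actual column (459, 0.0, 1.0, 0.01, 1.0) yields PASS for all five metrics.&lt;/p&gt;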

&lt;h3&gt;
  
  
  Recovery Time Characteristics
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure Mode&lt;/th&gt;
&lt;th&gt;Detection&lt;/th&gt;
&lt;th&gt;Recovery&lt;/th&gt;
&lt;th&gt;User Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Primary LLM down&lt;/td&gt;
&lt;td&gt;Immediate&lt;/td&gt;
&lt;td&gt;0ms failover, 300s auto-heal&lt;/td&gt;
&lt;td&gt;Fallback model used&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backend timeout&lt;/td&gt;
&lt;td&gt;Configurable (1-300s)&lt;/td&gt;
&lt;td&gt;Immediate error return&lt;/td&gt;
&lt;td&gt;User retries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Malformed tool args&lt;/td&gt;
&lt;td&gt;Immediate&lt;/td&gt;
&lt;td&gt;0ms auto-repair&lt;/td&gt;
&lt;td&gt;None (transparent)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Oversized request&lt;/td&gt;
&lt;td&gt;Immediate&lt;/td&gt;
&lt;td&gt;0ms truncation&lt;/td&gt;
&lt;td&gt;Old context dropped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State corruption&lt;/td&gt;
&lt;td&gt;On next read&lt;/td&gt;
&lt;td&gt;Prevented by atomic writes&lt;/td&gt;
&lt;td&gt;None if writes are atomic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Lessons from 36 Versions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. 610 tests ≠ system works
&lt;/h3&gt;

&lt;p&gt;We had 393 tests passing when our PA (personal assistant) told users "I have no projects." The tests verified components; the failure was in the &lt;strong&gt;seams between components&lt;/strong&gt; — the system prompt was empty, the shared state wasn't being consumed. Lesson: &lt;strong&gt;test the system, not just the parts.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Every safety layer is a potential failure source
&lt;/h3&gt;

&lt;p&gt;After a crontab incident (all jobs wiped by &lt;code&gt;echo | crontab -&lt;/code&gt;), we added three protection layers. Then we had to debug the protection layers. Lesson: &lt;strong&gt;before adding safety, ask "who already handles this?"&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Memory without governance is noise
&lt;/h3&gt;

&lt;p&gt;We had 5 memory components producing results. But without deduplication, the LLM saw the same paper three times. Without confidence scoring, a stale preference ranked above a fresh semantic match. Without conflict resolution, contradictory signals confused the model. Lesson: &lt;strong&gt;memory is a governance problem too.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Atomic writes are non-negotiable
&lt;/h3&gt;

&lt;p&gt;Every state file uses the tmp-then-rename pattern. Without it, a single crash in the middle of a write could leave a half-written, corrupt state file. With atomic writes, you either have the old version or the new version, never a partial one.&lt;/p&gt;
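
&lt;p&gt;For readers unfamiliar with the tmp-then-rename pattern, a minimal Python sketch follows. The function name &lt;code&gt;atomic_write_json&lt;/code&gt; is hypothetical and the project's own helper may differ. The temporary file is created in the same directory as the target because the rename is only atomic within a single filesystem.&lt;/p&gt;

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    """Write JSON to a temp file beside the target, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    except BaseException:
        # Clean up the temp file; the original target is left untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

&lt;p&gt;If serialization fails or the process dies before &lt;code&gt;os.replace&lt;/code&gt;, readers still see the previous complete file.&lt;/p&gt;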

&lt;h3&gt;
  
  
  5. The version that matters is the one in /health
&lt;/h3&gt;

&lt;p&gt;We added the semver string (&lt;code&gt;0.36.0&lt;/code&gt;) to every &lt;code&gt;/health&lt;/code&gt; endpoint. When debugging production issues, the first question is always "which version is actually running?" — not which version you think is running.&lt;/p&gt;
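
&lt;p&gt;As an illustration only, a health endpoint that reports its version can be built with the Python standard library. The payload shape, the &lt;code&gt;status&lt;/code&gt; field, and the names &lt;code&gt;health_payload&lt;/code&gt;, &lt;code&gt;HealthHandler&lt;/code&gt;, and &lt;code&gt;make_server&lt;/code&gt; are assumptions; the article only states that the semver string is exposed on &lt;code&gt;/health&lt;/code&gt;.&lt;/p&gt;

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

VERSION = "0.36.0"  # the semver string quoted in the article


def health_payload():
    """Hypothetical response body; only the version field is from the article."""
    return {"status": "ok", "version": VERSION}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        body = json.dumps(health_payload()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def make_server(port=0):
    """Bind to localhost only; port 0 asks the OS for a free port."""
    return HTTPServer(("127.0.0.1", port), HealthHandler)
```

&lt;p&gt;Calling &lt;code&gt;make_server(8080).serve_forever()&lt;/code&gt; would serve the endpoint on port 8080. Keeping the version in one constant means the running process reports what was actually deployed.&lt;/p&gt;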

&lt;h2&gt;
  
  
  The Argument
&lt;/h2&gt;

&lt;p&gt;Agent systems are rapidly gaining capabilities. Models get smarter, tools get more powerful, context windows get larger, memory systems get richer. But without a control plane:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Failures cascade&lt;/strong&gt; because there's no circuit breaker&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Costs explode&lt;/strong&gt; because there's no request shaping&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory contradicts itself&lt;/strong&gt; because there's no cross-layer governance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging is impossible&lt;/strong&gt; because there's no observability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recovery is manual&lt;/strong&gt; because there's no auto-healing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent ecosystem is building ever-more-capable data planes. What's missing — and what we've spent 36 versions building — is the governance layer that makes them production-grade.&lt;/p&gt;

&lt;p&gt;An agent control plane isn't a nice-to-have. It's the difference between a demo and a system.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Build the control plane first. Then add capabilities inside it. Not the other way around.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;This article is based on &lt;a href="https://github.com/bisdom-cell/openclaw-model-bridge" rel="noopener noreferrer"&gt;openclaw-model-bridge&lt;/a&gt; (v0.36.0), an open-source agent runtime control plane. 7 LLM providers, 610 tests across 23 suites, 7 fault injection scenarios, and 12 months of production operation serving a WhatsApp-based AI assistant.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>architecture</category>
      <category>llm</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
