DEV Community

wei wu

Why Agent Systems Need a Control Plane

From Model Bridge to Runtime Governance — Lessons from Building an Agent Runtime with 7 Providers, 610 Tests, and 36 Versions


The Problem Nobody Talks About

Everyone is building agent systems. Few are governing them.

The typical agent architecture looks clean on a whiteboard: User → LLM → Tools → Response. But in production, you quickly discover that the hard problems aren't about making the LLM smarter — they're about making the system controllable.

Consider what happens when you deploy an agent that connects to external LLM providers and executes tools on behalf of users:

  • Provider A goes down. Does your system fail? Retry forever? Switch to Provider B? How fast?
  • The LLM hallucinates a tool call with wrong parameter names. Does the tool crash? Does the user see an error?
  • The conversation grows to 300KB. Does the request time out? Does it consume your entire context window?
  • Your cron job hasn't fired in 6 hours. Do you notice? Does anyone get alerted?
  • Two memory layers return contradictory information. Which one does the LLM trust?

These are not capability problems. They are governance problems. And they require a different kind of architecture: a control plane.

What Is an Agent Control Plane?

Borrowing from networking and Kubernetes, a control plane is the layer that manages how the system operates, separate from the data plane that does the actual work.

┌─────────────────────────────────────────────────┐
│                Control Plane                     │
│  Policy │ Routing │ Observability │ Recovery     │
└──────────────────────┬──────────────────────────┘
                       │ governs
┌──────────────────────▼──────────────────────────┐
│                Capability Plane                  │
│  LLM Calls │ Tool Execution │ Smart Routing     │
└──────────────────────┬──────────────────────────┘
                       │ remembers
┌──────────────────────▼──────────────────────────┐
│                Memory Plane                      │
│  KB Search │ Multimodal │ Preferences │ Status   │
└─────────────────────────────────────────────────┘

For agent systems, the control plane handles:

| Concern | What It Does | Without It |
| --- | --- | --- |
| Provider Routing | Select the right model for each request | Hardcoded to one provider, no fallback |
| Tool Governance | Whitelist tools, fix malformed args, enforce limits | LLM calls arbitrary tools with broken params |
| Request Shaping | Truncate oversized messages, manage context budget | Context overflow, timeouts, OOM |
| Circuit Breaking | Detect failures, route to fallback, auto-recover | Cascading failures, stuck requests |
| Observability | Track latency/success/degradation with historical trends | Flying blind in production |
| Audit | Log state changes with tamper-evident chain hashing | No accountability, no debugging |
| Memory Governance | Deduplicate cross-layer results, resolve conflicts | LLM gets contradictory context |

The Key Insight: Governance Must Lead

"The stronger capabilities get, the harder the system is to control — governance must lead, not follow."

This is counterintuitive. When building an agent, the natural instinct is to focus on capabilities first: add more tools, connect more models, support more modalities. Governance feels like something you bolt on later.

But in practice, every capability you add without governance creates uncontrolled blast radius:

  • Adding a new LLM provider without fallback routing? One DNS change takes down your system.
  • Letting the LLM call any tool? One hallucinated parameter corrupts your data.
  • Growing the context window without truncation policy? One long conversation consumes 10x your token budget.
  • Adding a memory layer without deduplication? The LLM sees the same paper three times from three sources.

The pattern we discovered after 36 versions: build the control plane first, then add capabilities inside it. Not the other way around.

Architecture: Three Planes in Practice

Control Plane — The Governor

The control plane is the thickest layer. It touches every request.

Circuit Breaker — zero-delay failover across 7 LLM providers:

class CircuitBreaker:
    def is_open(self):
        if self.consecutive_failures < self.threshold:
            return False              # closed: try primary
        if time.time() - self.open_since >= self.reset_seconds:
            return False              # half-open: allow a probe
        return True                   # open: skip to fallback
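To make the breaker above concrete, here is a self-contained sketch with the failure bookkeeping and the call-site it implies. The names `record_failure`, `record_success`, and `route`, and the default thresholds, are our illustrative choices, not necessarily the project's:

```python
import time

class CircuitBreaker:
    def __init__(self, threshold=3, reset_seconds=300):
        self.threshold = threshold
        self.reset_seconds = reset_seconds
        self.consecutive_failures = 0
        self.open_since = 0.0

    def is_open(self):
        if self.consecutive_failures < self.threshold:
            return False              # closed: try primary
        if time.time() - self.open_since >= self.reset_seconds:
            return False              # half-open: allow a probe
        return True                   # open: skip to fallback

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold:
            self.open_since = time.time()  # trip the breaker

    def record_success(self):
        self.consecutive_failures = 0      # probe succeeded: close the circuit

def route(breaker, call_primary, call_fallback, payload):
    # Zero-delay failover: an open breaker never waits on the primary.
    if breaker.is_open():
        return call_fallback(payload)
    try:
        result = call_primary(payload)
        breaker.record_success()
        return result
    except Exception:
        breaker.record_failure()
        return call_fallback(payload)
```

The key property: once the breaker trips, requests go straight to the fallback with no probe and no wait, until the reset window elapses and a half-open probe is allowed through.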
  • Provider Compatibility Layer: 7 providers (Qwen3, GPT-4o, Gemini, Claude, Kimi, MiniMax, GLM) with standardized auth, capability declarations, and a compatibility matrix
  • Tool whitelist: 14 allowed tools + 2 custom (search_kb, data_clean), schema simplification, auto-repair for 7 classes of malformed arguments
  • Request shaping: Dynamic truncation based on context usage (>85% → aggressive 50KB, >70% → moderate 100KB)
  • SLO Dashboard: 5 metrics with historical tracking, sparkline trends, hourly snapshots, threshold alerting
  • Security boundary: All services bind localhost, API keys via env vars only, automated leak scanning, 93/100 security score
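The request-shaping policy above can be sketched roughly like this. The usage thresholds and byte caps come from the bullet; the function names and the "keep the system prompt plus the most recent turns" packing strategy are illustrative assumptions:

```python
def truncation_budget(context_usage_ratio):
    # Thresholds and caps mirror the policy above.
    if context_usage_ratio > 0.85:
        return 50_000    # aggressive: 50KB
    if context_usage_ratio > 0.70:
        return 100_000   # moderate: 100KB
    return None          # below 70%: no truncation needed

def shape_messages(messages, context_usage_ratio):
    """Keep the system prompt and as many recent turns as fit the budget.
    `messages` is a list of {"role": ..., "content": ...} dicts."""
    budget = truncation_budget(context_usage_ratio)
    if budget is None:
        return messages
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept, used = [], sum(len(m["content"]) for m in system)
    for m in reversed(rest):              # walk newest-first
        if used + len(m["content"]) > budget:
            break                         # oldest context gets dropped
        kept.append(m)
        used += len(m["content"])
    return system + list(reversed(kept))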

Capability Plane — The Worker

  • Multi-provider LLM routing (Qwen3-235B primary → Gemini fallback, 0ms switchover)
  • Multimodal: text → Qwen3, images → Qwen2.5-VL (auto-detected from message content)
  • Custom tool injection: data_clean and search_kb intercepted by proxy, executed locally
  • Smart routing: simple queries → fast model, complex → full model
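A minimal sketch of what that routing decision might look like. The multimodal and full-model identifiers come from the bullets above, but the fast-tier name, length cutoff, and keyword heuristic are hypothetical:

```python
def pick_model(query, has_image=False):
    # Illustrative routing heuristic; thresholds and markers are assumptions.
    if has_image:
        return "qwen2.5-vl"                 # multimodal path (auto-detected)
    complex_markers = ("analyze", "compare", "step by step", "plan")
    if len(query) > 400 or any(k in query.lower() for k in complex_markers):
        return "qwen3-235b"                 # full model for complex queries
    return "qwen3-fast"                     # hypothetical fast tier
```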

Memory Plane — The Rememberer

This is where v2 of our architecture added the most value. Five scattered scripts became a unified memory system:

# One query searches all memory layers
results = memory_plane.query("Qwen3 performance")
# → KB semantic results + multimodal matches + relevant preferences + active priorities
# → Cross-layer deduplication removes duplicates
# → Confidence scoring ranks KB (1.0) > multimodal (0.85) > status (0.7) > preferences (0.6)
# → Conflict resolver flags contradictions between layers
  • 4 layers: KB semantic search (local embeddings), multimodal memory (Gemini embeddings), user preferences (auto-learned), operational status
  • Cross-layer dedup: Same filename or similar text across layers → merge, keep highest score
  • Confidence scoring: Layer-based weights + freshness decay (>72h KB results get penalty)
  • Conflict resolution: When preferences contradict active priorities → annotate, penalize, let LLM decide
  • Graceful degradation: Any layer can be unavailable without affecting others
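A rough sketch of how cross-layer scoring and deduplication might fit together, using the layer weights and 72-hour freshness rule from the bullets above. The field names and the 0.8 penalty factor are assumptions:

```python
LAYER_WEIGHT = {"kb": 1.0, "multimodal": 0.85, "status": 0.7, "preferences": 0.6}

def confidence(result):
    # Layer-based weight times relevance, with freshness decay for stale KB hits.
    score = LAYER_WEIGHT[result["layer"]] * result["relevance"]
    if result["layer"] == "kb" and result.get("age_hours", 0) > 72:
        score *= 0.8  # hypothetical freshness penalty
    return score

def dedupe(results):
    # Same filename (or same text) across layers → keep the highest-scoring copy.
    best = {}
    for r in results:
        key = r.get("filename") or r["text"]
        if key not in best or confidence(r) > confidence(best[key]):
            best[key] = r
    return sorted(best.values(), key=confidence, reverse=True)
```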

Evidence: 7 Fault Injection Experiments

We built a reliability bench that simulates 7 production failure modes. All mock-based, runs in < 3 seconds, integrated into CI:

| # | Scenario | Injection | Control Plane Response | Checks |
| --- | --- | --- | --- | --- |
| 1 | Provider down | 3 consecutive failures | Circuit opens → fallback → auto-heal | 10/10 |
| 2 | Backend timeout | Server hangs indefinitely | Timeout at 1s, no thread leak | 2/2 |
| 3 | Malformed args | Wrong params, extra fields, bad JSON | Auto-repair: 7 alias mappings + stripping | 7/7 |
| 4 | Oversized request | 407KB message history | Truncation to 197KB, system + recent kept | 6/6 |
| 5 | KB miss-hit | Nonexistent topic | Graceful empty response | 9/9 |
| 6 | Cron drift | 2-hour stale heartbeat | Detected, 34 registry entries validated | 5/5 |
| 7 | State corruption | Invalid/truncated/empty JSON | Detected, atomic writes prevent corruption | 8/8 |

Result: 7/7 PASS, 47/47 checks. Without the control plane, scenarios 1-4 cause user-visible failures. With it, they're handled transparently.
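Scenario 3's auto-repair can be illustrated with a sketch like this. The alias table, the allowed-field set, and the empty-dict behavior on unparseable JSON are illustrative, not the project's actual seven mappings:

```python
import json

# Hypothetical alias table: common LLM misspellings → canonical parameter names.
ALIASES = {"q": "query", "queryString": "query", "max": "limit"}

def repair_args(raw, allowed=("query", "limit")):
    """Normalize a tool-call argument payload before execution."""
    try:
        args = json.loads(raw) if isinstance(raw, str) else dict(raw)
    except json.JSONDecodeError:
        return {}                       # bad JSON → empty args, let the tool validate
    fixed = {}
    for key, value in args.items():
        key = ALIASES.get(key, key)     # map known aliases to canonical names
        if key in allowed:              # strip hallucinated extra fields
            fixed[key] = value
    return fixed
```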

Production SLO Results

From real production data:

| SLO | Target | Actual | Verdict |
| --- | --- | --- | --- |
| Latency p95 | ≤ 30s | 459ms | PASS |
| Timeout rate | ≤ 3% | 0% | PASS |
| Tool success rate | ≥ 95% | 100% | PASS |
| Degradation rate | ≤ 5% | 1% | PASS |
| Auto-recovery rate | ≥ 90% | 100% | PASS |

Recovery Time Characteristics

| Failure Mode | Detection | Recovery | User Impact |
| --- | --- | --- | --- |
| Primary LLM down | Immediate | 0ms failover, 300s auto-heal | Fallback model used |
| Backend timeout | Configurable (1–300s) | Immediate error return | User retries |
| Malformed tool args | Immediate | 0ms auto-repair | None (transparent) |
| Oversized request | Immediate | 0ms truncation | Old context dropped |
| State corruption | On next read | Atomic writes prevent corruption | None if writes are atomic |

Lessons from 36 Versions

1. 610 tests ≠ system works

We had 393 tests passing when our PA (personal assistant) told users "I have no projects." The tests verified components; the failure was in the seams between components — the system prompt was empty, the shared state wasn't being consumed. Lesson: test the system, not just the parts.

2. Every safety layer is a potential failure source

After a crontab incident (all jobs wiped by echo | crontab -), we added three protection layers. Then we had to debug the protection layers. Lesson: before adding safety, ask "who already handles this?"

3. Memory without governance is noise

We had 5 memory components producing results. But without deduplication, the LLM saw the same paper three times. Without confidence scoring, a stale preference ranked above a fresh semantic match. Without conflict resolution, contradictory signals confused the model. Lesson: memory is a governance problem too.

4. Atomic writes are non-negotiable

Every state file uses the tmp-then-rename pattern. One crash during a write would corrupt state. With atomic writes, you either have the old version or the new version, never a partial one.
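A minimal version of the tmp-then-rename pattern in Python. `atomic_write_json` is an illustrative helper, not the project's actual implementation:

```python
import json
import os
import tempfile

def atomic_write_json(path, data):
    # Write to a temp file in the same directory, fsync, then rename over
    # the target. os.replace is atomic on POSIX, so a concurrent reader
    # sees either the old file or the new one, never a partial write.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)      # a crash mid-write leaves no partial state file
        raise
```

Writing to the same directory matters: a rename is only atomic within one filesystem, so the temp file must not live on a different mount (e.g. `/tmp`).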

5. The version that matters is the one in /health

We added the semver string (0.36.0) to every /health endpoint. When debugging production issues, the first question is always "which version is actually running?" — not which version you think is running.

The Argument

Agent systems are rapidly gaining capabilities. Models get smarter, tools get more powerful, context windows get larger, memory systems get richer. But without a control plane:

  • Failures cascade because there's no circuit breaker
  • Costs explode because there's no request shaping
  • Memory contradicts itself because there's no cross-layer governance
  • Debugging is impossible because there's no observability
  • Recovery is manual because there's no auto-healing

The agent ecosystem is building ever-more-capable data planes. What's missing — and what we've spent 36 versions building — is the governance layer that makes them production-grade.

An agent control plane isn't a nice-to-have. It's the difference between a demo and a system.

Build the control plane first. Then add capabilities inside it. Not the other way around.


This article is based on openclaw-model-bridge (v0.36.0), an open-source agent runtime control plane. 7 LLM providers, 610 tests across 23 suites, 7 fault injection scenarios, and 12 months of production operation serving a WhatsApp-based AI assistant.
