I Built a Control Plane for My AI Agent — Because It Kept Making the Same Mistakes

#claude #ai #automation #opensource

I run a Claude agent 24/7. It writes code, deploys services, manages my side projects. Sounds cool, right?

Except it kept doing dumb things. And I'd only find out after the damage was done.

The Three Problems Nobody Talks About

When you let an AI agent run autonomously, you quickly hit three walls:

1. You have no idea what it's doing.
You're tailing logs, scrolling through terminal output, trying to piece together what happened while you were asleep. The agent ran for 8 hours. What did it accomplish? Nobody knows.

2. It repeats the same mistakes.
Last week it used pkill -f and crashed my system. I told it not to. Next session? Same mistake. Every conversation starts from zero.

3. When it does something wrong, you find out too late.
It spent 3 hours going in circles on a task that should have taken 20 minutes. It installed packages I explicitly said not to. It pushed broken code. By the time you notice, it's already done.

The Fix: Make the Agent Report Its Own Status

I built Evolve — a control plane for autonomous AI agents.

The key insight: don't monitor the agent from outside. Make it monitor itself.

# The agent reports: "I'm coding, 40% done"
curl -X POST $EVOLVE_URL/api/agent/heartbeat \
  -d '{"activity":"coding","description":"implementing auth module","progress_pct":40}'

# The agent reports: "I discovered something important"
curl -X POST $EVOLVE_URL/api/agent/discovery \
  -d '{"title":"Rate limit found","content":"Max 3 posts/day","priority":"high"}'

# The agent reports: "Here's what I learned today"
curl -X POST $EVOLVE_URL/api/agent/review \
  -d '{"accomplished":["API integration"],"learned":["Use MD5 not SHA256"]}'

The rule is simple: No report = work doesn't exist. This is baked into the agent's prompt. It must call these APIs, or the work doesn't count.

6 reporting endpoints cover everything: Heartbeat, Deliverable, Discovery, Workflow, Upgrade Proposal, and Review.

The Evolve Dashboard — real-time agent status, system health, active sessions, and recent tasks at a glance.

The Part That Changed Everything: Knowledge That Persists

This is the core innovation. Most agent frameworks start from zero every conversation. Evolve creates a closed-loop learning system:

Agent makes mistake → Reports via review API
        ↓
LLM evaluates: "This is a 10/10 critical lesson"
        ↓
Stored in layered knowledge base:
  • Permanent (score ≥8): Core lessons, never expire
  • Recent (5-7): Useful but temporal, 30-day TTL
  • Task: Matched to current work context
        ↓
Next startup: Injected into agent's prompt
        ↓
Agent never makes that mistake again

The key design decisions:

Refinement, not storage. A secondary LLM scores each lesson (1-10), distills it to one sentence, and tags it. Raw data goes in, actionable knowledge comes out.
Three-layer injection. Only the most relevant knowledge enters the prompt. Not everything — that would blow up the context window.
Auto-expiry. Low-value knowledge expires after 30 days. The knowledge base stays lean.

AI Reviewing AI

One agent works. Another reviews its work.

The Supervisor page — daily analysis reports with heartbeat, deliverable, and discovery counts.

Click "Analyze" in the dashboard → Evolve reads the full conversation log (JSONL) → extracts key actions → compresses to ~6000 chars → sends to a secondary LLM for analysis:

Was each decision reasonable?
Any idle loops or wasted effort?
Did it follow instructions?
Efficiency score + improvement suggestions

This costs almost nothing because the reviewer is a cheaper model, not Claude.

The 24/7 Survival Engine

The agent doesn't just run — it survives:

Watchdog: Health check every 10 seconds. Hung? Auto-revived.
Heartbeat detection: 5 min silence → gentle nudge. 15 min → context-aware intervention.
Crash recovery: Restart with --resume, inject historical knowledge, continue seamlessly.
Web terminal: Operate the tmux session from your browser. No SSH needed.

Runtime Permission Controls

Toggle what your agent can do from the dashboard. No restart, no code changes:

Capability toggles and behavior settings — control autonomy level, risk tolerance, and more.

Extensions Management

Full visibility into skills, MCP servers, and plugins — with auto-tagging and source tracking.

Harness Engineering

I call this approach Harness Engineering — building infrastructure that wraps, constrains, and amplifies AI models.

Traditional: Better Model → Better Results
Harness Eng: Same Model + Better Harness → Dramatically Better Results

The model is a commodity. Two teams using the same Claude will get wildly different results based on their harness quality. Evolve is that harness.

Try It

git clone https://github.com/xmqywx/Evolve.git
cd Evolve
python -m venv .venv && .venv/bin/pip install -r requirements.txt
cd web && npm install && npm run build && cd ..
cp config.yaml.example config.yaml
.venv/bin/python run.py
# Visit http://localhost:3818