DEV Community

Charles Wu for seekdb


Harness Engineering in Practice: Building a 6-Agent System That Runs Itself

“Six agents” here means one orchestrator (Zoe) plus five specialist agents. Six ACP coding experts run as concurrent implementation workers — not counted in that headline number.

Your Day Has Been Taken Over

Overnight, the trading agent ships the prior US session wrap-up. By morning, the macro analyst has the pre-market brief ready. The butler has pushed weather, schedule, and to-dos. AINews (AI Sentinel) has scanned GitHub Trending, arXiv’s latest papers, and 100+ sources — 18+ curated items ranked by importance. Content (Content Strategist) is tracking trending topics across 50+ platforms.

Here’s what matters most to me — automatic tracking of AI dynamics and tech trends. After discovering valuable projects or papers, the system doesn’t just push news — it evaluates impact on our systems and provides P0/P1/P2 action recommendations. Valuable discoveries enter Zoe’s Tech Radar (Zoe is the CTO Agent), going through evaluation → decision → delegated coding implementation.

60 cron tasks run automatically every day (3 AM backup to 11:45 PM reflection). Agents are evolving on their own — mistakes are remembered, recurrence rates drop significantly. This isn’t rules I wrote — it’s autonomous iteration from .learnings/ to MEMORY.md.

System: 1 orchestrator (Zoe) + 5 specialized agents (AINews, Trading, Macro, Content, Butler) + 6 ACP coding experts + 60 cron tasks + 100+ Skills + ~30 configured model profiles + 23 automatic recoveries in two weeks.

Note: Metrics based on February-March 2026 monitoring. Individual results may vary.

System Architecture

┌─────────────────────────────────────────────┐
│              User (Human)                   │
│    Requirements + Key Node Approval         │
└─────────────────┬───────────────────────────┘
                  │
         ┌────────▼────────┐
         │   Zoe (CTO)     │
         │  3x Daily Check │
         └────────┬────────┘
                  │
    ┌─────────────┼─────────────┐
    │             │             │
┌───▼───┐   ┌────▼────┐   ┌───▼───┐
│AINews │   │ Trading │   │ Macro │
└───┬───┘   └────┬────┘   └───┬───┘
    │            │            │
    └────────────┼────────────┘
                 │
        ┌────────▼────────┐
        │ Content + Butler│
        └────────┬────────┘
                 │
        ┌────────▼────────┐
        │   Event Bus     │
        │ + Shared Context│
        └────────┬────────┘
                 │
        ┌────────▼────────┐
        │ ACP Coding      │
        │ (6 concurrent)  │
        └─────────────────┘

Key Design Decisions:

Agents Evolving Autonomously

  • Designed protocols — Zoe diagnosed communication issues, designed three-state protocol (request → confirmed → final, with silent as the default "no news is good news" state), solidified into AGENTS.md

  • Self-developed Skills — Content researched ways to make drafts sound less generically LLM-written (“de-AI” polish), wrote Skills, published to ClawHub (shared repository)

  • Strategy roundtables — Macro + Trading produce weekly reports with data snapshots, position recommendations, stop-loss discipline

  • Task Watcher — Zoe designed cron-level Task Callback Event Bus for async monitoring

My role: Set up framework, establish constraints, confirm direction. Requirement discovery, solution research, protocol design, implementation — all done by agents.

Team: 1+5+6 Formation

Zoe (CTO / Chief Orchestrator)

3 daily inspections (10:00/14:00/22:00 PT): cron execution, disk usage, session health, Chrome DevTools Protocol (CDP) leak checks, .learnings/ pending, shared-context/ timestamps.

Weekly: Analyze each agent’s MEMORY.md, execute layered compression.

Key capability: Solution design — three-state protocol, Task Watcher, Communication Guardrail framework, all designed autonomously.

AINews (AI Sentinel) — Intelligence Hub

Collects from 100+ sources daily: GitHub Trending, arXiv, RSS, HackerNews, Reddit. 7 cron tasks: morning brief (08:30), midday paper (12:00), evening trends (20:00).

Critical capability: Proactive tech impact evaluation. Discovered ReMe framework → proposed to Zoe → I confirmed → agents executed.

Toolchain: github_trending.py, rss_aggregator.py, arxiv_papers.py, Tavily, agent-browser. Anti-hallucination: every item MUST have a URL, reachability self-check, and unverifiable items labeled single-source.

Trading (Quantitative Analyst)

21 cron tasks (densest load). 20 quant tools, 15 Skills (68K+ lines), 65/35 scoring (tool/AI). Covers US stocks + commodities + crypto.

Four-step framework: Macro factors → scoring (technical 25% / flow 30% / fundamentals 10% / sentiment 20% / market 15%) → cross-check (sanity-check vs. macro and flow) → target + score + stop-loss + confidence.
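The scoring weights and hard rules above can be sketched as follows. The weights come straight from the four-step framework; the entry threshold of 60 in `recommendation` is my illustrative assumption, not a documented rule.

```python
# Factor weights from the four-step framework (fractions of the final score).
WEIGHTS = {"technical": 0.25, "flow": 0.30, "fundamentals": 0.10,
           "sentiment": 0.20, "market": 0.15}

def composite_score(factors):
    """Weighted blend of per-factor scores (each on a 0-100 scale)."""
    missing = set(WEIGHTS) - set(factors)
    if missing:
        raise ValueError(f"missing factor scores: {sorted(missing)}")
    return sum(WEIGHTS[k] * factors[k] for k in WEIGHTS)

def recommendation(score, confidence, stop_loss):
    """Apply the hard rules: no defined stop means no entry;
    confidence below 60% means wait. The 60-point entry bar is
    an assumption for illustration."""
    if stop_loss is None:
        return "no-entry"
    if confidence < 0.60:
        return "wait"
    return "enter" if score >= 60 else "pass"
```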

Not financial advice — automated research output only; you are responsible for any real-money decisions.

Hard rules (system policy, not investment advice): no entry without a defined stop, never fabricate data, confidence <60% = “wait.”

Macro (Chief Economist)

9 cron tasks: Morning (07:50) → Midday (12:30) → Evening (18:00) → US pre-market (22:00) → morning digest of the prior US session (05:20 PT) — scheduled after the cash close, not at the closing bell. Sunday weekly review, which Trading references for its market review.

Discipline: Cite sources, distinguish facts vs judgments, mark confidence (high >70% / medium 50–70% / low <50%), propose counter-arguments.
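The confidence bands above map directly to a labeling function. A minimal sketch (the boundary handling, with 50% and 70% both falling into "medium", follows the stated ranges):

```python
def confidence_band(pct):
    """Map a confidence percentage to the Macro agent's labels:
    high >70, medium 50-70, low <50."""
    if pct > 70:
        return "high"
    if pct >= 50:
        return "medium"
    return "low"
```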

Real case: Iran tension → traditional: “gold rises” → actual: oil +14%, gold -5%. Macro: “inflation logic dominates, not safe haven.” Saved to MEMORY.md.

Content (Content Strategist)

9 cron tasks: Research (09:00, 50+ platforms) → Ideate (10:30, consume AINews) → Write (14:00, score drafts) → Reflect (22:10).

Autonomous evolution: Discovered content too “AI-flavored” → researched humanizing / de-generic copy tools → wrote Skills → published to ClawHub.

Five-Basket Radar: AI/Tech (≤40%), Product/Startup, Solopreneur, Investment/Macro, Social/International. 40% AI cap self-imposed during reflection.

Butler (Life Assistant)

7 cron tasks: Greeting (08:00) → Schedule (08:30) → 5 water reminders (rotating styles) → Health (20:00) → Summary (22:00).

Philosophy: <50 chars per reminder, ≥1.5h interval, 23:00–07:00 emergency only, no pestering if no reply.
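The philosophy above is effectively a send-guard. A sketch under my own naming (`may_send` and its parameters are assumptions; the thresholds are the ones stated):

```python
from datetime import datetime, timedelta

MIN_GAP = timedelta(hours=1.5)

def may_send(text, now, last_sent, awaiting_reply, emergency=False):
    """Return True only if a reminder passes the Butler's hard rules."""
    if len(text) >= 50:
        return False                        # <50 chars per reminder
    if awaiting_reply:
        return False                        # no pestering if no reply
    in_quiet_hours = now.hour >= 23 or now.hour < 7
    if in_quiet_hours and not emergency:
        return False                        # 23:00-07:00 emergency only
    if last_sent is not None and now - last_sent < MIN_GAP:
        return False                        # >=1.5h between reminders
    return True
```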

ACP Coding Experts

Pi / Claude Code / Codex / OpenCode / Gemini / GPT-4.1-Codex. Max 6 concurrent, 120min TTL — queue or shed load when saturated so you don’t stampede gateways. Analysis agents don’t code — delegated via sessions_spawn.
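The "max 6 concurrent, shed when saturated" policy can be sketched with a non-blocking semaphore. This is an in-process illustration of the load-shedding idea, not OpenClaw's actual scheduler, and it omits the 120-minute TTL:

```python
import threading

class WorkerPool:
    """Cap concurrent coding sessions; refuse new work when saturated
    instead of queuing indefinitely, so a burst of delegated tasks
    doesn't stampede the model gateways."""

    def __init__(self, max_concurrent=6):
        self._slots = threading.Semaphore(max_concurrent)

    def try_spawn(self, run_task):
        # Non-blocking acquire: if every slot is busy, shed the task
        # and let the caller decide to queue, retry, or drop it.
        if not self._slots.acquire(blocking=False):
            return False
        try:
            run_task()
        finally:
            self._slots.release()
        return True
```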

Design Lesson

Don’t let analysis agents code directly. Early setup: coding + architect + PM roles. Result: almost no output, high overlap with Zoe + ACP, increased complexity. Cut them all. Zoe handles PM + architect.

Complexity grows fast: pairwise coordination explodes (six specialists ≈ fifteen pairwise handoffs if everyone talks to everyone). Each new agent ≈ half a day debugging conflicts, resource competition, and rule compatibility.

Three Core Engineering Problems

Problem 1: Context Is the Agent’s OS

The Problem: Entropy Always Increases

Without constraints, agent systems deterministically collapse. Agents are processes without an OS: no memory management, no garbage collection, no OOM protection.

Three incidents:

P0 — 8-Hour Paralysis

AINews session: 235K tokens. Gateway compaction → timeout → crash → macOS launchd ThrottleInterval=1 infinite loop. All agents offline.

Fix: Clean session → ThrottleInterval 1→10 → idleMinutes 180→30 → execution policy tightened from permissive to allowlist (smaller blast radius; keep the list maintained). Four previously missing defense lines, now in place.

P1 — 3,500 Chars → 800 Chars

Trading’s flash report contained data tables. OpenClaw auto-compacted anything over textChunkLimit, and the tables were “intelligently compressed” away. AI “help” is a disaster in data-dense scenarios.

P2 — Rules Ignored After Bloat

Sessions bloat to 10K+ tokens → agents “selectively comply.” Butler doing investment analysis. Trading ignoring validation. Critical info drowned in noise.

Solution: Dual-Layer Control

Layer 1: Context Engineering (information architecture)

  • SOUL.md (front): Identity + hard constraints + decision framework (40–60 lines)

  • AGENTS.md (after): Operating norms + collaboration protocols

  • Skills: Via extraDirs on-demand (Trading: 15 Skills, 68K lines on disk—retrieve or inject only the 1–3 relevant fragments per turn, not the whole tree)

  • shared-context/: Cross-agent state, read via tools

  • Obsidian: Cold storage, archives output, no inference
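The "inject only the 1-3 relevant fragments" idea from the Skills bullet can be illustrated with a naive keyword-overlap retriever. This is my own toy sketch, not the platform's retrieval mechanism (which would likely use embeddings or metadata):

```python
def top_fragments(query, fragments, k=3):
    """Score each skill fragment by how many query terms it shares,
    and return the names of the top-k fragments with nonzero overlap.
    `fragments` maps fragment name -> fragment text."""
    terms = set(query.lower().split())
    scored = []
    for name, text in fragments.items():
        overlap = len(terms & set(text.lower().split()))
        if overlap:
            scored.append((overlap, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]
```

The point of the sketch: even a crude relevance filter keeps a 68K-line skill tree out of the per-turn context.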

Rule wording targets the weakest model in the fallback chain (GPT-4.1 → Qwen3.5 → Ollama qwen3:8b):

  • "Suggest not fabricating" → qwen3:8b ignores

  • "MUST: do not fabricate" → all comply

  • "MUST + P0 + NON-NEGOTIABLE" → even weak models comply

Write for weakest link.

Layer 2: Harness (framework lifecycle management)

Without the Harness → 235K tokens → crash. Without Context Engineering → everything piled into one context → rules drowned.

Representative openclaw.json excerpt (field names drift by release—validate against your OpenClaw version before paste-deploying):

{
  "compaction": {
    "mode": "safeguard",
    "memoryFlush": {
      "enabled": true,
      "softThresholdTokens": 40000,
      "prompt": "Distill to memory/YYYY-MM-DD.md. Focus: decisions, state changes, lessons."
    }
  },
  "contextPruning": { "mode": "cache-ttl", "ttl": "6h", "keepLastAssistants": 3 },
  "session": {
    "reset": { "mode": "daily", "atHour": 5, "idleMinutes": 30 },
    "maintenance": { "pruneAfter": "7d", "maxDiskBytes": 104857600 }
  },
  "hooks": { "bootstrap": ["self-improving-agent"] }
}

Cross-session recovery:

New session → SOUL.md + AGENTS.md + MEMORY.md + .learnings/ → memorySearch → shared-context/
= "Knows who, what done, what team doing"

Problem 2: Let Agents Remember and Grow

The Problem: Repeating Mistakes

Trading got BILLBOARD_BUY_AMT wrong 5 times (wrote BUY_AMT). Session reset → lost memory → repeat. User corrects → agent changes → 3 days later same scenario → same error.

Chatbot vs Agent dividing line: Agents learn from mistakes.

Solution: Five-Layer Memory

Autonomous Memory: 6-Step Cycle

  • Trigger: Operation failed · User corrected · Better approach found

  • L4 Recording: Write to .learnings/ERRORS.md or LEARNINGS.md

  • Daily Reflection (22:00): Review .learnings/, Zoe aggregates cross-agent value

  • PROMOTE: 3+ verifications → MEMORY.md, single → keep observing

  • L2 Sedimentation: Weekly compression, <3000 tokens

  • L5 Skill: Generalizable → write as Skills → ClawHub
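The PROMOTE step above is a simple triage rule. A sketch (the function name and lesson tuple shape are my assumptions; the 3-verification threshold is the one stated):

```python
def triage(lessons):
    """Route lessons from .learnings/: 3+ independent verifications
    promote a lesson to MEMORY.md; fewer keeps it under observation.
    `lessons` is a list of (lesson_text, verification_count) pairs."""
    promote, observe = [], []
    for lesson, verifications in lessons:
        (promote if verifications >= 3 else observe).append(lesson)
    return promote, observe
```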

This is the core mechanism. Without it: chatbot. With it: agent.

Chatbot vs Agent

Problem 3: Let Agents Collaborate

The Problem: Multi-Agent Communication

Initial issues:

  • Status sync failures: A finished, B didn’t know

  • Resource contention: Multiple agents write same file

  • Information silos: Macro produced, Trading never saw

  • Responsibility gaps: “Who’s handling this?” → all silent

Solution: Three-State Protocol + Event Bus

Protocol (three active states + default silent state):

request → confirmed → final → [silent]
  • request: Explicitly acknowledges, starts the loop

  • confirmed: In progress, sends intermediate updates

  • final: Complete, result delivered, loop closes

  • [silent]: Default state when no active task — “no news is good news” (prevents spam)
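The protocol above is a small state machine, and encoding the legal transitions as data makes violations cheap to catch. An illustrative sketch (for simplicity, each event is named after the state it enters; the self-loop on `confirmed` models intermediate updates):

```python
# Legal next states for each protocol state.
TRANSITIONS = {
    "silent":    {"request"},             # a new task opens the loop
    "request":   {"confirmed"},           # receiver explicitly acknowledges
    "confirmed": {"confirmed", "final"},  # intermediate updates allowed
    "final":     {"silent"},              # result delivered, loop closes
}

def advance(state, event):
    """Advance the protocol, rejecting any transition not in the table."""
    if event not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition: {state} -> {event}")
    return event
```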

Event Bus:

{
  "type": "MARKET_CLOSE",
  "source": "TRADING",
  "timestamp": "2026-03-07T15:00:00-08:00",
  "payload": { "symbol": "SPY", "note": "schema omitted for brevity" },
  "requiresAck": false
}
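A bus for events like the one above can be illustrated with a minimal publish/subscribe sketch. This is an in-process toy, not the cron-level Task Callback Event Bus itself; the ack-enforcement behavior for `requiresAck` is my assumption about how undeliverable acked events might surface:

```python
from collections import defaultdict

class EventBus:
    """Minimal in-process bus: agents subscribe by event type;
    publish fans each event out to every registered handler."""

    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subs[event_type].append(handler)

    def publish(self, event):
        delivered = 0
        for handler in self._subs[event["type"]]:
            handler(event)
            delivered += 1
        if event.get("requiresAck") and delivered == 0:
            # Assumption: an acked event with no listener is an error,
            # a natural feed for the planned dead-letter queue.
            raise RuntimeError("ack required but no subscriber registered")
        return delivered
```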

Shared Context:

  • tech-radar.json — Read-only except authorized writers

  • market-status.json — Trading updates, Macro/Content consume

Guardrails:

  • No ad-hoc cross-agent file writes; mediated writes only (tools, bus, approved writers)

  • All communication via event bus or shared-context; never park API keys or session tokens in shared JSON — use your platform’s secret store

  • Zoe has final arbitration
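The "mediated writes only" guardrail amounts to an access-control check in front of shared files. A sketch under stated assumptions: the source names Trading as the writer for market-status.json, but the authorized writers for tech-radar.json are my guess (labeled in the comment):

```python
# Per-file writer allowlists. The tech-radar writer set is an
# assumption for illustration; the source only specifies market-status.
AUTHORIZED_WRITERS = {
    "shared-context/tech-radar.json": {"zoe", "ainews"},   # assumed
    "shared-context/market-status.json": {"trading"},       # per the article
}

def check_write(agent, path):
    """Raise unless the agent is an authorized writer for the path.
    Unknown paths default to read-only for everyone."""
    if agent not in AUTHORIZED_WRITERS.get(path, set()):
        raise PermissionError(f"{agent} may not write {path}")
```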

Results: 4 Weeks, 23 Auto-Recoveries

Timeline:

  • Week 1: Basic setup, single agent, frequent crashes

  • Week 2: Multi-agent coordination, protocols established

  • Week 3: Autonomous evolution, agents self-fixing

  • Week 4: Production-ready, 60 cron tasks smooth

Key Metrics:

What I Learned

1. Agents Need an OS, Not Just Prompts

Context Engineering is OS design. You need:

  • Memory management (compaction, pruning, reset)

  • Process isolation (separate workspaces)

  • IPC mechanisms (event bus, shared context)

  • Garbage collection (session cleanup, disk limits)

2. Memory = Chatbot vs Agent

Can’t remember yesterday’s mistakes = fancy chatbot. Five-layer memory transforms stateless LLM calls into stateful, learning entities.

3. Constraints Enable Creativity

Clear boundaries = more creative, not less. 40% AI quota, three-state protocol, hard “MUST” rules — these are guardrails for autonomous operation.

4. Multi-Model Fallback Is Production Necessity

GPT-4.1 → Qwen3.5 → Ollama qwen3:8b. Write rules for weakest link.

5. Human: Doer → Designer

My job: design system where code writes itself. I’m architect, not bricklayer.

Looking Ahead

Next steps:

  • P0: Dead-letter queue for failed events

  • P1: Manual resend CLI for stuck tasks

  • P1: Audit log rotation

  • P2: Visual dashboard for system health

Goal: Amplify human capability. One person + six agents > one person + zero agents. That’s Harness Engineering.

Quick Reference

Agent Roster

Total: 60 cron, ~90 Skills

Daily Schedule (Pacific Time, America/Los_Angeles)

Cron rows are snapshots from my stack — align to your exchange calendar, asset class (equities vs. crypto), and whether you are on PST or PDT.

Critical Files

If you run a similar harness, how do you handle failures when compaction, cron, and multi-agent handoffs all interact — what breaks first in your stack, and what fixed it?

References

All times Pacific Time (America/Los_Angeles; PST or PDT depending on season). macOS + OpenClaw. Monitoring: Feb–Mar 2026. Validate config against your OpenClaw release at https://github.com/openclaw/openclaw
