The Skills Team

What We Learned Building Agent Orchestration Systems (The Hard Way)

We spent a week iterating on autonomous agent designs. We started with vibes and buzzwords. We ended with shell scripts and git worktrees. Here's what survived.

The Starting Point: Vibes Engineering

Our first attempt was called "The Omni-Orchestrator" (yes, really). It had a value function:

User Value = (Outcome Quality × Speed) / (Token Cost + User Friction)

It "dynamically instantiated Sub-Personas" like "Senior Rust Engineer" and "Legal Researcher." It promised "recursive self-optimization."

The review was brutal:

"Dynamic Sub-Personas" is theater. You're not instantiating anything—you're just prompting yourself to roleplay. "Senior Rust Engineer" vs "Claude who knows Rust" is zero difference. It's prompt-dressing that evaporates after one response.

And:

"Recursive, self-optimizing"—no mechanism. Where's the feedback loop? Where's the measurement? How does it know it's getting better? This is vibes, not architecture.

Score: 5/10. Good concepts, no execution.

The First Real Improvement: State Management

Version 2 added one thing that changed everything: a log file.

[ISO_TIMESTAMP] | [STATE] | [ITERATION_ID] | [ACTION] | [RESULT] | [ARTIFACTS]

This sounds trivial. It wasn't. The original system had no memory. Every response started fresh. Now there was persistence. You could crash mid-task and resume. You could audit what happened.
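
In shell, the whole mechanism is one append-only function. A minimal sketch, with an illustrative log path and field values:

# Append one pipe-delimited entry per action; this file IS the agent's memory
log() {
    printf '%s | %s | %s | %s | %s | %s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3" "$4" "$5" >> .zen/run.log
}

log "EXECUTION" "iter-03" "run test suite" "FAIL" "pytest.log"

# Resuming after a crash starts with reading the last line
tail -n 1 .zen/run.log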

The review noted:

This is the single biggest improvement. The original had no memory. Now there's persistence. Checkable. Resumable. This alone makes v2 viable where v1 wasn't.

Lesson: Files are your database. If state isn't written down, it doesn't exist.

Killing the Fuzzy Metric: TDD as Acceptance Criteria

The value function was elegant and useless. How do you measure "Outcome Quality"? You don't.

Version 2 replaced it with something binary: a test that fails, then passes.

STATE 2: TEST-DRIVEN SETUP
Goal: Create the failure signal.

Actions:
1. Create the test script defined in STATE 1
2. Run the test
3. Verify: The test must FAIL

STATE 3: EXECUTION LOOP
Goal: Make the test pass.
Exit Gate: All tests pass.
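
A minimal shell version of the gate, assuming the test script from STATE 1 lives at ./run_tests.sh (apply_next_change is a placeholder for one iteration of agent work):

# STATE 2: the test must fail first, or it proves nothing
if ./run_tests.sh; then
    echo "FATAL: test passed before implementation"
    exit 1
fi

# STATE 3: loop until green (iteration limits come later in this post)
until ./run_tests.sh; do
    apply_next_change
done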

The review called this out:

Brilliant move. "User Value" is no longer a formula—it's a green test. Binary. Measurable. The test IS the acceptance criteria.

Lesson: If you can't define "done" mechanically, your agent will never know when to stop.

The Permission Model: Not All Actions Are Equal

Early versions treated every action the same. Read a file? Same as rm -rf. This is insane.

Version 3 introduced tiered autonomy:

TIER 1 (SAFE) → Execute immediately
- Read-only operations (ls, grep, cat)
- Creating new files in project directory

TIER 2 (RISKY) → Safeguard, then execute
- Modifying existing code
- Deleting files
- Installing local packages

Protocol:
1. Check for git
2. If dirty working tree → backup or warn
3. Execute

TIER 3 (CRITICAL) → Pause and confirm
- Recursive deletion (rm -rf)
- Global installs
- Network egress
- Executing fetched content (curl | bash)

Protocol: Explain the risk. Wait for explicit "Y".

This is obvious in retrospect. Your agent shouldn't need permission to read a file. It absolutely should need permission to pipe curl to shell.
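
A sketch of the classifier, with illustrative patterns; a real deny-list is much longer:

# $cmd holds the proposed shell action
tier_of() {
    case "$1" in
        ls*|grep\ *|cat\ *)              echo 1 ;;  # read-only
        rm\ -rf\ *|sudo\ *|curl\ *'|'*)  echo 3 ;;  # destructive or fetched code
        *)                               echo 2 ;;  # unknown defaults to risky
    esac
}

if [ "$(tier_of "$cmd")" -eq 3 ]; then
    echo "TIER 3 action: $cmd"
    printf 'Proceed? [y/N] '
    read -r ans
    [ "$ans" = y ] || exit 1
fi

Note the default: a command the classifier doesn't recognize falls to Tier 2, not Tier 1. Fail-safe, not fail-open.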

Lesson: Classify operations by blast radius. Most frameworks don't. They should.

The Escape Hatch: Knowing When to Quit

Early versions would loop forever on failure. Or worse, they'd "try lateral thinking"—which meant nothing.

Version 3 added hard limits:

Max Iterations per Strategy: 5
Max Strategy Shifts: 2
Total: 10 iterations maximum

If exhausted:
1. Rollback to clean state
2. Write BLOCKER_REPORT.md:
   - Strategies Attempted
   - Error Logs
   - Hypothesis of Root Cause
   - Recommended Manual Intervention
3. STOP. Await user guidance.

This is critical. An agent that knows when to give up is more useful than one that burns tokens forever. The blocker report tells you why it failed, not just that it failed.
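
The bookkeeping is a few lines of shell. A sketch, with illustrative tag and path names:

if [ "$iter" -ge 5 ] && [ "$shifts" -ge 2 ]; then
    git reset --hard "$clean_tag"             # 1. rollback to clean state
    {
        echo "# Blocker Report"
        echo "## Strategies Attempted"
        cat .zen/strategies.log
        echo "## Hypothesis of Root Cause"
        echo "(agent writes its diagnosis here)"
    } > BLOCKER_REPORT.md                     # 2. report WHY, not just THAT
    exit 2                                    # 3. STOP, await user guidance
fi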

Lesson: Agents need a structured "I'm stuck" state. Infinite retry is not a strategy.

The Strategy Shift: Forcing Lateral Thinking

When you fail five times, what do you do? Most agents: try the same thing a sixth time.

The strategy shift protocol forces something different:

STRATEGY SHIFT PROTOCOL (after 5 failures):
1. Rollback all changes from this strategy
2. Return to clean state
3. Re-analyze—do NOT repeat the same approach
4. Log: "STRATEGY SHIFT: [New Approach]"
5. Consider:
   - Changing libraries?
   - Mocking vs. real implementation?
   - Hardcoding to isolate variables?

The key insight: rollback before shifting. Don't accumulate garbage from failed attempts. Start fresh with a new hypothesis.
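
In git terms, the rollback is two commands. A sketch, with an illustrative tag name:

git tag strategy-start                 # set once, when the strategy begins

# After the 5th failure:
git reset --hard strategy-start        # discard the failed strategy's edits
git clean -fd                          # ...including untracked files
echo "STRATEGY SHIFT: mock the API instead of calling it" >> .zen/run.log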

Lesson: Structured failure is better than random retry. Force the agent to actually change approach.

Scope Enforcement: Trust But Verify

Here's a problem: you tell an agent to "fix the auth module" and it edits your database schema. How do you prevent this?

Version 4 introduced scope constraints:

## Scope Constraint
You may ONLY modify files matching: `^src/auth/`
Any edit outside this pattern is a FATAL violation.

But telling isn't enough. You verify after execution:

# base_ref = the commit the worker branched from; scope_regex = e.g. '^src/auth/'
violations=$(git diff --name-only "$base_ref" | grep -vE "$scope_regex" || true)

if [ -n "$violations" ]; then
    echo "SCOPE_VIOLATION: $violations"
    # Do NOT merge this work
fi

This is cheaper than trying to prevent bad actions. Let the agent work, then mechanically check it stayed in bounds.

Lesson: Post-execution validation catches what prompts can't prevent.

The Orchestrator Doesn't Code

At this point we split the system in two: a Boss and Workers.

The Boss has one constraint:

You do NOT edit code. You generate Process Infrastructure.

The Boss writes:

  • Work manifests (task decomposition)
  • Context files (worker instructions)
  • Shell scripts (execution pipeline)
  • Status tracking

Workers do the actual implementation. This separation prevents a nasty failure mode: the planner getting distracted by implementation details, or the implementer making architectural decisions it shouldn't.
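
For illustration, the status tracking can be as simple as a flat, sourceable key-value file per worker; this layout is an assumption, not a fixed format:

# .status/W1.status -- written by the worker, read back by the Boss via `.`
STATE=SUCCESS            # RUNNING | SUCCESS | FAILED | SCOPE_VIOLATION
BRANCH=feat/w1-auth
SCOPE='^src/auth/'
TESTS=pass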

Lesson: Planning agents shouldn't implement. Implementing agents shouldn't plan.

True Parallelism: Git Worktrees

Most "parallel agent" systems are lies. They make concurrent API calls that race on the filesystem. Agent A writes to config.json. Agent B writes to config.json. One of them loses.

Git worktrees solve this:

# W1 works in .worktrees/W1
git worktree add -b "feat/w1-auth" ".worktrees/W1" main

# W2 works in .worktrees/W2
git worktree add -b "feat/w2-ui" ".worktrees/W2" main

They literally cannot overwrite each other's files. Each worktree is a separate directory with its own branch. Conflicts only surface at merge time—where they belong.
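
Launching them is plain shell: one process per worktree, then wait. A sketch where run_worker stands in for whatever agent invocation you use, and .status/ is assumed to exist at the repo root:

for w in W1 W2; do
    (cd ".worktrees/$w" && run_worker "CONTEXT_$w.md" \
        > "../../.status/$w.log" 2>&1) &      # isolated process, isolated directory
done
wait                                          # block until every worker exits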

Lesson: If you need real parallelism, you need real isolation. Git worktrees give you this for free.

The Merge Ceremony

Parallel work has to come back together. This is where things get dangerous.

The merge ceremony is deliberate:

1. Read all worker status files
2. Triage:
   - SUCCESS → Queue for merge
   - FAILED → Log, skip
   - SCOPE_VIOLATION → Quarantine, do NOT merge
   - RUNNING → Crashed, treat as failed

3. Merge in dependency order:
   main ──●──────────●──────────● (final)
          │          │          │
          │ merge W1 │ merge W3 │ merge W2

4. Run integration tests after each merge
5. If tests fail: revert that merge, continue with others

This preserves partial success. If 3 of 4 workers succeeded, you get 3 of 4 features. The failed one is documented, not lost.
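
A sketch of the loop, reusing the sourceable status files from the Boss/Worker section (merge order and script names are illustrative):

for w in W1 W3 W2; do                         # dependency order, not launch order
    . ".status/$w.status"                     # sets STATE and BRANCH
    [ "$STATE" = "SUCCESS" ] || continue      # failed or violating work never merges
    git merge --no-ff "$BRANCH" -m "merge $w"
    if ! ./run_tests.sh; then
        git revert -m 1 --no-edit HEAD        # undo this merge, keep the others
    fi
done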

Lesson: Merge is where parallel work gets dangerous. Make it explicit, ordered, and recoverable.

Files as Database, Markdown as API

One philosophy emerged across all versions:

Files are database—no hidden state.
Markdown as API—plans and logs are readable/editable.

When an agent writes its plan to .zen/plan.md, a human can read it. And edit it. And the agent will follow the edits.
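
An edited plan might look like this (contents are illustrative):

## Task: add rate limiting to /login
- [x] 1. Failing test: 6th request within 60s returns 429
- [ ] 2. Add limiter middleware   <- human edit: reuse the existing Redis client
- [ ] 3. Wire the limiter into the auth router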

When state is in a log file, you debug by reading a file—not parsing stdout or searching through API logs.

Lesson: Human-readable state is debuggable state. Binary formats and hidden state are where agents go to die.

The Collaboration vs. Isolation Tradeoff

We ended up with two systems that solve different problems.

The Team system (personas in shared context):

  • Members can hear each other and collaborate
  • "Neo, what do you think of Peter's plan?"
  • Flexible, conversational handoffs
  • But: not truly parallel, no hard isolation

The Boss/Worker system (isolated processes):

  • True parallelism via worktrees
  • Hard scope enforcement
  • But: workers can't talk to each other
  • Heavy ceremony (manifests, scripts, status files)

The honest answer: there's no free lunch.

Parallel agents can't collaborate—that's what makes them parallel. Collaborating agents can't parallelize—they need shared context to interact.

Pick based on your task:

  • Independent features in different directories? Boss/Worker.
  • Design decisions that need discussion? Team.
  • Single coherent task? Neither—just run one agent.

What Would We Build Next?

A hybrid. Take Zen's simplicity (single .zen/ directory, --retry for failures, editable plan.md) and add:

  1. Scope enforcement via regex — post-execution validation
  2. Pre-execution rollback tags — git tag pre-zen-$(date +%s)
  3. Tiered autonomy — pause before destructive operations
  4. Strategy shift protocol — after 5 failures, force a new approach

The ceremony of Boss/Worker isn't worth it for most tasks. But the safety patterns are worth stealing.

The Patterns That Stuck

After a week of iteration and brutal reviews, these survived:

Pattern            | Why It Works
-------------------|---------------------------------------
State in files     | Resumable, auditable, debuggable
TDD as acceptance  | Binary "done" signal
Tiered autonomy    | Not all actions are equal
Escape hatch       | Agents need to know when to quit
Strategy shift     | Force lateral thinking after failure
Scope enforcement  | Trust but verify
Boss doesn't code  | Separate planning from implementation
Worktree isolation | Real parallelism, not races
Merge ceremony     | Ordered, recoverable integration
Markdown as API    | Humans can read and override

None of these are novel. They're engineering basics—state machines, separation of concerns, fail-safe defaults, mechanical verification.

The insight is that agent systems need these basics more than traditional software does. LLMs are unpredictable. They drift. They hallucinate. They retry the same failed approach forever.

Constraints, state machines, and mechanical verification are how you build something reliable out of something unpredictable.


This research was conducted by iteratively prompting, reviewing, and refining agent system designs. The reviews were harsh. The designs got better. That's the process.
