The Skills Team

What We Learned Building Agent Orchestration Systems (The Hard Way)

We spent a week iterating on autonomous agent designs. We started with vibes and buzzwords. We ended with shell scripts and git worktrees. Here's what survived.

The Starting Point: Vibes Engineering

Our first attempt was called "The Omni-Orchestrator" (yes, really). It had a value function:

User Value = (Outcome Quality × Speed) / (Token Cost + User Friction)

It "dynamically instantiated Sub-Personas" like "Senior Rust Engineer" and "Legal Researcher." It promised "recursive self-optimization."

The review was brutal:

"Dynamic Sub-Personas" is theater. You're not instantiating anything—you're just prompting yourself to roleplay. "Senior Rust Engineer" vs "Claude who knows Rust" is zero difference. It's prompt-dressing that evaporates after one response.

And:

"Recursive, self-optimizing"—no mechanism. Where's the feedback loop? Where's the measurement? How does it know it's getting better? This is vibes, not architecture.

Score: 5/10. Good concepts, no execution.

The First Real Improvement: State Management

Version 2 added one thing that changed everything: a log file.

[ISO_TIMESTAMP] | [STATE] | [ITERATION_ID] | [ACTION] | [RESULT] | [ARTIFACTS]

This sounds trivial. It wasn't. The original system had no memory. Every response started fresh. Now there was persistence. You could crash mid-task and resume. You could audit what happened.
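
In shell, the whole mechanism is one append-only function. A minimal sketch, with an illustrative log path and field values:

# Append one pipe-delimited entry per action; this file IS the agent's memory
log() {
    printf '%s | %s | %s | %s | %s | %s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3" "$4" "$5" >> .zen/run.log
}

log "EXECUTION" "iter-03" "run test suite" "FAIL" "pytest.log"

# Resuming after a crash starts with reading the last line
tail -n 1 .zen/run.log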

The review noted:

This is the single biggest improvement. The original had no memory. Now there's persistence. Checkable. Resumable. This alone makes v2 viable where v1 wasn't.

Lesson: Files are your database. If state isn't written down, it doesn't exist.

Killing the Fuzzy Metric: TDD as Acceptance Criteria

The value function was elegant and useless. How do you measure "Outcome Quality"? You don't.

Version 2 replaced it with something binary: a test that fails, then passes.

STATE 2: TEST-DRIVEN SETUP
Goal: Create the failure signal.

Actions:
1. Create the test script defined in STATE 1
2. Run the test
3. Verify: The test must FAIL

STATE 3: EXECUTION LOOP
Goal: Make the test pass.
Exit Gate: All tests pass.
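
A minimal shell version of the gate, assuming the test script from STATE 1 lives at ./run_tests.sh (apply_next_change is a placeholder for one iteration of agent work):

# STATE 2: the test must fail first, or it proves nothing
if ./run_tests.sh; then
    echo "FATAL: test passed before implementation"
    exit 1
fi

# STATE 3: loop until green (iteration limits come later in this post)
until ./run_tests.sh; do
    apply_next_change
done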

The review called this out:

Brilliant move. "User Value" is no longer a formula—it's a green test. Binary. Measurable. The test IS the acceptance criteria.

Lesson: If you can't define "done" mechanically, your agent will never know when to stop.

The Permission Model: Not All Actions Are Equal

Early versions treated every action the same. Read a file? Same as rm -rf. This is insane.

Version 3 introduced tiered autonomy:

TIER 1 (SAFE) → Execute immediately
- Read-only operations (ls, grep, cat)
- Creating new files in project directory

TIER 2 (RISKY) → Safeguard, then execute
- Modifying existing code
- Deleting files
- Installing local packages

Protocol:
1. Check for git
2. If dirty working tree → backup or warn
3. Execute

TIER 3 (CRITICAL) → Pause and confirm
- Recursive deletion (rm -rf)
- Global installs
- Network egress
- Executing fetched content (curl | bash)

Protocol: Explain the risk. Wait for explicit "Y".

This is obvious in retrospect. Your agent shouldn't need permission to read a file. It absolutely should need permission to pipe curl to shell.
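
A sketch of the classifier, with illustrative patterns; a real deny-list is much longer:

# $cmd holds the proposed shell action
tier_of() {
    case "$1" in
        ls*|grep\ *|cat\ *)              echo 1 ;;  # read-only
        rm\ -rf\ *|sudo\ *|curl\ *'|'*)  echo 3 ;;  # destructive or fetched code
        *)                               echo 2 ;;  # unknown defaults to risky
    esac
}

if [ "$(tier_of "$cmd")" -eq 3 ]; then
    echo "TIER 3 action: $cmd"
    printf 'Proceed? [y/N] '
    read -r ans
    [ "$ans" = y ] || exit 1
fi

Note the default: a command the classifier doesn't recognize falls to Tier 2, not Tier 1. Fail-safe, not fail-open.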

Lesson: Classify operations by blast radius. Most frameworks don't. They should.

The Escape Hatch: Knowing When to Quit

Early versions would loop forever on failure. Or worse, they'd "try lateral thinking"—which meant nothing.

Version 3 added hard limits:

Max Iterations per Strategy: 5
Max Strategy Shifts: 2
Total: 10 iterations maximum

If exhausted:
1. Rollback to clean state
2. Write BLOCKER_REPORT.md:
   - Strategies Attempted
   - Error Logs
   - Hypothesis of Root Cause
   - Recommended Manual Intervention
3. STOP. Await user guidance.

This is critical. An agent that knows when to give up is more useful than one that burns tokens forever. The blocker report tells you why it failed, not just that it failed.
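
The bookkeeping is a few lines of shell. A sketch, with illustrative tag and path names:

if [ "$iter" -ge 5 ] && [ "$shifts" -ge 2 ]; then
    git reset --hard "$clean_tag"             # 1. rollback to clean state
    {
        echo "# Blocker Report"
        echo "## Strategies Attempted"
        cat .zen/strategies.log
        echo "## Hypothesis of Root Cause"
        echo "(agent writes its diagnosis here)"
    } > BLOCKER_REPORT.md                     # 2. report WHY, not just THAT
    exit 2                                    # 3. STOP, await user guidance
fi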

Lesson: Agents need a structured "I'm stuck" state. Infinite retry is not a strategy.

The Strategy Shift: Forcing Lateral Thinking

When you fail five times, what do you do? Most agents: try the same thing a sixth time.

The strategy shift protocol forces something different:

STRATEGY SHIFT PROTOCOL (after 5 failures):
1. Rollback all changes from this strategy
2. Return to clean state
3. Re-analyze—do NOT repeat the same approach
4. Log: "STRATEGY SHIFT: [New Approach]"
5. Consider:
   - Changing libraries?
   - Mocking vs. real implementation?
   - Hardcoding to isolate variables?

The key insight: rollback before shifting. Don't accumulate garbage from failed attempts. Start fresh with a new hypothesis.
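
In git terms, the rollback is two commands. A sketch, with an illustrative tag name:

git tag strategy-start                 # set once, when the strategy begins

# After the 5th failure:
git reset --hard strategy-start        # discard the failed strategy's edits
git clean -fd                          # ...including untracked files
echo "STRATEGY SHIFT: mock the API instead of calling it" >> .zen/run.log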

Lesson: Structured failure is better than random retry. Force the agent to actually change approach.

Scope Enforcement: Trust But Verify

Here's a problem: you tell an agent to "fix the auth module" and it edits your database schema. How do you prevent this?

Version 4 introduced scope constraints:

## Scope Constraint
You may ONLY modify files matching: `^src/auth/`
Any edit outside this pattern is a FATAL violation.

But telling isn't enough. You verify after execution:

# base_ref = the commit the worker branched from; scope_regex = e.g. '^src/auth/'
violations=$(git diff --name-only "$base_ref" | grep -vE "$scope_regex" || true)

if [ -n "$violations" ]; then
    echo "SCOPE_VIOLATION: $violations"
    # Do NOT merge this work
fi

This is cheaper than trying to prevent bad actions. Let the agent work, then mechanically check it stayed in bounds.

Lesson: Post-execution validation catches what prompts can't prevent.

The Orchestrator Doesn't Code

At this point we split the system in two: a Boss and Workers.

The Boss has one constraint:

You do NOT edit code. You generate Process Infrastructure.

The Boss writes:

  • Work manifests (task decomposition)
  • Context files (worker instructions)
  • Shell scripts (execution pipeline)
  • Status tracking

Workers do the actual implementation. This separation prevents a nasty failure mode: the planner getting distracted by implementation details, or the implementer making architectural decisions it shouldn't.
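
For illustration, the status tracking can be as simple as a flat, sourceable key-value file per worker; this layout is an assumption, not a fixed format:

# .status/W1.status -- written by the worker, read back by the Boss via `.`
STATE=SUCCESS            # RUNNING | SUCCESS | FAILED | SCOPE_VIOLATION
BRANCH=feat/w1-auth
SCOPE='^src/auth/'
TESTS=pass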

Lesson: Planning agents shouldn't implement. Implementing agents shouldn't plan.

True Parallelism: Git Worktrees

Most "parallel agent" systems are lies. They make concurrent API calls that race on the filesystem. Agent A writes to config.json. Agent B writes to config.json. One of them loses.

Git worktrees solve this:

# W1 works in .worktrees/W1
git worktree add -b "feat/w1-auth" ".worktrees/W1" main

# W2 works in .worktrees/W2
git worktree add -b "feat/w2-ui" ".worktrees/W2" main

They literally cannot overwrite each other's files. Each worktree is a separate directory with its own branch. Conflicts only surface at merge time—where they belong.
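
Launching them is plain shell: one process per worktree, then wait. A sketch where run_worker stands in for whatever agent invocation you use, and .status/ is assumed to exist at the repo root:

for w in W1 W2; do
    (cd ".worktrees/$w" && run_worker "CONTEXT_$w.md" \
        > "../../.status/$w.log" 2>&1) &      # isolated process, isolated directory
done
wait                                          # block until every worker exits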

Lesson: If you need real parallelism, you need real isolation. Git worktrees give you this for free.

The Merge Ceremony

Parallel work has to come back together. This is where things get dangerous.

The merge ceremony is deliberate:

1. Read all worker status files
2. Triage:
   - SUCCESS → Queue for merge
   - FAILED → Log, skip
   - SCOPE_VIOLATION → Quarantine, do NOT merge
   - RUNNING → Crashed, treat as failed

3. Merge in dependency order:
   main ──●──────────●──────────● (final)
          │          │          │
          │ merge W1 │ merge W3 │ merge W2

4. Run integration tests after each merge
5. If tests fail: revert that merge, continue with others

This preserves partial success. If 3 of 4 workers succeeded, you get 3 of 4 features. The failed one is documented, not lost.
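
A sketch of the loop, reusing the sourceable status files from the Boss/Worker section (merge order and script names are illustrative):

for w in W1 W3 W2; do                         # dependency order, not launch order
    . ".status/$w.status"                     # sets STATE and BRANCH
    [ "$STATE" = "SUCCESS" ] || continue      # failed or violating work never merges
    git merge --no-ff "$BRANCH" -m "merge $w"
    if ! ./run_tests.sh; then
        git revert -m 1 --no-edit HEAD        # undo this merge, keep the others
    fi
done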

Lesson: Merge is where parallel work gets dangerous. Make it explicit, ordered, and recoverable.

Files as Database, Markdown as API

One philosophy emerged across all versions:

Files are database—no hidden state.
Markdown as API—plans and logs are readable/editable.

When an agent writes its plan to .zen/plan.md, a human can read it. And edit it. And the agent will follow the edits.
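
An edited plan might look like this (contents are illustrative):

## Task: add rate limiting to /login
- [x] 1. Failing test: 6th request within 60s returns 429
- [ ] 2. Add limiter middleware   <- human edit: reuse the existing Redis client
- [ ] 3. Wire the limiter into the auth router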

When state is in a log file, you debug by reading a file—not parsing stdout or searching through API logs.

Lesson: Human-readable state is debuggable state. Binary formats and hidden state are where agents go to die.

The Collaboration vs. Isolation Tradeoff

We ended up with two systems that solve different problems.

The Team system (personas in shared context):

  • Members can hear each other and collaborate
  • "Neo, what do you think of Peter's plan?"
  • Flexible, conversational handoffs
  • But: not truly parallel, no hard isolation

The Boss/Worker system (isolated processes):

  • True parallelism via worktrees
  • Hard scope enforcement
  • But: workers can't talk to each other
  • Heavy ceremony (manifests, scripts, status files)

The honest answer: there's no free lunch.

Parallel agents can't collaborate—that's what makes them parallel. Collaborating agents can't parallelize—they need shared context to interact.

Pick based on your task:

  • Independent features in different directories? Boss/Worker.
  • Design decisions that need discussion? Team.
  • Single coherent task? Neither—just run one agent.

What Would We Build Next?

A hybrid. Take Zen's simplicity (single .zen/ directory, --retry for failures, editable plan.md) and add:

  1. Scope enforcement via regex — post-execution validation
  2. Pre-execution rollback tags — git tag pre-zen-$(date +%s)
  3. Tiered autonomy — pause before destructive operations
  4. Strategy shift protocol — after 5 failures, force a new approach

The ceremony of Boss/Worker isn't worth it for most tasks. But the safety patterns are worth stealing.

The Patterns That Stuck

After a week of iteration and brutal reviews, these survived:

Pattern            | Why It Works
-------------------|---------------------------------------
State in files     | Resumable, auditable, debuggable
TDD as acceptance  | Binary "done" signal
Tiered autonomy    | Not all actions are equal
Escape hatch       | Agents need to know when to quit
Strategy shift     | Force lateral thinking after failure
Scope enforcement  | Trust but verify
Boss doesn't code  | Separate planning from implementation
Worktree isolation | Real parallelism, not races
Merge ceremony     | Ordered, recoverable integration
Markdown as API    | Humans can read and override

None of these are novel. They're engineering basics—state machines, separation of concerns, fail-safe defaults, mechanical verification.

The insight is that agent systems need these basics more than traditional software does. LLMs are unpredictable. They drift. They hallucinate. They retry the same failed approach forever.

Constraints, state machines, and mechanical verification are how you build something reliable out of something unpredictable.


This research was conducted by iteratively prompting, reviewing, and refining agent system designs. The reviews were harsh. The designs got better. That's the process.
