Aviad Rozenhek
The Reality of "Autonomous" Multi-Agent Development

What we learned when 5 "independent" AI agents needed constant human orchestration

Part 2 of the Multi-Agent Development Series


TL;DR

We set out to prove AI agents could work independently with zero coordination. The zero-conflict architecture worked perfectly (100% auto-merge). But "autonomous" agents? That was the illusion.

What actually happened:

  • ✅ Zero merge conflicts (architecture win)
  • ❌ Constant human orchestration required (autonomy myth)
  • ✅ Property tests found a 12x efficiency bug (agent + human win)
  • ❌ But only because human asked the right questions (human essential)

The real lesson: The value isn't "autonomous agents." It's well-orchestrated human-AI collaboration.


The Setup (Brief Recap)

In Part 1, we described an experiment:

  • 5 parallel AI coding agents
  • Each working on independent test improvements
  • Zero-conflict architecture (file-level ownership)
  • Hypothesis: Agents work independently, human just merges

What we expected: Spawn 5 agents, come back in 48 hours, merge everything.

What we got: 8 hours of active human orchestration disguised as "autonomous" work.


The Orchestration Reality

Manual Intervention #1: Tool Switching

The Problem:

Integration Agent (Claude Web): "I'll check the PR status using gh CLI..."
System: Error - 'gh' command not found

What happened:

  • Started integration in Claude Code Web (no GitHub CLI access)
  • Needed to switch to Claude Code CLI mid-experiment
  • Lost context, had to re-establish state
  • Broke workflow continuity

Human intervention:

  1. Realized Web agent couldn't access GitHub API
  2. Switched to CLI environment
  3. Re-explained context and state
  4. Continued from checkpoint

What "autonomous" would look like: Agent detects tool unavailability, switches environments automatically, preserves state.

Reality: Human orchestrated the environment switch.
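What might that detection look like in practice? Here's a minimal sketch; the helper names and the REST fallback are illustrative, not part of any existing agent tooling:

```python
import shutil
import subprocess

def github_backend() -> str:
    """Pick a GitHub access method based on what this environment actually has."""
    if shutil.which("gh"):      # gh CLI installed and on PATH
        return "gh-cli"
    return "rest-api"           # otherwise fall back to plain HTTPS calls

def pr_state(pr_number: int) -> str:
    """Fetch a PR's state using whichever backend is available."""
    if github_backend() == "gh-cli":
        result = subprocess.run(
            ["gh", "pr", "view", str(pr_number), "--json", "state"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout
    # A REST fallback would call api.github.com directly; left out of this sketch
    raise NotImplementedError("REST fallback not implemented here")
```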


Manual Intervention #2: Agent Coordination

The Problem: Agents don't know when other agents finish their work.

Actual workflow:

Human: "Pull from PR-2 now"
Agent: [pulls and merges PR-2]
Human: "Now pull from PR-5"
Agent: [pulls and merges PR-5]
Human: "Fix the test failures"
Agent: [investigates and fixes]
Human: "Push"
Agent: [pushes]

Frequency: Continuous throughout integration

What "autonomous" would look like: Integration agent monitors PR branches via GitHub API, detects completion, merges automatically.

Reality: Human manually coordinated every merge.


Manual Intervention #3: Reality Checks

The Problem: Agents wrote code against APIs they assumed existed but never verified.

Examples:

# Agent assumed this API exists:
from gv.ai.video_moderation_service.agent.scenario_builder import ScenarioBuilder
builder = ScenarioBuilder(time_provider=time_provider)

# Reality: Different module, different class name:
from tests.test_video_moderation.simulation.scenario_builder import Scenario
scenario = Scenario("test_name")
# Agent assumed this field exists:
result.details.classification == VideoModerationClassification.SAFE

# Reality: Field was removed from model:
result.details.is_appropriate == True  # This is the actual field

Impact:

  • 13 test failures from API mismatches
  • ~2 hours debugging speculative code
  • Wasted agent effort writing code that couldn't work

Human intervention:

  1. Noticed tests failing with "AttributeError: no attribute 'classification'"
  2. Checked actual model structure: python -c "from model import X; print(dir(X()))"
  3. Realized agent wrote tests for outdated/assumed APIs
  4. Corrected agent: "This field doesn't exist, use this instead"

What "autonomous" would look like: Agent introspects models before coding, verifies imports work.

Reality: Human caught and corrected incorrect assumptions.
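A cheap guardrail for this failure mode: a smoke test that verifies assumed imports and attributes before any generated tests run, so a wrong assumption surfaces as one clear failure instead of 13 scattered ones. A minimal sketch, reusing the module path shown above (extend the list for your own project):

```python
import importlib
import pytest

# (module path, attribute) pairs the generated tests are about to rely on
EXPECTED_API = [
    ("tests.test_video_moderation.simulation.scenario_builder", "Scenario"),
]

@pytest.mark.parametrize("module_path,attr", EXPECTED_API)
def test_assumed_api_exists(module_path, attr):
    """Fail fast if an import path or attribute the agent assumed is missing."""
    module = importlib.import_module(module_path)
    assert hasattr(module, attr), f"{module_path} has no attribute {attr!r}"
```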


The Scorecard: Autonomous vs Orchestrated

| Task | Designed to be autonomous? | Actually autonomous? | Reality |
| --- | --- | --- | --- |
| Branch merging | ✅ Yes | ❌ No | Human triggered each merge manually |
| Dependency install | ✅ Yes | ❌ No | Human ran uv sync 10+ times |
| Tool selection | ✅ Yes | ❌ No | Human switched Web → CLI when gh failed |
| API verification | ✅ Yes | ❌ No | Human corrected incorrect assumptions |
| PR coordination | ✅ Yes | ❌ No | Human directed "pull from PR-X now" |
| Test execution | ✅ Yes | ✅ Yes | Agent ran tests independently |
| Code writing | ✅ Yes | ✅ Yes | Agent wrote code without guidance |
| Bug fixing | ✅ Yes | 🟡 Partial | Agent fixed all bugs, but the human had to prompt reflection and supply the meta-level thinking for the more complex ones |

Autonomy Score: 2.5 / 8 tasks (31%)

Real workflow: Human-orchestrated parallel development with AI assistants, not autonomous multi-agent coordination.


The Human-AI Sweet Spot

Now for the surprising success story.

The Bug Nobody Noticed

PR-4 (property-based testing) ran 7,000+ random test scenarios. All passed.

Agent reported: "All property tests passing! 7000 scenarios validated."

User asked: "Why is budget utilization so high?"

This question changed everything.


The Investigation (Human-Led, AI-Executed)

User's observation:

Policy requires: 10 checks/minute (1 check per participant per 60s)
Budget provided: 1000 checks/minute (100x excess)

Expected usage: ~10 checks/min (policy requirement)
Actual usage: ~120 checks/min (12x waste!)

Where's the waste coming from?

Agent's response: "Let me investigate the budget allocation logic..."

What the agent found: participants were being checked every 5 seconds instead of every 60 seconds.

Root cause: No logic to skip participants far from their deadline.


The Impact

| Scenario | Expected | Actual | Waste |
| --- | --- | --- | --- |
| 10 participants, 60s recheck, 1000 budget | 10 checks/min | 120 checks/min | 12x |
| Cost (at $0.01 per check) | $0.10/min | $1.20/min | $1.10/min wasted |
| Monthly cost | $4,320 | $51,840 | $47,520 wasted |
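A few lines of Python reproduce the table's arithmetic:

```python
COST_PER_CHECK = 0.01            # dollars per moderation check
MINUTES_PER_MONTH = 60 * 24 * 30

expected_rate = 10               # checks/min required by policy
actual_rate = 120                # checks/min actually observed

expected_monthly = expected_rate * COST_PER_CHECK * MINUTES_PER_MONTH  # $4,320
actual_monthly = actual_rate * COST_PER_CHECK * MINUTES_PER_MONTH      # $51,840
print(f"wasted: ${actual_monthly - expected_monthly:,.0f}/month")      # $47,520
```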

This is a production bug costing real money.


Why Agent Didn't Notice (But Human Did)

Agent perspective:

  • Ran 7000 scenarios ✅
  • All invariants held (budget never exceeded) ✅
  • No test failures ✅
  • Declared success ✅

Human perspective:

  • "Wait, why are we using 12x more budget than policy requires?"
  • "That's not wrong, but it's wasteful"
  • "Let's investigate"

The gap: Agent optimized for correctness (invariants hold). Human optimized for efficiency (use only what's needed).


The Lesson: Complementary Strengths

| Capability | Agent | Human |
| --- | --- | --- |
| Run 7000 scenarios | ✅ Excellent | ❌ Too slow |
| Detect invariant violations | ✅ Excellent | 🟡 Might miss edge cases |
| Notice efficiency waste | ❌ Didn't catch it | ✅ Spotted immediately |
| Ask "why?" | ❌ Accepted results | ✅ Questioned assumptions |
| Deep investigation | ✅ Excellent (once directed) | 🟡 Tedious |

The pattern:

  1. Agent provides breadth (7000 scenarios)
  2. Human provides depth (critical thinking)
  3. Human directs investigation ("check budget utilization")
  4. Agent executes investigation (traces through code)
  5. Together: Find bugs that neither would find alone

What "Autonomous" Would Actually Require

Based on what we learned, here's what truly autonomous multi-agent development needs:

Inter-Agent Communication

# Integration agent monitors other agents' PRs
for pr in watch_prs():
    if pr.status == "ready":
        merge_automatically(pr)
        if run_tests():
            notify_success()
        else:
            investigate_failures()
            notify_agent_for_fixes()
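Most of that loop could be built on the gh CLI today; what doesn't exist is an agent that runs it unattended and routes failures back to the right agent. A rough sketch under those assumptions (gh is authenticated, the tests run under uv run pytest, and the hand-back step is a stub):

```python
import json
import subprocess
import time

def open_prs() -> list[dict]:
    """List open PRs with their merge status via the gh CLI."""
    result = subprocess.run(
        ["gh", "pr", "list", "--state", "open", "--json", "number,title,mergeable"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

def watch_and_merge(poll_seconds: int = 60) -> None:
    """Poll for mergeable PRs, merge them, and run the test suite after each merge."""
    while True:
        for pr in open_prs():
            if pr["mergeable"] == "MERGEABLE":
                subprocess.run(["gh", "pr", "merge", str(pr["number"]), "--squash"], check=True)
                tests = subprocess.run(["uv", "run", "pytest"])
                if tests.returncode != 0:
                    # No channel exists today to hand this back to the owning agent --
                    # which is exactly the gap described above.
                    print(f"tests failed after merging PR #{pr['number']}")
        time.sleep(poll_seconds)
```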

Model Introspection Before Coding

# Agent verifies API before using
def before_writing_test(model_class):
    verify_import_works(model_class)
    actual_fields = introspect_fields(model_class)
    actual_signature = get_signature(model_class.__init__)

    # Only now write tests using ACTUAL API
    write_tests(actual_fields, actual_signature)
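Python's standard library already provides the introspection primitives this pseudocode assumes. A minimal sketch using importlib and inspect, with the Scenario class from earlier as the example target:

```python
import importlib
import inspect

def describe_class(module_path: str, class_name: str) -> dict:
    """Introspect a class before generating code against it."""
    cls = getattr(importlib.import_module(module_path), class_name)
    return {
        "init_signature": str(inspect.signature(cls.__init__)),
        "public_attrs": [name for name in dir(cls) if not name.startswith("_")],
    }

# Verify the real API instead of assuming it
info = describe_class("tests.test_video_moderation.simulation.scenario_builder", "Scenario")
print(info["init_signature"])   # confirm the constructor before writing tests against it
```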

Efficiency Monitoring

# Agent notices inefficiency, not just correctness
def after_property_tests_pass():
    check_invariants()  # Current behavior ✅

    # NEW: Also check efficiency
    utilization = check_resource_utilization()
    if utilization > expected_utilization * 2:
        flag_potential_waste()
        investigate_cause()
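As one concrete (and deliberately simple) version of that utilization check, using the numbers from the budget bug:

```python
def utilization_ratio(checks_per_min: float, participants: int, recheck_interval_s: float) -> float:
    """Ratio of the observed check rate to the rate the policy actually requires."""
    required_per_min = participants * (60.0 / recheck_interval_s)
    return checks_per_min / required_per_min

ratio = utilization_ratio(checks_per_min=120, participants=10, recheck_interval_s=60)
assert ratio == 12.0   # the 12x waste the human spotted; an agent could flag anything > 2
```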

None of these capabilities exist today in Claude Code. Hence: constant human orchestration.


What Actually Worked

Despite the orchestration requirements, we did achieve significant wins:

1. Zero-Conflict Architecture (★★★★★)

The Design:

  • Each PR owns its files completely
  • No shared file modifications
  • CREATE new files instead of MODIFY existing

The Result:

  • 100% auto-merge success rate
  • Zero manual conflict resolution
  • 5 PRs merged in 48 hours

Key Insight: File-level ownership is the gold standard for parallel work.
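The ownership rule is also easy to verify mechanically before spawning agents. A sketch that asserts two branches touch disjoint file sets (the branch names here are placeholders):

```python
import subprocess

def files_touched(branch: str, base: str = "origin/main") -> set[str]:
    """Files a branch modifies relative to the base branch."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...{branch}"],
        capture_output=True, text=True, check=True,
    )
    return set(filter(None, result.stdout.splitlines()))

overlap = files_touched("pr-2-simulation") & files_touched("pr-5-property-tests")
assert not overlap, f"branches share files, parallel work will conflict: {overlap}"
```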

(More on this in Article 4: "Zero-Conflict Architecture")


2. Property-Based Testing + Human Oversight (★★★★★)

The Combination:

  • Agent runs 7000 scenarios → Validates invariants
  • Human asks "why is utilization high?" → Notices efficiency waste
  • Together → Find 12x cost bug

Key Insight: Agents provide breadth, humans provide depth.

(More on this in Article 3: "Property-Based Testing with Hypothesis")
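For readers who haven't used Hypothesis, this is the shape of such a property test. It's an illustrative sketch, not the experiment's actual code: simulate_checks stands in for the real simulation entry point, and the invariant is the budget one described above:

```python
from hypothesis import given, settings, strategies as st

@settings(max_examples=7000)
@given(participants=st.integers(1, 100), budget=st.integers(1, 1000))
def test_never_exceeds_budget(participants, budget):
    used = simulate_checks(participants, budget)   # hypothetical simulation entry point
    assert used <= budget  # invariant holds -- but says nothing about how wasteful usage is
```

The last comment is the whole point: the property encodes correctness, so an agent can run it 7,000 times without ever noticing the 12x inefficiency.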


3. Parallel Execution (★★★★☆)

Despite orchestration overhead:

  • 5 work streams completed in 48 hours of wall-clock time
  • Estimated sequential time: 8-10 days
  • Time savings: ~75% (even with orchestration!)

The math:

  • Human orchestration time: ~6 hours
  • Agent execution time: ~42 hours (parallelized across 5 agents)
  • Total wall-clock time: 48 hours
  • Sequential alternative: 192+ hours (8 days)

Orchestration overhead: 6 hours / 48 hours = 12.5% of total time
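For completeness, the savings and overhead figures follow directly from that breakdown:

```python
wall_clock_hours = 48          # parallel run, end to end
sequential_hours = 192         # 8+ days of sequential work
orchestration_hours = 6        # active human coordination

savings = 1 - wall_clock_hours / sequential_hours   # 0.75 -> 75% time saved
overhead = orchestration_hours / wall_clock_hours   # 0.125 -> 12.5% of total time
```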

Key Insight: Even with constant orchestration, parallel execution still delivered massive time savings.


The Honest Framing: Human-Orchestrated AI Assistants

Let's be clear about what we actually built:

What We Claimed (Part 1)

"5 independent AI agents working autonomously with zero coordination"

What We Actually Built

Human-orchestrated parallel development with AI assistants

Where:

  • Human provides:

    • Environment management (tool switching, dependency install)
    • Coordination signals ("pull from PR-X now")
    • Reality checks (API verification, efficiency monitoring)
    • Critical thinking ("why is this wasteful?")
  • AI Agents provide:

    • Code generation at scale
    • Test execution breadth (7000 scenarios)
    • Pattern implementation (once shown correct pattern)
    • Investigation depth (once directed)

The Value Proposition (Still Significant!)

Even though it's not "autonomous," the value is real:

Time Savings

  • 48 hours vs 8-10 days (75% reduction)
  • Despite 12.5% orchestration overhead

Quality Improvements

  • ~80 lines of duplicate code eliminated
  • 67 new tests added
  • 1 major efficiency bug found (12x cost savings)
  • 7000+ random scenarios validated

Developer Experience

  • Human focuses on high-value activities (architecture, verification, critical thinking)
  • AI handles high-volume activities (code generation, test execution, investigation)

Recommendations for Practitioners

If you're considering multi-agent AI development:

✅ Do This

1. Design for zero conflicts

  • File-level ownership
  • CREATE new files, don't MODIFY shared ones
  • Partition existing files carefully

2. Verify before coding

  • Always introspect models: print(dir(model))
  • Test imports before using: python -c "from X import Y"
  • Check signatures: help(function)

3. Plan for orchestration

  • Budget 10-15% of total time for human coordination
  • Set up easy branch switching (git worktrees)
  • Automate dependency install scripts
  • Create checklists for manual steps

4. Combine agent breadth + human depth

  • Use agents for volume (7000 test scenarios)
  • Use humans for insight ("why is this wasteful?")
  • Direct agents to investigate once you notice issues

5. Track metrics honestly

  • Measure orchestration overhead (not just execution time)
  • Count manual interventions (tool switches, fixes)
  • Report actual autonomy score (31% in our case)

❌ Don't Do This

1. Assume autonomy without verification

  • Agents will make incorrect API assumptions
  • Cost: Wasted effort + debugging time

2. Skip environment setup

  • Private dependencies need manual handling
  • Tool unavailability breaks workflows

3. Expect agents to coordinate automatically

  • No inter-agent communication today
  • Budget time for manual coordination

4. Trust property tests alone

  • Agents optimize for correctness (invariants)
  • Humans must check efficiency (resource usage)

5. Over-claim autonomy

  • Be honest about orchestration requirements
  • Report actual vs autonomous time

Conclusion

The experiment proved:

  • ✅ Zero-conflict architecture works (100% auto-merge)
  • ✅ Parallel execution saves time (75% reduction)
  • ✅ Property testing + human insight finds bugs (12x efficiency bug)
  • ❌ True autonomy doesn't exist yet (31% autonomy score)

The real value:
Not "autonomous multi-agent development" but well-orchestrated human-AI collaboration where:

  • Humans provide architecture, verification, and critical thinking
  • AI provides generation, execution breadth, and investigation depth
  • Together they achieve 75% time savings despite 12.5% orchestration overhead

Would we do it again? Yes! But with realistic expectations about orchestration requirements.

Next time we'd improve:

  1. Write verification scripts (model introspection, import checking)
  2. Create orchestration checklists (reduce decision fatigue)
  3. Build efficiency monitoring (not just correctness testing)
  4. Set realistic expectations (orchestrated assistance, not autonomy)

What's Next

In the following articles, we'll dive deeper into specific aspects, including property-based testing with Hypothesis (Article 3) and the zero-conflict architecture (Article 4).


Tags: #ai #claude #multi-agent #reality-check #human-ai-collaboration #orchestration #lessons-learned #honest-reporting


This is Part 2 of the Multi-Agent Development Series. Read Part 1 for the original experiment design and optimistic hypothesis.

Discussion: What's your experience with "autonomous" AI agents? Have you found the sweet spot between human orchestration and AI execution? Share in the comments!
