Aviad Rozenhek
The Reality of "Autonomous" Multi-Agent Development

What we learned when 5 "independent" AI agents needed constant human orchestration

Part 2 of the Multi-Agent Development Series


TL;DR

We set out to prove AI agents could work independently with zero coordination. The zero-conflict architecture worked perfectly (100% auto-merge). But "autonomous" agents? That was the illusion.

What actually happened:

  • ✅ Zero merge conflicts (architecture win)
  • ❌ Constant human orchestration required (autonomy myth)
  • ✅ Property tests found a 12x efficiency bug (agent + human win)
  • ❌ But only because human asked the right questions (human essential)

The real lesson: The value isn't "autonomous agents." It's well-orchestrated human-AI collaboration.


The Setup (Brief Recap)

In Part 1, we described an experiment:

  • 5 parallel AI coding agents
  • Each working on independent test improvements
  • Zero-conflict architecture (file-level ownership)
  • Hypothesis: Agents work independently, human just merges

What we expected: Spawn 5 agents, come back in 48 hours, merge everything.

What we got: 8 hours of active human orchestration disguised as "autonomous" work.


The Orchestration Reality

Manual Intervention #1: Tool Switching

The Problem:

Integration Agent (Claude Web): "I'll check the PR status using gh CLI..."
System: Error - 'gh' command not found

What happened:

  • Started integration in Claude Code Web (no GitHub CLI access)
  • Needed to switch to Claude Code CLI mid-experiment
  • Lost context, had to re-establish state
  • Broke workflow continuity

Human intervention:

  1. Realized Web agent couldn't access GitHub API
  2. Switched to CLI environment
  3. Re-explained context and state
  4. Continued from checkpoint

What "autonomous" would look like: Agent detects tool unavailability, switches environments automatically, preserves state.

Reality: Human orchestrated the environment switch.
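What might that detection look like in practice? Here's a minimal sketch; the helper names and the REST fallback are illustrative, not part of any existing agent tooling:

```python
import shutil
import subprocess

def github_backend() -> str:
    """Pick a GitHub access method based on what this environment actually has."""
    if shutil.which("gh"):      # gh CLI installed and on PATH
        return "gh-cli"
    return "rest-api"           # otherwise fall back to plain HTTPS calls

def pr_state(pr_number: int) -> str:
    """Fetch a PR's state using whichever backend is available."""
    if github_backend() == "gh-cli":
        result = subprocess.run(
            ["gh", "pr", "view", str(pr_number), "--json", "state"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout
    # A REST fallback would call api.github.com directly; left out of this sketch
    raise NotImplementedError("REST fallback not implemented here")
```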


Manual Intervention #2: Agent Coordination

The Problem: Agents don't know when other agents finish their work.

Actual workflow:

Human: "Pull from PR-2 now"
Agent: [pulls and merges PR-2]
Human: "Now pull from PR-5"
Agent: [pulls and merges PR-5]
Human: "Fix the test failures"
Agent: [investigates and fixes]
Human: "Push"
Agent: [pushes]

Frequency: Continuous throughout integration

What "autonomous" would look like: Integration agent monitors PR branches via GitHub API, detects completion, merges automatically.

Reality: Human manually coordinated every merge.


Manual Intervention #3: Reality Checks

The Problem: Agents wrote code against APIs they assumed existed but never verified.

Examples:

# Agent assumed this API exists:
from gv.ai.video_moderation_service.agent.scenario_builder import ScenarioBuilder
builder = ScenarioBuilder(time_provider=time_provider)

# Reality: Different module, different class name:
from tests.test_video_moderation.simulation.scenario_builder import Scenario
scenario = Scenario("test_name")
# Agent assumed this field exists:
result.details.classification == VideoModerationClassification.SAFE

# Reality: Field was removed from model:
result.details.is_appropriate == True  # This is the actual field

Impact:

  • 13 test failures from API mismatches
  • ~2 hours debugging speculative code
  • Wasted agent effort writing code that couldn't work

Human intervention:

  1. Noticed tests failing with "AttributeError: no attribute 'classification'"
  2. Checked actual model structure: python -c "from model import X; print(dir(X()))"
  3. Realized agent wrote tests for outdated/assumed APIs
  4. Corrected agent: "This field doesn't exist, use this instead"

What "autonomous" would look like: Agent introspects models before coding, verifies imports work.

Reality: Human caught and corrected incorrect assumptions.
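A cheap guardrail for this failure mode: a smoke test that verifies assumed imports and attributes before any generated tests run, so a wrong assumption surfaces as one clear failure instead of 13 scattered ones. A minimal sketch, reusing the module path shown above (extend the list for your own project):

```python
import importlib
import pytest

# (module path, attribute) pairs the generated tests are about to rely on
EXPECTED_API = [
    ("tests.test_video_moderation.simulation.scenario_builder", "Scenario"),
]

@pytest.mark.parametrize("module_path,attr", EXPECTED_API)
def test_assumed_api_exists(module_path, attr):
    """Fail fast if an import path or attribute the agent assumed is missing."""
    module = importlib.import_module(module_path)
    assert hasattr(module, attr), f"{module_path} has no attribute {attr!r}"
```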


The Scorecard: Autonomous vs Orchestrated

| Task | Designed to be autonomous? | Actually autonomous? | Reality |
| --- | --- | --- | --- |
| Branch merging | ✅ Yes | ❌ No | Human triggered each merge manually |
| Dependency install | ✅ Yes | ❌ No | Human ran uv sync 10+ times |
| Tool selection | ✅ Yes | ❌ No | Human switched Web → CLI when gh failed |
| API verification | ✅ Yes | ❌ No | Human corrected incorrect assumptions |
| PR coordination | ✅ Yes | ❌ No | Human directed "pull from PR-X now" |
| Test execution | ✅ Yes | ✅ Yes | Agent ran tests independently |
| Code writing | ✅ Yes | ✅ Yes | Agent wrote code without guidance |
| Bug fixing | ✅ Yes | 🟡 Partial | Agent fixed all bugs, but the human had to prompt reflection and supply the meta-level thinking for the more complex ones |

Autonomy Score: 2.5 / 8 tasks (31%)

Real workflow: Human-orchestrated parallel development with AI assistants, not autonomous multi-agent coordination.


The Human-AI Sweet Spot

Now for the surprising success story.

The Bug Nobody Noticed

PR-4 (property-based testing) ran 7,000+ random test scenarios. All passed.

Agent reported: "All property tests passing! 7000 scenarios validated."

User asked: "Why is budget utilization so high?"

This question changed everything.


The Investigation (Human-Led, AI-Executed)

User's observation:

Policy requires: 10 checks/minute (1 check per participant per 60s)
Budget provided: 1000 checks/minute (100x excess)

Expected usage: ~10 checks/min (policy requirement)
Actual usage: ~120 checks/min (12x waste!)

Where's the waste coming from?

Agent's response: "Let me investigate the budget allocation logic..."

What the agent found: participants were being checked every 5 seconds instead of every 60 seconds.

Root cause: No logic to skip participants far from their deadline.


The Impact

| Scenario | Expected | Actual | Waste |
| --- | --- | --- | --- |
| 10 participants, 60s recheck, 1000 budget | 10 checks/min | 120 checks/min | 12x |
| Cost (at $0.01 per check) | $0.10/min | $1.20/min | $1.10/min wasted |
| Monthly cost | $4,320 | $51,840 | $47,520 wasted |
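A few lines of Python reproduce the table's arithmetic:

```python
COST_PER_CHECK = 0.01            # dollars per moderation check
MINUTES_PER_MONTH = 60 * 24 * 30

expected_rate = 10               # checks/min required by policy
actual_rate = 120                # checks/min actually observed

expected_monthly = expected_rate * COST_PER_CHECK * MINUTES_PER_MONTH  # $4,320
actual_monthly = actual_rate * COST_PER_CHECK * MINUTES_PER_MONTH      # $51,840
print(f"wasted: ${actual_monthly - expected_monthly:,.0f}/month")      # $47,520
```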

This is a production bug costing real money.


Why Agent Didn't Notice (But Human Did)

Agent perspective:

  • Ran 7000 scenarios ✅
  • All invariants held (budget never exceeded) ✅
  • No test failures ✅
  • Declared success ✅

Human perspective:

  • "Wait, why are we using 12x more budget than policy requires?"
  • "That's not wrong, but it's wasteful"
  • "Let's investigate"

The gap: Agent optimized for correctness (invariants hold). Human optimized for efficiency (use only what's needed).


The Lesson: Complementary Strengths

| Capability | Agent | Human |
| --- | --- | --- |
| Run 7000 scenarios | ✅ Excellent | ❌ Too slow |
| Detect invariant violations | ✅ Excellent | 🟡 Might miss edge cases |
| Notice efficiency waste | ❌ Didn't catch it | ✅ Spotted immediately |
| Ask "why?" | ❌ Accepted results | ✅ Questioned assumptions |
| Deep investigation | ✅ Excellent (once directed) | 🟡 Tedious |

The pattern:

  1. Agent provides breadth (7000 scenarios)
  2. Human provides depth (critical thinking)
  3. Human directs investigation ("check budget utilization")
  4. Agent executes investigation (traces through code)
  5. Together: Find bugs that neither would find alone

What "Autonomous" Would Actually Require

Based on what we learned, here's what truly autonomous multi-agent development needs:

Inter-Agent Communication

# Integration agent monitors other agents' PRs
for pr in watch_prs():
    if pr.status == "ready":
        merge_automatically(pr)
        if run_tests():
            notify_success()
        else:
            investigate_failures()
            notify_agent_for_fixes()
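Most of that loop could be built on the gh CLI today; what doesn't exist is an agent that runs it unattended and routes failures back to the right agent. A rough sketch under those assumptions (gh is authenticated, the tests run under uv run pytest, and the hand-back step is a stub):

```python
import json
import subprocess
import time

def open_prs() -> list[dict]:
    """List open PRs with their merge status via the gh CLI."""
    result = subprocess.run(
        ["gh", "pr", "list", "--state", "open", "--json", "number,title,mergeable"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

def watch_and_merge(poll_seconds: int = 60) -> None:
    """Poll for mergeable PRs, merge them, and run the test suite after each merge."""
    while True:
        for pr in open_prs():
            if pr["mergeable"] == "MERGEABLE":
                subprocess.run(["gh", "pr", "merge", str(pr["number"]), "--squash"], check=True)
                tests = subprocess.run(["uv", "run", "pytest"])
                if tests.returncode != 0:
                    # No channel exists today to hand this back to the owning agent --
                    # which is exactly the gap described above.
                    print(f"tests failed after merging PR #{pr['number']}")
        time.sleep(poll_seconds)
```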

Model Introspection Before Coding

# Agent verifies API before using
def before_writing_test(model_class):
    verify_import_works(model_class)
    actual_fields = introspect_fields(model_class)
    actual_signature = get_signature(model_class.__init__)

    # Only now write tests using ACTUAL API
    write_tests(actual_fields, actual_signature)
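Python's standard library already provides the introspection primitives this pseudocode assumes. A minimal sketch using importlib and inspect, with the Scenario class from earlier as the example target:

```python
import importlib
import inspect

def describe_class(module_path: str, class_name: str) -> dict:
    """Introspect a class before generating code against it."""
    cls = getattr(importlib.import_module(module_path), class_name)
    return {
        "init_signature": str(inspect.signature(cls.__init__)),
        "public_attrs": [name for name in dir(cls) if not name.startswith("_")],
    }

# Verify the real API instead of assuming it
info = describe_class("tests.test_video_moderation.simulation.scenario_builder", "Scenario")
print(info["init_signature"])   # confirm the constructor before writing tests against it
```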

Efficiency Monitoring

# Agent notices inefficiency, not just correctness
def after_property_tests_pass():
    check_invariants()  # Current behavior ✅

    # NEW: Also check efficiency
    utilization = check_resource_utilization()
    if utilization > expected_utilization * 2:
        flag_potential_waste()
        investigate_cause()
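As one concrete (and deliberately simple) version of that utilization check, using the numbers from the budget bug:

```python
def utilization_ratio(checks_per_min: float, participants: int, recheck_interval_s: float) -> float:
    """Ratio of the observed check rate to the rate the policy actually requires."""
    required_per_min = participants * (60.0 / recheck_interval_s)
    return checks_per_min / required_per_min

ratio = utilization_ratio(checks_per_min=120, participants=10, recheck_interval_s=60)
assert ratio == 12.0   # the 12x waste the human spotted; an agent could flag anything > 2
```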

None of these capabilities exist today in Claude Code. Hence: constant human orchestration.


What Actually Worked

Despite the orchestration requirements, we did achieve significant wins:

1. Zero-Conflict Architecture (★★★★★)

The Design:

  • Each PR owns its files completely
  • No shared file modifications
  • CREATE new files instead of MODIFY existing

The Result:

  • 100% auto-merge success rate
  • Zero manual conflict resolution
  • 5 PRs merged in 48 hours

Key Insight: File-level ownership is the gold standard for parallel work.
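The ownership rule is also easy to verify mechanically before spawning agents. A sketch that asserts two branches touch disjoint file sets (the branch names here are placeholders):

```python
import subprocess

def files_touched(branch: str, base: str = "origin/main") -> set[str]:
    """Files a branch modifies relative to the base branch."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...{branch}"],
        capture_output=True, text=True, check=True,
    )
    return set(filter(None, result.stdout.splitlines()))

overlap = files_touched("pr-2-simulation") & files_touched("pr-5-property-tests")
assert not overlap, f"branches share files, parallel work will conflict: {overlap}"
```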

(More on this in Article 4: "Zero-Conflict Architecture")


2. Property-Based Testing + Human Oversight (★★★★★)

The Combination:

  • Agent runs 7000 scenarios → Validates invariants
  • Human asks "why is utilization high?" → Notices efficiency waste
  • Together → Find 12x cost bug

Key Insight: Agents provide breadth, humans provide depth.

(More on this in Article 3: "Property-Based Testing with Hypothesis")
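For readers who haven't used Hypothesis, this is the shape of such a property test. It's an illustrative sketch, not the experiment's actual code: simulate_checks stands in for the real simulation entry point, and the invariant is the budget one described above:

```python
from hypothesis import given, settings, strategies as st

@settings(max_examples=7000)
@given(participants=st.integers(1, 100), budget=st.integers(1, 1000))
def test_never_exceeds_budget(participants, budget):
    used = simulate_checks(participants, budget)   # hypothetical simulation entry point
    assert used <= budget  # invariant holds -- but says nothing about how wasteful usage is
```

The last comment is the whole point: the property encodes correctness, so an agent can run it 7,000 times without ever noticing the 12x inefficiency.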


3. Parallel Execution (★★★★☆)

Despite orchestration overhead:

  • 5 work streams completed in 48 hours of wall-clock time
  • Estimated sequential time: 8-10 days
  • Time savings: ~75% (even with orchestration!)

The math:

  • Human orchestration time: ~6 hours
  • Agent execution time: ~42 hours (parallelized across 5 agents)
  • Total wall-clock time: 48 hours
  • Sequential alternative: 192+ hours (8 days)

Orchestration overhead: 6 hours / 48 hours = 12.5% of total time
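For completeness, the savings and overhead figures follow directly from that breakdown:

```python
wall_clock_hours = 48          # parallel run, end to end
sequential_hours = 192         # 8+ days of sequential work
orchestration_hours = 6        # active human coordination

savings = 1 - wall_clock_hours / sequential_hours   # 0.75 -> 75% time saved
overhead = orchestration_hours / wall_clock_hours   # 0.125 -> 12.5% of total time
```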

Key Insight: Even with constant orchestration, parallel execution still delivered massive time savings.


The Honest Framing: Human-Orchestrated AI Assistants

Let's be clear about what we actually built:

What We Claimed (Part 1)

"5 independent AI agents working autonomously with zero coordination"

What We Actually Built

Human-orchestrated parallel development with AI assistants

Where:

  • Human provides:

    • Environment management (tool switching, dependency install)
    • Coordination signals ("pull from PR-X now")
    • Reality checks (API verification, efficiency monitoring)
    • Critical thinking ("why is this wasteful?")
  • AI Agents provide:

    • Code generation at scale
    • Test execution breadth (7000 scenarios)
    • Pattern implementation (once shown correct pattern)
    • Investigation depth (once directed)

The Value Proposition (Still Significant!)

Even though it's not "autonomous," the value is real:

Time Savings

  • 48 hours vs 8-10 days (75% reduction)
  • Despite 12.5% orchestration overhead

Quality Improvements

  • ~80 lines of duplicate code eliminated
  • 67 new tests added
  • 1 major efficiency bug found (12x cost savings)
  • 7000+ random scenarios validated

Developer Experience

  • Human focuses on high-value activities (architecture, verification, critical thinking)
  • AI handles high-volume activities (code generation, test execution, investigation)

Recommendations for Practitioners

If you're considering multi-agent AI development:

✅ Do This

1. Design for zero conflicts

  • File-level ownership
  • CREATE new files, don't MODIFY shared ones
  • Partition existing files carefully

2. Verify before coding

  • Always introspect models: print(dir(model))
  • Test imports before using: python -c "from X import Y"
  • Check signatures: help(function)

3. Plan for orchestration

  • Budget 10-15% of total time for human coordination
  • Set up easy branch switching (git worktrees)
  • Automate dependency install scripts
  • Create checklists for manual steps

4. Combine agent breadth + human depth

  • Use agents for volume (7000 test scenarios)
  • Use humans for insight ("why is this wasteful?")
  • Direct agents to investigate once you notice issues

5. Track metrics honestly

  • Measure orchestration overhead (not just execution time)
  • Count manual interventions (tool switches, fixes)
  • Report actual autonomy score (31% in our case)

❌ Don't Do This

1. Assume autonomy without verification

  • Agents will make incorrect API assumptions
  • Cost: Wasted effort + debugging time

2. Skip environment setup

  • Private dependencies need manual handling
  • Tool unavailability breaks workflows

3. Expect agents to coordinate automatically

  • No inter-agent communication today
  • Budget time for manual coordination

4. Trust property tests alone

  • Agents optimize for correctness (invariants)
  • Humans must check efficiency (resource usage)

5. Over-claim autonomy

  • Be honest about orchestration requirements
  • Report actual vs autonomous time

Conclusion

The experiment proved:

  • ✅ Zero-conflict architecture works (100% auto-merge)
  • ✅ Parallel execution saves time (75% reduction)
  • ✅ Property testing + human insight finds bugs (12x efficiency bug)
  • ❌ True autonomy doesn't exist yet (31% autonomy score)

The real value:
Not "autonomous multi-agent development" but well-orchestrated human-AI collaboration where:

  • Humans provide architecture, verification, and critical thinking
  • AI provides generation, execution breadth, and investigation depth
  • Together they achieve 75% time savings despite 12.5% orchestration overhead

Would we do it again? Yes! But with realistic expectations about orchestration requirements.

Next time we'd improve:

  1. Write verification scripts (model introspection, import checking)
  2. Create orchestration checklists (reduce decision fatigue)
  3. Build efficiency monitoring (not just correctness testing)
  4. Set realistic expectations (orchestrated assistance, not autonomy)

What's Next

In the following articles, we'll dive deeper into specific aspects, including property-based testing with Hypothesis (Article 3) and the zero-conflict architecture (Article 4).


Tags: #ai #claude #multi-agent #reality-check #human-ai-collaboration #orchestration #lessons-learned #honest-reporting


This is Part 2 of the Multi-Agent Development Series. Read Part 1 for the original experiment design and optimistic hypothesis.

Discussion: What's your experience with "autonomous" AI agents? Have you found the sweet spot between human orchestration and AI execution? Share in the comments!
