Aviad Rozenhek

Parallel AI Development: Can 5 Claude Code Agents Work Independently?

An experiment in coordinating multiple AI coding agents to refactor a complex test suite


The Setup

You've been working on a complex system for months. It's non-trivial - maybe it's a real-time policy-driven scheduling system, maybe it's a distributed data pipeline, maybe it's a state machine with temporal dependencies. The kind of system where simple unit tests aren't enough.

Your test suite works, but after months of iteration:

  • ✅ Good coverage (95%+)
  • ✅ Tests pass reliably
  • ⚠️ 200+ lines of duplicated test helpers
  • ⚠️ Tests break frequently during refactoring (40+ times in git history)
  • ⚠️ Some critical gaps in coverage
  • ⚠️ Slow feedback loops (some logic only tested through expensive integration tests)

You need to improve the test suite. Estimated time: 8-10 days of focused work.

But then you wonder:

"What if I spawn 5 parallel Claude Code sessions, each working on independent improvements simultaneously?"

The Experiment

Hypothesis: Parallel execution of test improvements using multiple AI coding agents will:

  1. Reduce total time to 24-48 hours (vs 8-10 days sequential) - 70-85% reduction
  2. Maintain quality with zero merge conflicts (by design)
  3. Create emergent value through cross-PR integration
  4. Prove viability of autonomous AI agents working with clear specs

Null Hypothesis: Parallel execution provides no time benefit and introduces coordination complexity that negates any gains.

The setup:

  • 5 parallel "worker" PRs, each improving a different aspect
  • 1 "integration" PR that merges continuously and exploits combinations
  • Each agent works independently with zero knowledge of others (except what's in git)
  • Comprehensive metrics tracking throughout

What we're really testing: Can AI agents work as independent team members if given clear, well-defined tasks?

The Design Innovations

Innovation 1: CREATE + EXPLOIT + PROVE Pattern

Early in planning, we spotted a critical flaw: what if a PR creates a capability but never uses it? How would we know it works?

The naive approach:

PR-1: Create test builders
PR-2: Create budget tests
PR-3: Create property tests
...
PR-6 (later): Update everything to use new builders

Problem: We don't know if builders work until PR-6. That's not parallel - that's deferred integration.

Our solution: Every PR must CREATE + EXPLOIT + PROVE

| Phase | What It Means | Example |
| --- | --- | --- |
| CREATE | Build the capability | Create test_builders.py |
| EXPLOIT | Use it extensively in the SAME PR | Update 15-20 test files immediately |
| PROVE | Demonstrate measurable value | Eliminate 200 lines, all tests pass |

This ensures:

  • ✅ Capabilities validated immediately (not deferred)
  • ✅ Integration issues surface during development (not later)
  • ✅ Each PR can be reviewed independently
  • ✅ Metrics prove value (not theoretical)

This is the key to true parallel development: Each stream completes end-to-end, not just one phase.

Innovation 2: Zero-Conflict Architecture

Challenge: How do 5 parallel PRs avoid conflicts?

Solution: Each PR "owns" its files - no shared modifications.

| Work Stream | What It Creates/Modifies | Conflict Risk |
| --- | --- | --- |
| PR-1: Test Builders | Creates test_builders.py + updates unit tests | ✅ None |
| PR-2: Budget Tests | Creates test_budget_allocation.py | ✅ None |
| PR-3: Segment Tests | Creates test_segment_duration.py | ✅ None |
| PR-4: Property Tests | Creates test_properties.py | ✅ None |
| PR-5: Placement Tests | Creates test_placement_changes.py | ✅ None |
| PR-6: Integration | Creates cross-PR integration tests | ✅ None - runs after merges |

The key insight: Conflict-free by construction. No merge conflict resolution needed.

How to achieve this:

  1. Create new files for new capabilities (never modify same files)
  2. Partition existing files carefully (PR-1 updates unit tests, PR-5 updates integration tests)
  3. Defer shared changes to integration PR (final cleanup, cross-cutting concerns)
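
One lightweight way to make this ownership explicit is a small manifest that both the planner and the integration agent can check. This is a hypothetical sketch, not part of the actual experiment setup; the paths simply mirror the table above, and the exact partitioning of existing test directories is illustrative:

from fnmatch import fnmatch

# Hypothetical ownership manifest for the zero-conflict design.
# Each worker PR may only create or modify paths listed under its key;
# anything shared is deferred to the integration PR.
PR_OWNERSHIP = {
    "PR-1-test-builders": ["tests/test_builders.py", "tests/unit/*"],
    "PR-2-budget":        ["tests/test_budget_allocation.py"],
    "PR-3-segments":      ["tests/test_segment_duration.py"],
    "PR-4-properties":    ["tests/test_properties.py"],
    "PR-5-placement":     ["tests/test_placement_changes.py", "tests/integration/*"],
    "PR-6-integration":   ["tests/cross_pr/*"],  # cross-PR tests plus final cleanup
}

def owners_of(path: str) -> list[str]:
    """Return the PRs that claim a path; more than one owner signals conflict risk."""
    return [pr for pr, patterns in PR_OWNERSHIP.items()
            if any(fnmatch(path, pattern) for pattern in patterns)]

Running owners_of over each PR's changed files before merging is a cheap way to verify that the zero-conflict property actually held.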

Innovation 3: Parallel Integration Conductor

Naive approach: Integration waits for all PRs to finish, then merges sequentially.

Problem: Integration creates no value, just combines others' work. Also, it's not truly parallel.

Our breakthrough: Integration runs in parallel from Hour 0 as:

  • Active merger: Continuously watches all PRs, merges immediately when complete (1-4 hour latency)
  • Cross-PR exploiter: Creates capabilities that require MULTIPLE PRs (emergent value)
  • Feedback provider: Detects integration issues early, communicates back to workers
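
As a rough illustration, the conductor's merge loop could look something like the sketch below. This is hypothetical: the branch names, the readiness signal, and the poll interval are all assumptions, and in the experiment the loop is driven by a Claude Code session rather than a script.

import subprocess
import time

# Hypothetical worker branch names; in practice these come from the master plan.
WORKER_BRANCHES = ["pr-1-builders", "pr-2-budget", "pr-3-segments",
                   "pr-4-properties", "pr-5-placement"]

def git(*args: str) -> str:
    """Run a git command on the integration branch and return its output."""
    return subprocess.run(["git", *args], check=True,
                          capture_output=True, text=True).stdout

def branch_is_ready(branch: str) -> bool:
    """Readiness signal (assumed): the worker committed a DONE.md marker."""
    probe = subprocess.run(["git", "cat-file", "-e", f"origin/{branch}:DONE.md"],
                           capture_output=True)
    return probe.returncode == 0

merged: set[str] = set()
while len(merged) < len(WORKER_BRANCHES):
    git("fetch", "origin")
    for branch in WORKER_BRANCHES:
        if branch not in merged and branch_is_ready(branch):
            git("merge", "--no-ff", f"origin/{branch}")  # zero conflicts by design
            merged.add(branch)
            # ...then create cross-PR tests and write FEEDBACK-<PR>.md if needed
    time.sleep(3600)  # re-check hourly, within the 1-4 hour merge latency target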

The Emergent Value: Cross-PR Exploitation

After PR-1 (builders) and PR-2 (budget tests) merge, integration creates:

# test_cross_pr_budget_with_builders.py
# This test can ONLY exist after PR-1 + PR-2 merge!

from test_builders import make_participant  # From PR-1
from test_budget_allocation import verify_tier_priority  # From PR-2

def test_tier_priority_with_elegant_setup():
    """Proves PR-1 + PR-2 integrate correctly"""
    # PR-1's builders make setup elegant
    participants = [
        make_participant("user1", tier=1, urgency="critical"),
        make_participant("user2", tier=2, urgency="normal"),
        make_participant("user3", tier=3, urgency="overdue")
    ]

    # PR-2's logic validates ordering
    result = verify_tier_priority(participants)

    # Proves: builders work + tier logic works + integration works
    assert result.order == ["user1", "user2", "user3"]

Why this matters: This test validates integration. It couldn't be written in PR-1 (no tier logic yet) or PR-2 (no builders yet). It can ONLY exist in integration.

Cross-PR Exploitation Timeline

| Timeline | Merged PRs | Integration Creates | Value |
| --- | --- | --- | --- |
| Hour 4-8 | PR-1 | N/A (baseline) | Test builders available |
| Hour 8-12 | PR-1 + PR-2 | Budget tests using builders | Proves builders + budget logic work together |
| Hour 12-16 | PR-1 + PR-3 | Segment tests using builders | Proves builders + segment logic work together |
| Hour 16-20 | PR-1 + PR-4 | Property tests generating 1000+ scenarios with builders | Massive test coverage with elegant setup |
| Hour 20-24 | PR-1 + PR-5 | Placement tests using builders | Critical safety validation |
| Hour 24-48 | ALL | test_full_integration.py using ALL 5 capabilities | Full system validation |

The final integration test uses capabilities from all 5 PRs simultaneously. This is the proof that everything works together.

Innovation 4: Dual-Channel Feedback

Problem: If integration detects issues in PR-2, how does it communicate quickly?

Solution: Two communication channels

Channel 1: Detailed FEEDBACK File

# FEEDBACK-PR-2.md

## Summary
✅ Merged successfully at Hour 8
⚠️ Found 2 issues during cross-PR testing

## Issue 1: Type Error in Budget Tests
**File**: `test_budget_allocation.py:45`
**Error**: `TypeError: expected Enum, got str`
**Fix**: Import `Priority` enum, use `Priority.HIGH` instead of "high"
**Code**:

    # Current (line 45)
    participant = make_participant("user1", priority="high")  # ❌

    # Fixed
    from models import Priority
    participant = make_participant("user1", priority=Priority.HIGH)  # ✅

## Issue 2: Missing Edge Case
**Test**: `test_tier_priority_boundary`
**Gap**: Doesn't test participant at EXACT threshold
**Recommendation**: Add test case for `staleness == THRESHOLD`

## Cross-PR Test Results
✅ test_budget_with_builders.py: 5/5 passing
❌ test_property_budget.py: 2/100 failing (edge case)

Channel 2: GitHub @mention for Immediacy

@claude

## PR-2 Integration Issues ⚠️

2 issues found during cross-PR testing. See detailed analysis in `FEEDBACK-PR-2.md`.

**Quick summary**:
1. Type error in line 45 (enum vs string) - Easy fix
2. Missing edge case test for threshold boundary

**Action needed**: Please fix within 4 hours if possible. Not blocking other PRs.

**Tests to run**: `pytest tests/test_budget_allocation.py -v`

Why dual-channel?

  • FEEDBACK file: Complete technical details, code snippets, full context
  • GitHub @mention: Immediate notification, quick summary, clear action items

This mimics real team communication: "Hey @teammate, urgent issue - check the detailed doc I wrote."

The Work Breakdown

PR-1: Test Builders (4-8 hours)

Problem: Test helpers duplicated across 15+ files (200+ lines of duplication)

CREATE: Centralized test_builders.py

# Semantic builders for domain concepts
def make_participant(user_id, placement, risk_level=0.5, **kwargs):
    """Create participant with sensible defaults"""
    return ParticipantState(user_id=user_id, placement=placement, ...)

def make_result(user_id, classification, timestamp, **kwargs):
    """Create moderation result with context"""
    return ModerationResult(user_id=user_id, classification=classification, ...)

EXPLOIT: Update 15-20 test files to use builders
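
To make EXPLOIT concrete, the change applied across those 15-20 files looks roughly like this before/after (the extra fields in the verbose version are hypothetical; they stand in for whatever the real constructor requires):

# Before: every test spells out the full constructor, duplicating defaults.
participant = ParticipantState(
    user_id="user1",
    placement="lobby",        # hypothetical field values
    risk_level=0.5,
    tier=1,
    last_checked=None,
)

# After: one builder call with sensible defaults, overriding only what matters.
from test_builders import make_participant

participant = make_participant("user1", placement="lobby", tier=1)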

PROVE:

  • Lines eliminated: 180-200
  • Test coverage: Maintained 95%+
  • All tests passing

PR-2: Budget Allocation Unit Tests (6-12 hours)

Problem: Complex priority/budget logic only tested through slow integration tests

CREATE: Unit tests for core allocation logic (15+ tests)
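
A sketch of one such unit test, assuming the allocation logic is importable on its own; the module paths, constructor fields, and result shape are assumptions, not the project's actual API:

from allocation import apply_budget          # assumed module path
from models import ParticipantState          # assumed constructor

def test_higher_tier_wins_when_budget_is_tight():
    """With budget for only one check, the tier-1 participant is selected first."""
    participants = [
        ParticipantState(user_id="user-tier-2", placement="lobby", tier=2),
        ParticipantState(user_id="user-tier-1", placement="lobby", tier=1),
    ]
    result = apply_budget(participants, budget=1)
    assert [p.user_id for p in result.selected] == ["user-tier-1"]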

EXPLOIT: Run in CI for fast feedback (milliseconds vs seconds)

PROVE:

  • Fast feedback loop for critical business logic
  • Catches regressions before expensive integration tests
  • Documents the priority algorithm

PR-3: Segment Duration Tests (3-6 hours)

Problem: Cost optimization logic (grouping) not tested

CREATE: Tests for segment grouping behavior (8+ tests)

EXPLOIT: Validate cost savings in realistic scenarios

PROVE: Confirms grouping reduces API calls by 60%
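
A sketch of what one grouping test could look like; the group_into_segments name and the window parameter are assumptions about the cost-optimization API:

from segments import group_into_segments     # hypothetical API

def test_adjacent_checks_collapse_into_one_segment():
    """Two checks inside the same window should become a single grouped API call."""
    checks = [
        {"user_id": "user1", "at_second": 0},
        {"user_id": "user2", "at_second": 30},
    ]
    segments = group_into_segments(checks, window_seconds=60)
    assert len(segments) == 1                 # one API call instead of two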

PR-4: Property-Based Tests (6-12 hours)

Problem: Fixed test scenarios may miss edge cases

CREATE: Property tests using hypothesis library

from hypothesis import given, strategies as st

@given(
    participants=st.lists(st.builds(Participant), min_size=10, max_size=100),
    budget=st.integers(min_value=1, max_value=50)
)
def test_budget_never_exceeded(participants, budget):
    """Property: budget NEVER exceeded regardless of inputs"""
    result = apply_budget(participants, budget)
    assert len(result.selected) <= budget

EXPLOIT: Generate and test 1000+ random scenarios automatically
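
Hypothesis runs 100 examples per test by default; to reach the 1000+ scenarios mentioned here you can raise the example budget with its settings decorator (a real Hypothesis API, reusing the strategies from the test above; the specific numbers are our choice):

from hypothesis import given, settings, strategies as st

@settings(max_examples=1000, deadline=None)   # default is 100 examples per test
@given(
    participants=st.lists(st.builds(Participant), min_size=10, max_size=100),
    budget=st.integers(min_value=1, max_value=50),
)
def test_budget_never_exceeded_at_scale(participants, budget):
    result = apply_budget(participants, budget)
    assert len(result.selected) <= budget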

PROVE: Discovers edge cases that fixed tests miss

PR-5: Placement Change Tests (4-8 hours)

Problem: Critical gap - dynamic state changes not tested

CREATE: Tests for state change detection (10+ tests)
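
One sketch of such a state-change test; the detect_placement_change name and its return fields are assumptions about the system under test:

from placement import detect_placement_change   # hypothetical API

def test_placement_change_triggers_recheck():
    """Any placement change should be detected and flagged for re-evaluation."""
    change = detect_placement_change(
        user_id="user1",
        previous_placement="placement_a",
        current_placement="placement_b",
    )
    assert change.changed is True
    assert change.requires_recheck is True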

EXPLOIT: Validate against production scenarios

PROVE: Catches safety violations before they reach users

PR-6: Integration (24-48 hours PARALLEL)

Role: Continuous merger + cross-PR exploiter + feedback provider

Starts at Hour 0 (not sequentially at the end!)

Activities:

  • Hours 0-24: Watch all PR branches, merge immediately when ready
  • Hours 8-48: Create cross-PR integration tests as capabilities combine
  • Continuous: Provide dual-channel feedback on integration issues
  • Hours 40-48: Final verification, metrics report, PR to main

Creates: 5-10 cross-PR integration tests that couldn't exist in individual PRs

The Recipe: How to Replicate This

Want to try this on your codebase? Here's the complete workflow:

Step 1: Deep Analysis (1-2 hours)

Prompt:

Analyze the test suite for [YOUR_SYSTEM] looking for:

1. Gaps in coverage (what's not tested?)
2. Overlaps and duplication (what's tested redundantly?)
3. Useful test harnesses that would help
4. Opportunities for parameterization
5. Library changes that improve testability

Create a comprehensive report with:
- Git history analysis (what broke and why?)
- Cross-file analysis (where's the duplication? quantify it)
- Prioritized recommendations by impact

Focus on MEANINGFUL IMPROVEMENTS only (not trivial line reductions).

Output: Comprehensive analysis document (ours: 1,231 lines) with:

  • Quantified duplication ("107 instances across 15 files")
  • Git history patterns ("tests broke 40+ times during schema changes")
  • Specific recommendations with line numbers
  • Prioritization by impact

Step 2: Parallel Execution Plan (2-4 hours)

Prompt:

Transform the analysis into an executable parallel plan:

1. MASTER DOCUMENT:
   - Overall plan and timeline
   - 5 parallel work streams (PRs) with clear names
   - Why they can run in parallel (zero conflicts)
   - Success criteria

2. INDIVIDUAL TASK DOCUMENTS (one per PR):
   - What to CREATE (the capability)
   - How to EXPLOIT it (use extensively in SAME PR)
   - How to PROVE it worked (measurable metrics)
   - Files to modify (ensure no conflicts with other PRs)
   - Success criteria and metrics template

3. INTEGRATION DOCUMENT:
   - How to merge continuously (don't wait for all PRs)
   - What cross-PR exploitations to create (emergent value)
   - How to provide feedback (dual-channel: file + @mention)

REQUIREMENTS:
- Each PR must CREATE + EXPLOIT + PROVE (not just create)
- Zero conflicts by design (each PR owns its files)
- Integration runs in parallel from Hour 0 (not sequential)
- Integration creates emergent value (combinations)
- Comprehensive metrics collection

Key refinement prompts (iterate on initial plan):

Refinement 1:
"Each PR should not only CREATE a capability but also EXPLOIT IT.
That way each PR can PROVE it did something useful and not buggy."

Refinement 2:
"The integration PR should work in PARALLEL to all other PRs
and actively merge their work. If there are problems it can write
feedback in a FEEDBACK.md file on that PR."

Refinement 3:
"Give the integration PR a job to integrate/exploit new capabilities
that couldn't be used across PRs. If it sees problems it should report
via FEEDBACK.md file AND post a comment mentioning @claude with
specific instructions."

Output:

  • Master plan with zero-conflict design
  • 5+ PR task documents with CREATE + EXPLOIT + PROVE structure
  • Integration conductor document with cross-PR exploitation plan
  • Metrics template

Step 3: Execution (24-48 hours)

Setup:

  1. Create master work ticket
  2. Create 5 sub-tickets (one per worker PR)
  3. Create 1 integration ticket
  4. Spawn 6 parallel Claude Code sessions

Key principle: Each AI agent works independently with zero knowledge of other sessions (except what's in git). This tests autonomous operation.

Each session gets:

  • Task document (e.g., PR-1-TEST-BUILDERS.md)
  • Metrics template
  • Clear success criteria

Integration session:

  • Starts immediately (Hour 0)
  • Watches all PR branches
  • Merges within 1-4 hours of PR completion
  • Creates cross-PR tests
  • Provides dual-channel feedback

Step 4: Metrics Collection (Throughout)

Track everything:

Code Metrics:

  • Lines added/removed/net change
  • Files created/modified/deleted
  • Duplication eliminated (before/after counts)
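
For the duplication counts, one rough way to get comparable before/after numbers is to count the flagged setup pattern across the test tree (a sketch; the pattern is whatever the Step 1 analysis identified):

from pathlib import Path

def count_pattern(pattern: str, root: str = "tests") -> int:
    """Count occurrences of a duplicated setup pattern across all test files."""
    return sum(path.read_text().count(pattern) for path in Path(root).rglob("*.py"))

# Run once on the base branch and once on the PR branch, then diff the numbers.
print(count_pattern("ParticipantState("))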

Test Metrics:

  • Tests added/modified/removed
  • Coverage before/after (maintain or improve)
  • Execution time delta (faster feedback?)

Time Metrics:

  • Start/end timestamps
  • Duration breakdown (planning/coding/testing/debugging)
  • Actual vs estimated time
  • Parallel efficiency = (Sequential estimate) / (Actual time)
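
For example (illustrative numbers): if the sequential estimate is 9 days and the parallel run finishes in 36 hours (1.5 days), parallel efficiency = 9 / 1.5 = 6x, roughly an 83% reduction in calendar time, inside the 70-85% target.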

Quality Metrics:

  • Gaps closed (critical/important/nice-to-have)
  • Brittleness reduced (measure: tests broken during refactoring)
  • Developer experience (qualitative assessment)

Experiment Metrics (the key ones):

  • Total time: actual vs sequential estimate
  • Merge conflicts: count and resolution time (target: zero)
  • Cross-PR value: tests that couldn't exist in individual PRs
  • Feedback effectiveness: response time, issue resolution rate
  • Coordination overhead: integration time vs time saved

Step 5: Analysis (After completion)

Quantitative analysis:

  • Did we hit the 70-85% time reduction target?
  • Was conflict count actually zero?
  • How much emergent value from cross-PR tests?
  • What was parallel efficiency?

Qualitative analysis:

  • What worked surprisingly well?
  • What failed unexpectedly?
  • Where were hidden dependencies?
  • Was dual-channel feedback effective?
  • Would we do this again? What would we change?

The Hypothesis Matrix

| Question | Measurement | Our Hypothesis | How We'll Know |
| --- | --- | --- | --- |
| Does parallel execution save time? | Total time vs sequential baseline | 24-48 hours vs 8-10 days (70-85% reduction) | Compare actual completion time |
| Does it introduce conflicts? | Merge conflict count | Zero conflicts by design | Count merge conflicts encountered |
| Does it maintain quality? | Coverage, duplication, gaps closed | Equal or better quality | Compare metrics before/after |
| Does cross-PR create value? | Number of integration tests | 5-10 tests that couldn't exist in individual PRs | Count cross-PR tests created |
| Is coordination overhead worth it? | Integration PR time vs time saved | Integration creates value beyond coordination | Compare integration effort to time saved |

Why This Matters

This experiment tests a fundamental question:

"Can we parallelize software development by giving independent AI agents independent, well-defined tasks?"

If it works, it suggests:

  • Complex refactoring can be parallelized (not inherently sequential)
  • AI agents can work autonomously with clear specs (minimal coordination)
  • Coordination overhead can be minimized (detailed specs > real-time communication)
  • Development velocity scales horizontally (more agents = faster completion)
  • Emergent value from integration (whole > sum of parts)

If it doesn't work, we learn:

  • Where parallel development breaks down (hidden dependencies, interface coupling)
  • What conflicts are unavoidable (shared state, cross-cutting concerns)
  • Required coordination overhead (feedback loops, iteration cycles)
  • Whether complexity is worth speed (diminishing returns, coordination cost)
  • Optimal task decomposition level (too granular vs too coarse)

Either way, we get data.

What Makes This Different

This isn't just "use AI to write code faster." This is:

A systematic workflow for parallel AI development:

  • Conflict-free architecture (zero-conflict by design)
  • CREATE + EXPLOIT + PROVE pattern (end-to-end validation)
  • Parallel integration conductor (emergent value)
  • Dual-channel feedback (fast communication)
  • Comprehensive metrics (measure everything)

A reproducible recipe anyone can follow:

  • Step-by-step prompts
  • Design principles
  • Common pitfalls to avoid
  • Metrics to track

An honest experiment with clear success/failure criteria:

  • Null hypothesis stated upfront
  • Quantitative measurements
  • We'll report results regardless of outcome

What's Next: Part 2

After the experiment completes (estimated: 1-2 weeks), we'll publish Part 2 with:

Quantitative Results:

  • Actual time taken vs 8-10 day estimate
  • Merge conflicts encountered (we predict zero)
  • Code quality improvements (duplication, coverage, gaps)
  • Cross-PR exploitation value (number of integration tests)
  • Parallel efficiency calculation

Qualitative Insights:

  • What worked surprisingly well (unexpected synergies?)
  • What failed unexpectedly (hidden dependencies?)
  • Communication effectiveness (did dual-channel feedback work?)
  • Integration challenges (what did we miss?)
  • Whether CREATE + EXPLOIT + PROVE delivered value

Lessons Learned:

  • Best practices for parallel AI development
  • Task decomposition strategies (how granular is optimal?)
  • Metrics that mattered vs metrics that didn't
  • When to parallelize vs stay sequential
  • Would we do this again? What would we change?

The Code:

  • Before/after metrics comparison
  • FEEDBACK files showing communication
  • Complete metrics report

Questions for the Community

We'd love your input before we run this:

  1. Have you tried coordinating multiple AI agents? How did it go?
  2. What do you think will break? Where are the hidden dependencies?
  3. What metrics would you add? What are we missing?
  4. Would you try this? For what kind of projects?
  5. What's the optimal team size? We chose 5+1, would you do more/less?

Drop your thoughts in the comments!


Tags: #ai #claude #softwaredevelopment #testing #experiment #parallel #workflow #devops #automation


This is Part 1 of a 2-part series on parallel AI-assisted development. Part 2 will report results and lessons learned.
