An experiment in coordinating multiple AI coding agents to refactor a complex test suite
The Setup
You've been working on a complex system for months. It's non-trivial - maybe it's a real-time policy-driven scheduling system, maybe it's a distributed data pipeline, maybe it's a state machine with temporal dependencies. The kind of system where simple unit tests aren't enough.
Your test suite works, but after months of iteration:
- ✅ Good coverage (95%+)
- ✅ Tests pass reliably
- ⚠️ 200+ lines of duplicated test helpers
- ⚠️ Tests break frequently during refactoring (40+ times in git history)
- ⚠️ Some critical gaps in coverage
- ⚠️ Slow feedback loops (some logic only tested through expensive integration tests)
You need to improve the test suite. Estimated time: 8-10 days of focused work.
But then you wonder:
"What if I spawn 5 parallel Claude Code sessions, each working on independent improvements simultaneously?"
The Experiment
Hypothesis: Parallel execution of test improvements using multiple AI coding agents will:
- Reduce total time to 24-48 hours (vs 8-10 days sequential) - 70-85% reduction
- Maintain quality with zero merge conflicts (by design)
- Create emergent value through cross-PR integration
- Prove viability of autonomous AI agents working with clear specs
Null Hypothesis: Parallel execution provides no time benefit and introduces coordination complexity that negates any gains.
The setup:
- 5 parallel "worker" PRs, each improving a different aspect
- 1 "integration" PR that merges continuously and exploits combinations
- Each agent works independently with zero knowledge of others (except what's in git)
- Comprehensive metrics tracking throughout
What we're really testing: Can AI agents work as independent team members if given clear, well-defined tasks?
The Design Innovations
Innovation 1: CREATE + EXPLOIT + PROVE Pattern
Early in planning, we realized a critical flaw: What if a PR creates a capability but never uses it? How would we know it works?
The naive approach:
- PR-1: Create test builders
- PR-2: Create budget tests
- PR-3: Create property tests
- ...
- PR-6 (later): Update everything to use new builders
Problem: We don't know if builders work until PR-6. That's not parallel - that's deferred integration.
Our solution: Every PR must CREATE + EXPLOIT + PROVE
| Phase | What It Means | Example |
|---|---|---|
| CREATE | Build the capability | Create `test_builders.py` |
| EXPLOIT | Use it extensively in the SAME PR | Update 15-20 test files immediately |
| PROVE | Demonstrate measurable value | Eliminate 200 lines, all tests pass |
This ensures:
- ✅ Capabilities validated immediately (not deferred)
- ✅ Integration issues surface during development (not later)
- ✅ Each PR can be reviewed independently
- ✅ Metrics prove value (not theoretical)
This is the key to true parallel development: Each stream completes end-to-end, not just one phase.
Innovation 2: Zero-Conflict Architecture
Challenge: How do 5 parallel PRs avoid conflicts?
Solution: Each PR "owns" its files - no shared modifications.
| Work Stream | What It Creates/Modifies | Conflict Risk |
|---|---|---|
| PR-1: Test Builders | Creates `test_builders.py` + updates unit tests | ✅ None |
| PR-2: Budget Tests | Creates `test_budget_allocation.py` | ✅ None |
| PR-3: Segment Tests | Creates `test_segment_duration.py` | ✅ None |
| PR-4: Property Tests | Creates `test_properties.py` | ✅ None |
| PR-5: Placement Tests | Creates `test_placement_changes.py` | ✅ None |
| PR-6: Integration | Creates cross-PR integration tests | ✅ None - runs after merges |
The key insight: Conflict-free by construction. No merge conflict resolution needed.
How to achieve this:
- Create new files for new capabilities (never modify same files)
- Partition existing files carefully (PR-1 updates unit tests, PR-5 updates integration tests)
- Defer shared changes to integration PR (final cleanup, cross-cutting concerns)
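A lightweight way to keep the ownership rule honest is to check the planned file partition before any agent starts. Here is a minimal sketch, assuming a hand-written manifest (the paths below are illustrative, not the real repo layout):

```python
# conflict_check.py - pre-flight check that no two PRs claim the same file.
# The OWNERSHIP manifest is illustrative; substitute your actual plan.
from itertools import combinations

OWNERSHIP = {
    "PR-1": {"tests/test_builders.py", "tests/unit/test_scheduler.py"},
    "PR-2": {"tests/test_budget_allocation.py"},
    "PR-3": {"tests/test_segment_duration.py"},
    "PR-4": {"tests/test_properties.py"},
    "PR-5": {"tests/test_placement_changes.py"},
}

def find_overlaps(ownership):
    """Return every file claimed by more than one PR (should be empty)."""
    overlaps = []
    for (pr_a, files_a), (pr_b, files_b) in combinations(ownership.items(), 2):
        shared = files_a & files_b
        if shared:
            overlaps.append((pr_a, pr_b, sorted(shared)))
    return overlaps

if __name__ == "__main__":
    assert not find_overlaps(OWNERSHIP), "Two PRs claim the same file!"
```

Running this as a pre-flight step turns "zero conflicts by design" into something the plan can actually fail on before any agent is spawned.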
Innovation 3: Parallel Integration Conductor
Naive approach: Integration waits for all PRs to finish, then merges sequentially.
Problem: Integration creates no value, just combines others' work. Also, it's not truly parallel.
Our breakthrough: Integration runs in parallel from Hour 0 as:
- Active merger: Continuously watches all PRs, merges immediately when complete (1-4 hour latency)
- Cross-PR exploiter: Creates capabilities that require MULTIPLE PRs (emergent value)
- Feedback provider: Detects integration issues early, communicates back to workers
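To make the conductor role concrete, here is a rough sketch of the merge-watcher loop. The branch names, the hourly poll, and the `pytest` entry point are assumptions for illustration, not part of the actual plan documents:

```python
# integration_conductor.py - rough sketch of the "active merger" loop.
import subprocess
import time

WORKER_BRANCHES = ["pr-1-builders", "pr-2-budget", "pr-3-segments",
                   "pr-4-properties", "pr-5-placement"]

def run(*cmd):
    """Run a command, capturing output so failures can be turned into feedback."""
    return subprocess.run(cmd, capture_output=True, text=True)

def try_merge(branch):
    """Merge one worker branch into the integration branch; back out on failure.

    A real version would first check that the worker PR is marked ready.
    """
    run("git", "fetch", "origin", branch)
    if run("git", "merge", "--no-ff", f"origin/{branch}").returncode != 0:
        run("git", "merge", "--abort")           # should not happen: zero-conflict design
        return False
    if run("pytest", "-q").returncode != 0:
        run("git", "reset", "--hard", "HEAD~1")  # undo the merge, write FEEDBACK instead
        return False
    return True

pending = list(WORKER_BRANCHES)
while pending:
    pending = [b for b in pending if not try_merge(b)]
    if pending:
        time.sleep(60 * 60)  # poll hourly; the plan targets 1-4 hour merge latency
```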
The Emergent Value: Cross-PR Exploitation
After PR-1 (builders) and PR-2 (budget tests) merge, integration creates:
```python
# test_cross_pr_budget_with_builders.py
# This test can ONLY exist after PR-1 + PR-2 merge!
from test_builders import make_participant                # From PR-1
from test_budget_allocation import verify_tier_priority   # From PR-2

def test_tier_priority_with_elegant_setup():
    """Proves PR-1 + PR-2 integrate correctly."""
    # PR-1's builders make setup elegant
    participants = [
        make_participant("user1", tier=1, urgency="critical"),
        make_participant("user2", tier=2, urgency="normal"),
        make_participant("user3", tier=3, urgency="overdue"),
    ]
    # PR-2's logic validates ordering
    result = verify_tier_priority(participants)
    # Proves: builders work + tier logic works + integration works
    assert result.order == ["user1", "user2", "user3"]
```
Why this matters: This test validates integration. It couldn't be written in PR-1 (no tier logic yet) or PR-2 (no builders yet). It can ONLY exist in integration.
Cross-PR Exploitation Timeline
| Timeline | Merged PRs | Integration Creates | Value |
|---|---|---|---|
| Hour 4-8 | PR-1 | N/A (baseline) | Test builders available |
| Hour 8-12 | PR-1 + PR-2 | Budget tests using builders | Proves builders + budget logic work together |
| Hour 12-16 | PR-1 + PR-3 | Segment tests using builders | Proves builders + segment logic work together |
| Hour 16-20 | PR-1 + PR-4 | Property tests generate 1000+ scenarios with builders | Massive test coverage with elegant setup |
| Hour 20-24 | PR-1 + PR-5 | Placement tests using builders | Critical safety validation |
| Hour 24-48 | ALL | `test_full_integration.py` using ALL 5 capabilities | Full system validation |
The final integration test uses capabilities from all 5 PRs simultaneously. This is the proof that everything works together.
Innovation 4: Dual-Channel Feedback
Problem: If integration detects issues in PR-2, how does it communicate quickly?
Solution: Two communication channels
Channel 1: Detailed FEEDBACK File
````markdown
# FEEDBACK-PR-2.md

## Summary
✅ Merged successfully at Hour 8
⚠️ Found 2 issues during cross-PR testing

## Issue 1: Type Error in Budget Tests
**File**: `test_budget_allocation.py:45`
**Error**: `TypeError: expected Enum, got str`
**Fix**: Import `Priority` enum, use `Priority.HIGH` instead of `"high"`
**Code**:

```python
# Current (line 45)
participant = make_participant("user1", priority="high")  # ❌

# Fixed
from models import Priority
participant = make_participant("user1", priority=Priority.HIGH)  # ✅
```

## Issue 2: Missing Edge Case
**Test**: `test_tier_priority_boundary`
**Gap**: Doesn't test participant at EXACT threshold
**Recommendation**: Add test case for `staleness == THRESHOLD`

## Cross-PR Test Results
✅ test_budget_with_builders.py: 5/5 passing
❌ test_property_budget.py: 2/100 failing (edge case)
````
Channel 2: GitHub @mention for Immediacy
```markdown
@claude

## PR-2 Integration Issues ⚠️

2 issues found during cross-PR testing. See detailed analysis in `FEEDBACK-PR-2.md`.

**Quick summary**:
1. Type error on line 45 (enum vs string) - easy fix
2. Missing edge case test for threshold boundary

**Action needed**: Please fix within 4 hours if possible. Not blocking other PRs.

**Tests to run**: `pytest tests/test_budget_allocation.py -v`
```
Why dual-channel?
- FEEDBACK file: Complete technical details, code snippets, full context
- GitHub @mention: Immediate notification, quick summary, clear action items
This mimics real team communication: "Hey @teammate, urgent issue - check the detailed doc I wrote."
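The @mention channel can also be posted mechanically. Here is a sketch using the GitHub REST issues-comment endpoint (which works for pull requests too); the repo slug, token handling, and file naming are placeholders:

```python
# post_feedback.py - sketch of the immediate @mention channel.
import os
import requests

def post_mention(repo: str, pr_number: int, summary: str, feedback_file: str) -> None:
    """Post a short @claude comment pointing at the detailed FEEDBACK file."""
    url = f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments"
    headers = {
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    }
    body = f"@claude\n\n{summary}\n\nSee `{feedback_file}` for the full analysis."
    requests.post(url, headers=headers, json={"body": body}, timeout=30).raise_for_status()
```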
The Work Breakdown
PR-1: Test Builders (4-8 hours)
Problem: Test helpers duplicated across 15+ files (200+ lines of duplication)
CREATE: Centralized test_builders.py
```python
# test_builders.py - semantic builders for domain concepts
from models import ParticipantState, ModerationResult

def make_participant(user_id, placement, risk_level=0.5, **kwargs):
    """Create a participant with sensible defaults."""
    return ParticipantState(user_id=user_id, placement=placement,
                            risk_level=risk_level, **kwargs)

def make_result(user_id, classification, timestamp, **kwargs):
    """Create a moderation result with context."""
    return ModerationResult(user_id=user_id, classification=classification,
                            timestamp=timestamp, **kwargs)
```
EXPLOIT: Update 15-20 test files to use builders
PROVE:
- Lines eliminated: 180-200
- Test coverage: Maintained 95%+
- All tests passing
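For a sense of what the EXPLOIT step looks like in one of those test files, a hedged before/after sketch (the field names and values in the "before" version are illustrative):

```python
# Before: each test file builds the full object by hand (repeated across ~15 files)
participant = ParticipantState(
    user_id="user1",
    placement="live",    # illustrative field values
    risk_level=0.5,
    history=[],
)

# After: the shared builder from PR-1 supplies sensible defaults
participant = make_participant("user1", placement="live")
```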
PR-2: Budget Allocation Unit Tests (6-12 hours)
Problem: Complex priority/budget logic only tested through slow integration tests
CREATE: Unit tests for core allocation logic (15+ tests)
EXPLOIT: Run in CI for fast feedback (milliseconds vs seconds)
PROVE:
- Fast feedback loop for critical business logic
- Catches regressions before expensive integration tests
- Documents the priority algorithm
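A hedged sketch of what one of these unit tests might look like, reusing the `apply_budget` name that also appears in the property-test example below and the PR-1 builders; the `tier` argument and the result shape are assumptions, not the project's actual signatures:

```python
def test_tier_1_wins_when_budget_allows_only_one():
    """Under a budget of one, the tier-1 participant must be selected first."""
    participants = [
        make_participant("user_low", tier=3),
        make_participant("user_high", tier=1),
    ]
    result = apply_budget(participants, budget=1)
    assert [p.user_id for p in result.selected] == ["user_high"]

def test_zero_budget_selects_nobody():
    result = apply_budget([make_participant("user1", tier=1)], budget=0)
    assert result.selected == []
```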
PR-3: Segment Duration Tests (3-6 hours)
Problem: Cost optimization logic (grouping) not tested
CREATE: Tests for segment grouping behavior (8+ tests)
EXPLOIT: Validate cost savings in realistic scenarios
PROVE: Confirms grouping reduces API calls by 60%
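As an illustration of the kind of assertion involved (the function names here are hypothetical stand-ins for the project's grouping logic):

```python
def test_grouping_cuts_api_calls_by_at_least_60_percent():
    """Adjacent segments should be batched into far fewer API calls."""
    segments = [make_segment(start=i * 10, duration=10) for i in range(20)]
    grouped = group_segments(segments, max_gap=5)
    assert api_call_count(grouped) <= 0.4 * api_call_count(segments)
```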
PR-4: Property-Based Tests (6-12 hours)
Problem: Fixed test scenarios may miss edge cases
CREATE: Property tests using hypothesis library
```python
from hypothesis import given, strategies as st

@given(
    participants=st.lists(st.builds(Participant), min_size=10, max_size=100),
    budget=st.integers(min_value=1, max_value=50),
)
def test_budget_never_exceeded(participants, budget):
    """Property: budget NEVER exceeded regardless of inputs"""
    result = apply_budget(participants, budget)
    assert len(result.selected) <= budget
```
EXPLOIT: Generate and test 1000+ random scenarios automatically
PROVE: Discovers edge cases that fixed tests miss
PR-5: Placement Change Tests (4-8 hours)
Problem: Critical gap - dynamic state changes not tested
CREATE: Tests for state change detection (10+ tests)
EXPLOIT: Validate against production scenarios
PROVE: Catches safety violations before they reach users
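A hedged sketch of one such test; `detect_placement_changes`, the placement values, and the tuple shape of its result are assumptions:

```python
def test_placement_change_between_snapshots_is_detected():
    before = make_participant("user1", placement="supervised")
    after = make_participant("user1", placement="unsupervised")
    changes = detect_placement_changes(previous=[before], current=[after])
    assert changes == [("user1", "supervised", "unsupervised")]
```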
PR-6: Integration (24-48 hours PARALLEL)
Role: Continuous merger + cross-PR exploiter + feedback provider
Starts at Hour 0 (not sequentially at the end!)
Activities:
- Hours 0-24: Watch all PR branches, merge immediately when ready
- Hours 8-48: Create cross-PR integration tests as capabilities combine
- Continuous: Provide dual-channel feedback on integration issues
- Hours 40-48: Final verification, metrics report, PR to main
Creates: 5-10 cross-PR integration tests that couldn't exist in individual PRs
The Recipe: How to Replicate This
Want to try this on your codebase? Here's the complete workflow:
Step 1: Deep Analysis (1-2 hours)
Prompt:
Analyze the test suite for [YOUR_SYSTEM] looking for:
1. Gaps in coverage (what's not tested?)
2. Overlaps and duplication (what's tested redundantly?)
3. Useful test harnesses that would help
4. Opportunities for parameterization
5. Library changes that improve testability
Create a comprehensive report with:
- Git history analysis (what broke and why?)
- Cross-file analysis (where's the duplication? quantify it)
- Prioritized recommendations by impact
Focus on MEANINGFUL IMPROVEMENTS only (not trivial line reductions).
Output: Comprehensive analysis document (ours: 1,231 lines) with:
- Quantified duplication ("107 instances across 15 files")
- Git history patterns ("tests broke 40+ times during schema changes")
- Specific recommendations with line numbers
- Prioritization by impact
Step 2: Parallel Execution Plan (2-4 hours)
Prompt:
Transform the analysis into an executable parallel plan:
1. MASTER DOCUMENT:
- Overall plan and timeline
- 5 parallel work streams (PRs) with clear names
- Why they can run in parallel (zero conflicts)
- Success criteria
2. INDIVIDUAL TASK DOCUMENTS (one per PR):
- What to CREATE (the capability)
- How to EXPLOIT it (use extensively in SAME PR)
- How to PROVE it worked (measurable metrics)
- Files to modify (ensure no conflicts with other PRs)
- Success criteria and metrics template
3. INTEGRATION DOCUMENT:
- How to merge continuously (don't wait for all PRs)
- What cross-PR exploitations to create (emergent value)
- How to provide feedback (dual-channel: file + @mention)
REQUIREMENTS:
- Each PR must CREATE + EXPLOIT + PROVE (not just create)
- Zero conflicts by design (each PR owns its files)
- Integration runs in parallel from Hour 0 (not sequential)
- Integration creates emergent value (combinations)
- Comprehensive metrics collection
Key refinement prompts (iterate on initial plan):
Refinement 1:
"Each PR should not only CREATE a capability but also EXPLOIT IT.
That way each PR can PROVE it did something useful and not buggy."
Refinement 2:
"The integration PR should work in PARALLEL to all other PRs
and actively merge their work. If there are problems it can write
feedback in a FEEDBACK.md file on that PR."
Refinement 3:
"Give the integration PR a job to integrate/exploit new capabilities
that couldn't be used across PRs. If it sees problems it should report
via FEEDBACK.md file AND post a comment mentioning @claude with
specific instructions."
Output:
- Master plan with zero-conflict design
- 5+ PR task documents with CREATE + EXPLOIT + PROVE structure
- Integration conductor document with cross-PR exploitation plan
- Metrics template
Step 3: Execution (24-48 hours)
Setup:
- Create master work ticket
- Create 5 sub-tickets (one per worker PR)
- Create 1 integration ticket
- Spawn 6 parallel Claude Code sessions
Key principle: Each AI agent works independently with zero knowledge of other sessions (except what's in git). This tests autonomous operation.
Each session gets:
- Task document (e.g., `PR-1-TEST-BUILDERS.md`)
- Metrics template
- Clear success criteria
Integration session:
- Starts immediately (Hour 0)
- Watches all PR branches
- Merges within 1-4 hours of PR completion
- Creates cross-PR tests
- Provides dual-channel feedback
Step 4: Metrics Collection (Throughout)
Track everything:
Code Metrics:
- Lines added/removed/net change
- Files created/modified/deleted
- Duplication eliminated (before/after counts)
Test Metrics:
- Tests added/modified/removed
- Coverage before/after (maintain or improve)
- Execution time delta (faster feedback?)
Time Metrics:
- Start/end timestamps
- Duration breakdown (planning/coding/testing/debugging)
- Actual vs estimated time
- Parallel efficiency = (Sequential estimate) / (Actual time)
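As a worked example of the efficiency metric, assuming 8 focused hours per sequential day (an assumption; the plan only gives day and hour ranges):

```python
sequential_estimate_hours = 9 * 8    # midpoint of the 8-10 day estimate
actual_wall_clock_hours = 36         # midpoint of the 24-48 hour target
parallel_efficiency = sequential_estimate_hours / actual_wall_clock_hours
print(parallel_efficiency)           # 2.0 -> twice the throughput of one sequential stream
```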
Quality Metrics:
- Gaps closed (critical/important/nice-to-have)
- Brittleness reduced (measure: tests broken during refactoring)
- Developer experience (qualitative assessment)
Experiment Metrics (the key ones):
- Total time: actual vs sequential estimate
- Merge conflicts: count and resolution time (target: zero)
- Cross-PR value: tests that couldn't exist in individual PRs
- Feedback effectiveness: response time, issue resolution rate
- Coordination overhead: integration time vs time saved
Step 5: Analysis (After completion)
Quantitative analysis:
- Did we hit the 70-85% time reduction target?
- Was conflict count actually zero?
- How much emergent value from cross-PR tests?
- What was parallel efficiency?
Qualitative analysis:
- What worked surprisingly well?
- What failed unexpectedly?
- Where were hidden dependencies?
- Was dual-channel feedback effective?
- Would we do this again? What would we change?
The Hypothesis Matrix
| Question | Measurement | Our Hypothesis | How We'll Know |
|---|---|---|---|
| Does parallel execution save time? | Total time vs sequential baseline | 24-48 hours vs 8-10 days (70-85% reduction) | Compare actual completion time |
| Does it introduce conflicts? | Merge conflict count | Zero conflicts by design | Count merge conflicts encountered |
| Does it maintain quality? | Coverage, duplication, gaps closed | Equal or better quality | Compare metrics before/after |
| Does cross-PR create value? | Number of integration tests | 5-10 tests that couldn't exist in individual PRs | Count cross-PR tests created |
| Is coordination overhead worth it? | Integration PR time vs time saved | Integration creates value beyond coordination | Compare integration effort to time saved |
Why This Matters
This experiment tests a fundamental question:
"Can we parallelize software development by giving independent AI agents independent, well-defined tasks?"
If it works, it suggests:
- Complex refactoring can be parallelized (not inherently sequential)
- AI agents can work autonomously with clear specs (minimal coordination)
- Coordination overhead can be minimized (detailed specs > real-time communication)
- Development velocity scales horizontally (more agents = faster completion)
- Emergent value from integration (whole > sum of parts)
If it doesn't work, we learn:
- Where parallel development breaks down (hidden dependencies, interface coupling)
- What conflicts are unavoidable (shared state, cross-cutting concerns)
- Required coordination overhead (feedback loops, iteration cycles)
- Whether complexity is worth speed (diminishing returns, coordination cost)
- Optimal task decomposition level (too granular vs too coarse)
Either way, we get data.
What Makes This Different
This isn't just "use AI to write code faster." This is:
A systematic workflow for parallel AI development:
- Conflict-free architecture (zero-conflict by design)
- CREATE + EXPLOIT + PROVE pattern (end-to-end validation)
- Parallel integration conductor (emergent value)
- Dual-channel feedback (fast communication)
- Comprehensive metrics (measure everything)
A reproducible recipe anyone can follow:
- Step-by-step prompts
- Design principles
- Common pitfalls to avoid
- Metrics to track
An honest experiment with clear success/failure criteria:
- Null hypothesis stated upfront
- Quantitative measurements
- We'll report results regardless of outcome
What's Next: Part 2
After the experiment completes (estimated: 1-2 weeks), we'll publish Part 2 with:
Quantitative Results:
- Actual time taken vs 8-10 day estimate
- Merge conflicts encountered (we predict zero)
- Code quality improvements (duplication, coverage, gaps)
- Cross-PR exploitation value (number of integration tests)
- Parallel efficiency calculation
Qualitative Insights:
- What worked surprisingly well (unexpected synergies?)
- What failed unexpectedly (hidden dependencies?)
- Communication effectiveness (did dual-channel feedback work?)
- Integration challenges (what did we miss?)
- Whether CREATE + EXPLOIT + PROVE delivered value
Lessons Learned:
- Best practices for parallel AI development
- Task decomposition strategies (how granular is optimal?)
- Metrics that mattered vs metrics that didn't
- When to parallelize vs stay sequential
- Would we do this again? What would we change?
The Code:
- Before/after metrics comparison
- FEEDBACK files showing communication
- Complete metrics report
Questions for the Community
We'd love your input before we run this:
- Have you tried coordinating multiple AI agents? How did it go?
- What do you think will break? Where are the hidden dependencies?
- What metrics would you add? What are we missing?
- Would you try this? For what kind of projects?
- What's the optimal team size? We chose 5+1 - would you use more or fewer agents?
Drop your thoughts in the comments!
Tags: #ai #claude #softwaredevelopment #testing #experiment #parallel #workflow #devops #automation
This is Part 1 of a 2-part series on parallel AI-assisted development. Part 2 will report results and lessons learned.