rokoss21

Posted on Jan 19

Swarm-IOSM: Orchestrating Parallel AI Agents with Quality Gates

#ai #claude #automation #engineering

TL;DR: Swarm-IOSM is an orchestration engine for Claude Code that transforms complex development tasks into coordinated parallel work streams. It implements continuous dispatch scheduling (no wave barriers), hierarchical file lock management, and enforces IOSM quality gates before merge. Claimed speedup: commonly 3-8x over sequential execution, depending on how much of the plan can actually run in parallel (the worked example below reaches ~1.7x).

The Parallel Agent Problem

You're working on a complex feature. It needs:

Codebase analysis to understand existing patterns
Architecture design for the new system
Implementation across 3 modules (independent)
Integration tests
Security audit

Traditional approach: One agent does everything sequentially. 15 hours of wall-clock time.

What if you could run analysis, design, and implementation in parallel? 4-6 hours.

But here's the catch: parallel AI agents need coordination. They can't all edit the same file. They need to share knowledge. And you need quality guarantees before merging their work.

That's what Swarm-IOSM solves.

What is Swarm-IOSM?

Swarm-IOSM is a Claude Code Skill that orchestrates parallel AI agent execution with built-in quality enforcement. It combines:

Continuous Dispatch Loop — Tasks launch immediately when dependencies are met (no artificial wave barriers)
File Lock Management — Hierarchical conflict detection prevents parallel write chaos
PRD-Driven Planning — Structured requirements → decomposition → execution
IOSM Quality Gates — Automated code quality, performance, and modularity checks
Auto-Spawn Protocol — Agents discover new work during execution

Core Model

Touches → Locks → Gates → Done

A correctness model for parallel agent work:

Declare what files you touch
Acquire locks to prevent conflicts
Pass quality gates
Ship

Key Innovation: Continuous Dispatch

Traditional orchestration waits for entire "waves" to complete:

Wave 1: [T01, T02, T03] → Wait for ALL to finish
Wave 2: [T04, T05]      → Can't start until Wave 1 done

Swarm-IOSM uses continuous scheduling:

T01 done → T04 starts IMMEDIATELY (even if T02, T03 still running)

This eliminates idle time and maximizes parallelism. Here's the dispatch algorithm:

while True:
    # 1. Collect ready tasks (deps satisfied, no lock conflicts)
    ready = [t for t in backlog if deps_satisfied(t) and not conflicts(t)]

    # 2. Classify by mode (background vs foreground)
    bg = [t for t in ready if can_auto_background(t)]
    fg = [t for t in ready if needs_user_input(t)]

    # 3. Dispatch a batch (up to 6 background + 2 foreground tasks) and
    #    remove it from the backlog so no task is launched twice
    batch_bg, batch_fg = bg[:6], fg[:2]
    launch_parallel(batch_bg, mode='background')
    launch_parallel(batch_fg, mode='foreground')
    backlog = [t for t in backlog if t not in batch_bg + batch_fg]

    # 4. Monitor & spawn: completed reports may add new work
    for report in collect_completed():
        spawn_candidates = parse_spawn_candidates(report)
        backlog.extend(deduplicate(spawn_candidates))

    # 5. Stop once every gate passes
    if all_gates_pass():
        break

Result: Tasks launch as soon as they're ready, not when an arbitrary wave completes.
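
To make the difference concrete, here is a minimal sketch that compares the two scheduling styles on a small task graph. The task names, durations, and dependencies are hypothetical, and the model assumes unlimited workers and no lock conflicts, so it illustrates the idea rather than the orchestrator's actual scheduler.

```python
from functools import lru_cache

# Hypothetical task graph: name -> (duration in hours, dependencies).
TASKS = {
    "T01": (1.0, []),
    "T02": (3.0, []),
    "T03": (1.0, []),
    "T04": (2.0, ["T01"]),
    "T05": (1.0, ["T02", "T03"]),
}


@lru_cache(maxsize=None)
def level(name: str) -> int:
    """Wave number: 0 for tasks without deps, else 1 + deepest dependency."""
    deps = TASKS[name][1]
    return 0 if not deps else 1 + max(level(d) for d in deps)


def wave_makespan() -> float:
    """Each wave lasts as long as its slowest task; waves run back to back."""
    longest = {}
    for name, (duration, _) in TASKS.items():
        lvl = level(name)
        longest[lvl] = max(longest.get(lvl, 0.0), duration)
    return sum(longest.values())


@lru_cache(maxsize=None)
def finish_time(name: str) -> float:
    """Continuous dispatch: a task starts the moment its last dependency ends."""
    duration, deps = TASKS[name]
    start = max((finish_time(d) for d in deps), default=0.0)
    return start + duration


def continuous_makespan() -> float:
    """Wall-clock time when every task starts as early as its deps allow."""
    return max(finish_time(name) for name in TASKS)


print(wave_makespan())        # 5.0
print(continuous_makespan())  # 4.0
```

In this toy graph T04 only needs T01, so under continuous dispatch it starts after one hour instead of waiting three hours for T02 to close out the first wave.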

Live Example: Adding Redis Caching

Let's walk through a real track from examples/demo-track/.

Problem

API endpoint /api/natal/chart has 450ms P95 latency. Database CPU at 75% during peak hours.

Goal

Add Redis caching to reduce latency to <200ms and achieve 80%+ cache hit rate.

Step 1: Create Track

/swarm-iosm new-track "Add Redis caching to API endpoints"

Claude generates:

PRD.md — 10 sections (Problem, Goals, Requirements, Risks, IOSM Targets)
spec.md — Technical design with acceptance tests
plan.md — Task breakdown with dependencies

Generated plan (7 tasks):

T01: Analyze current performance (Explorer, 1h, read-only)
T02: Design caching strategy (Architect, 2h, foreground)
T03: Implement cache service (Implementer-A, 3h, background)
T04: Add caching to /natal endpoint (Implementer-B, 2h, background, after T03)
T05: Add caching to /transits endpoint (Implementer-C, 2h, background, after T03)
T06: Integration testing (TestRunner, 2h, background, after T04+T05)
T07: Security audit + merge (Integrator, 1h, foreground, after T06)

Step 2: Execute Plan

/swarm-iosm implement

Orchestrator creates continuous_dispatch_plan.md:

## Initial Ready Set
- T01 (Explorer, background)

## Expected Timeline
Batch 1: T01 → completes in 1h
Batch 2: T02 → completes in 2h (total: 3h)
Batch 3: T03 → completes in 3h (total: 6h)
Batch 4: T04, T05 (PARALLEL) → completes in 2h (total: 8h)
Batch 5: T06 → completes in 2h (total: 10h)
Batch 6: T07 → completes in 1h (total: 11h)

Serial estimate: 13h
Parallel estimate: 11h
Speedup: ~1.2x
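
The estimates above can be reproduced with a short critical-path calculation. This is a sketch under one assumption: the T01 → T02 → T03 chain is inferred from the timeline, since the plan only states explicit "after" constraints for T04 through T07.

```python
# Durations (hours) and dependencies for the seven-task demo plan.
# Assumption: T02 waits for T01 and T03 waits for T02, as the timeline implies.
PLAN = {
    "T01": (1, []),
    "T02": (2, ["T01"]),
    "T03": (3, ["T02"]),
    "T04": (2, ["T03"]),
    "T05": (2, ["T03"]),
    "T06": (2, ["T04", "T05"]),
    "T07": (1, ["T06"]),
}


def critical_path_hours(plan: dict) -> int:
    """Length of the longest dependency chain (best-case wall-clock time)."""
    finish = {}

    def finish_of(task: str) -> int:
        if task not in finish:
            hours, deps = plan[task]
            finish[task] = hours + max((finish_of(d) for d in deps), default=0)
        return finish[task]

    return max(finish_of(task) for task in plan)


serial = sum(hours for hours, _ in PLAN.values())
parallel = critical_path_hours(PLAN)
print(serial, parallel, round(serial / parallel, 2))  # 13 11 1.18
```

Only T04 and T05 overlap in this plan, which is why the initial estimate is a modest ~1.2x: the speedup is bounded by the longest dependency chain, not by the number of agents.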

But wait — T01 discovers an N+1 query issue:

## SpawnCandidates (from T01 report)

| ID | Subtask | Touches | Effort | Severity |
|----|---------|---------|--------|----------|
| SC-01 | Optimize calculate_aspects N+1 query | `backend/core/astro/natal.py` | M | medium |

Orchestrator auto-spawns SC-01 and adjusts timeline.

Step 3: Integration & Quality Gates

/swarm-iosm integrate demo-add-caching

Generated iosm_report.md:

## Gate Evaluation Summary

| Gate | Target | Final | Status |
|------|--------|-------|--------|
| Gate-I (Code Quality) | ≥0.75 | 0.89 | ✅ PASS |
| Gate-O (Performance) | Tests pass | All pass | ✅ PASS |
| Gate-M (Modularity) | No circular deps | Pass | ✅ PASS |
| Gate-S (Simplicity) | API stable | N/A | ⚪ SKIP |

IOSM-Index: 0.85 ✅ (threshold: 0.80)

**Result:** APPROVED FOR PRODUCTION MERGE

Results

⚡ P95 latency: 450ms → 180ms (60% improvement)
🎯 Cache hit rate: 82%
✅ All tests passing (24 unit + 6 integration)
🔒 Zero production errors during rollout
⏱️ Total time: 9.25h parallel vs 16h+ sequential (~1.7x faster)

Technical Deep Dive

1. File Lock Management

Challenge: How do you prevent two agents from editing the same file simultaneously?

Solution: Hierarchical lock manager with folder/file awareness.

Lock rules:

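import posixpath

def normalize(path: str) -> str:
    # Assumed helper (not shown in the original post): collapse duplicate
    # separators and drop a trailing slash, e.g. 'backend/api/' -> 'backend/api'
    return posixpath.normpath(path)

def conflicts(lock_a: str, lock_b: str) -> bool:
    a, b = normalize(lock_a), normalize(lock_b)
    # Exact match
    if a == b:
        return True
    # One path is a folder that contains the other
    if a.startswith(b + '/') or b.startswith(a + '/'):
        return True
    return False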

Example:

## Lock Plan

Tasks with overlapping touches (sequential only):
- `backend/core/__init__.py`: T03, T04 → ❌ Cannot run parallel
- `backend/api/`: T05, T06 → ❌ Folder conflict

Safe parallel execution:
- `backend/auth.py` (T02) + `backend/payments.py` (T07) → ✅ No overlap

Read-only tasks: Always parallel (no locks needed).
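
A lock plan like the one above could be derived by checking every pair of tasks against the conflict rule. The sketch below is illustrative only: the `TOUCHES` mapping, the `backend/api/routes.py` path, and the `conflicting_pairs` helper are hypothetical, and `normalize` is assumed to be plain path normalization.

```python
import posixpath
from itertools import combinations


def normalize(path: str) -> str:
    """Assumed normalization: collapse separators, drop any trailing slash."""
    return posixpath.normpath(path)


def conflicts(lock_a: str, lock_b: str) -> bool:
    """True when the paths are equal or one is a folder containing the other."""
    a, b = normalize(lock_a), normalize(lock_b)
    return a == b or a.startswith(b + "/") or b.startswith(a + "/")


# Hypothetical touches, loosely following the lock plan above.
TOUCHES = {
    "T02": ["backend/auth.py"],
    "T05": ["backend/api/"],
    "T06": ["backend/api/routes.py"],
    "T07": ["backend/payments.py"],
}


def conflicting_pairs(touches: dict) -> list:
    """Return every pair of tasks whose declared touches overlap."""
    pairs = []
    for (task_a, paths_a), (task_b, paths_b) in combinations(touches.items(), 2):
        if any(conflicts(pa, pb) for pa in paths_a for pb in paths_b):
            pairs.append((task_a, task_b))
    return pairs


print(conflicting_pairs(TOUCHES))  # [('T05', 'T06')]
```

Appending `/` before the prefix check matters: without it, `backend/api` would wrongly conflict with a sibling such as `backend/api_v2.py`.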

2. IOSM Quality Gates

Four gates enforce production-grade quality:

Gate-I: Improve (Code Quality)

semantic_coherence: ≥0.95  # Clear naming, no magic numbers
duplication_max: ≤0.05     # Max 5% duplicate code
invariants_documented: true # Pre/post-conditions
todos_tracked: true        # All TODOs in issue tracker

Measured by: AST analysis, clone detection, docstring coverage.

Gate-O: Optimize (Performance & Resilience)

latency_ms:
  p50: ≤100
  p95: ≤200
  p99: ≤500
error_budget_respected: true
chaos_tests_pass: true
no_obvious_inefficiencies: true  # N+1 queries, memory leaks

Measured by: Load testing (locust, k6), chaos engineering, profiling.
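
As a rough illustration of the latency part of this gate, the sketch below computes p50/p95/p99 from raw samples and compares them with the thresholds listed above. The sample data is synthetic and the function names are hypothetical; a real setup would take its numbers from a load-testing tool rather than a hand-built list.

```python
import statistics

# Thresholds copied from the Gate-O config above (milliseconds).
THRESHOLDS_MS = {"p50": 100, "p95": 200, "p99": 500}


def latency_percentiles(samples_ms: list) -> dict:
    """Estimate p50/p95/p99 from raw latency samples."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def latency_gate(samples_ms: list) -> dict:
    """Map each percentile to True when it is within its threshold."""
    measured = latency_percentiles(samples_ms)
    return {name: measured[name] <= limit for name, limit in THRESHOLDS_MS.items()}


# Synthetic samples: 90 fast requests, 5 slower ones, and 5 outliers.
samples = [80.0] * 90 + [180.0] * 5 + [900.0] * 5
print(latency_gate(samples))  # {'p50': True, 'p95': False, 'p99': False}
```

The median looks healthy here while the tail fails, which is the reason the gate checks three percentiles instead of an average.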

Gate-M: Modularize (Clean Boundaries)

contracts_defined: 1.0       # 100% of modules
change_surface_max: 0.20     # ≤20% of codebase touched
no_circular_deps: true
coupling_acceptable: true

Measured by: Dependency graph analysis, interface stability.

Gate-S: Shrink (Minimal Complexity)

api_surface_reduction: ≥0.20  # Or justified growth
dependency_count_stable: true
onboarding_time_minutes: ≤15

Measured by: Public API count, requirements.txt diff, README clarity.

IOSM-Index Calculation:

IOSM-Index = (Gate-I + Gate-O + Gate-M + Gate-S) / 4
Production Threshold: ≥ 0.80
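
The formula is a plain unweighted mean, which a few lines of Python can express. The scores below are hypothetical, and because the post does not say how a skipped gate enters the average, this sketch simply requires all four.

```python
from statistics import mean

PRODUCTION_THRESHOLD = 0.80


def iosm_index(gate_scores: dict) -> float:
    """Unweighted mean of the four gate scores, as in the formula above."""
    required = ("I", "O", "M", "S")
    return mean(gate_scores[gate] for gate in required)


# Hypothetical scores; not taken from any report in this post.
scores = {"I": 0.90, "O": 0.80, "M": 0.90, "S": 0.70}
index = iosm_index(scores)
print(round(index, 3), index >= PRODUCTION_THRESHOLD)  # 0.825 True
```

Note that an average can mask one weak gate (0.70 for Gate-S here still clears 0.80 overall), so per-gate thresholds remain useful alongside the index.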

Auto-spawn rules:

Gate-I < 0.75 → Spawn clarity/duplication fixes
Gate-O fails → Spawn test/performance fixes
Gate-M fails → Spawn boundary clarification tasks

3. Auto-Spawn Protocol

Problem: Agents discover issues during execution (e.g., N+1 queries, missing tests).

Solution: Structured SpawnCandidates section in reports.

Format:

## SpawnCandidates

| ID | Subtask | Touches | Effort | User Input | Severity | Dedup Key | Accept Criteria |
|----|---------|---------|--------|------------|----------|-----------|-----------------|
| SC-01 | Fix missing type annotation | `backend/auth.py` | S | false | medium | auth.py\|type-annot | mypy passes |
| SC-02 | Clarify API contract | `docs/api_spec.yaml` | M | true | high | api_spec\|contract | Contract approved |

Orchestrator actions:

Parse SpawnCandidates from completed task reports
Deduplicate by dedup_key (prevents duplicate work)
If needs_user_input=false and severity != critical → auto-spawn
If needs_user_input=true → Add to blocked queue
Run new tasks through planner and dispatch

Spawn protection: Budget limits (default: 20 auto-spawns per track) prevent infinite loops.
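
The orchestrator actions above can be sketched as a small parser plus a triage step. This is an illustration, not the skill's implementation: the SC-03 row is a hypothetical duplicate added to exercise deduplication, and the `held` bucket for critical or over-budget candidates is an assumption, since the post does not describe what happens to them.

```python
import re

REPORT = r"""
| ID | Subtask | Touches | Effort | User Input | Severity | Dedup Key | Accept Criteria |
|----|---------|---------|--------|------------|----------|-----------|-----------------|
| SC-01 | Fix missing type annotation | `backend/auth.py` | S | false | medium | auth.py\|type-annot | mypy passes |
| SC-02 | Clarify API contract | `docs/api_spec.yaml` | M | true | high | api_spec\|contract | Contract approved |
| SC-03 | Annotate auth helpers | `backend/auth.py` | S | false | medium | auth.py\|type-annot | mypy passes |
"""


def split_row(line: str) -> list:
    """Split a table row on unescaped pipes and unescape the cells."""
    cells = re.split(r"(?<!\\)\|", line.strip())[1:-1]
    return [cell.strip().replace("\\|", "|") for cell in cells]


def parse_spawn_candidates(report: str) -> list:
    """Turn the SpawnCandidates table into a list of dicts keyed by header."""
    rows = [ln for ln in report.splitlines() if ln.strip().startswith("|")]
    header = split_row(rows[0])
    return [dict(zip(header, split_row(ln))) for ln in rows[2:]]  # rows[1] is the separator


def triage(candidates: list, seen_keys: set, budget: int = 20) -> dict:
    """Deduplicate, then route each candidate by the rules listed above."""
    result = {"spawn": [], "blocked": [], "held": []}
    for cand in candidates:
        key = cand["Dedup Key"]
        if key in seen_keys:
            continue  # duplicate work, drop it
        seen_keys.add(key)
        if cand["User Input"] == "true":
            result["blocked"].append(cand["ID"])
        elif cand["Severity"] != "critical" and len(result["spawn"]) < budget:
            result["spawn"].append(cand["ID"])
        else:
            result["held"].append(cand["ID"])  # assumption: needs a human decision
    return result


print(triage(parse_spawn_candidates(REPORT), seen_keys=set()))
# {'spawn': ['SC-01'], 'blocked': ['SC-02'], 'held': []}
```

Passing `seen_keys` in from the caller lets the same set persist across reports, so a candidate raised by two different agents is only spawned once per track.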

4. Cost Tracking & Model Selection

Model selection rules:

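| Model | Use Case | Cost per 1M tokens (input / output) |
|-------|----------|-------------------------------------|
| Haiku | Read-only analysis | $0.25 / $1.25 |
| Sonnet | Standard implementation | $3.00 / $15.00 |
| Opus | Architecture, security | $15.00 / $75.00 |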

Budget controls:

Default limit: $10.00 per track
⚠️ 80% usage → Warning
🛑 100% usage → Pause execution

Check current spend:

## Cost Tracking (from iosm_state.md)
- budget_total: $10.00
- spent_so_far: $6.50
- remaining: $3.50
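
For a feel of how the budget controls interact with the price table, here is a minimal sketch. It reads the two prices per model as input / output cost per 1M tokens; the token counts and function names are hypothetical, and the real skill may track spend differently.

```python
# Prices in USD per 1M tokens (input, output), copied from the table above.
PRICES = {
    "haiku": (0.25, 1.25),
    "sonnet": (3.00, 15.00),
    "opus": (15.00, 75.00),
}


def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one agent run in USD."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def budget_status(spent: float, budget: float = 10.00) -> str:
    """Apply the 80% warning and 100% pause thresholds."""
    if spent >= budget:
        return "pause"
    if spent >= 0.8 * budget:
        return "warning"
    return "ok"


# Hypothetical token counts for three tasks: (model, input, output).
runs = [
    ("haiku", 400_000, 40_000),
    ("sonnet", 600_000, 120_000),
    ("opus", 100_000, 20_000),
]
spent = sum(task_cost(*run) for run in runs)
print(f"${spent:.2f}", budget_status(spent))  # $6.75 ok
```

In this made-up track the single Opus run costs about as much as the much larger Sonnet run, which is the motivation for routing read-only analysis to the cheapest model.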

Real-World Use Cases

1. Greenfield Feature (Email Notifications)

Task: Add complete email notification system to SaaS app

Plan:

T01: Design email templates (Architect, foreground)
T02: Implement SMTP service (Implementer-A, background)
T03: Add queue system (Implementer-B, background, parallel with T02)
T04: Write integration tests (TestRunner, background, after T02+T03)
T05: Add API endpoints (Implementer-C, background, after T02)

Results:

⚡ ~3x faster (4-6h parallel vs 12-15h sequential)
✅ 100% test coverage (Gate-O enforcement)
📉 Minimal technical debt (Gate-I: 0.92)

2. Brownfield Refactoring (Payment Module)

Task: Refactor legacy payment processing (5000+ LOC, 3 years old)

Workflow:

Plan mode: Explorer analyzes codebase (read-only, safe)
PRD with rollback strategy
Comprehensive regression tests (before touching code)
Parallel implementation (2 modules refactored simultaneously)
Gate-M fails: Circular dependency detected
Auto-spawn: "Break circular import between Payment and Invoice"
Re-check Gate-M: Pass ✅

Results:

🎯 Gate-driven quality — Forced resolution of hidden issues
🔒 Safe refactor — All tests passing before merge
📊 Measured improvement — 40% reduction in module coupling

3. Multi-Module Feature (Multi-Tenant Architecture)

Task: Add multi-tenancy (affects 8 modules)

Plan: 20+ tasks across 5 waves

Wave 1: T01 Design schema (Architect, critical path)
Wave 2: T02-T04 Database migrations (3 parallel implementers)
Wave 3: T05-T10 Update 6 modules (6 parallel implementers)
Wave 4: T11-T15 Tests (5 parallel test runners)
Wave 5: T16 Integration

Auto-spawn: 3 critical tasks discovered during execution

Results:

📈 High parallelism — 6 modules updated simultaneously
💰 Budget control — $6.50 spent (within $10 limit)
⏱️ Time savings — ~18h parallel vs 60h+ sequential

Getting Started (5 Minutes)

Installation

# Clone into Claude Code skills directory
git clone https://github.com/rokoss21/swarm-iosm.git .claude/skills/swarm-iosm

Verify: type /swarm-iosm in Claude Code.

Create Your First Track

/swarm-iosm new-track "Add user authentication with JWT"

Claude will:

Ask questions (mode: greenfield/brownfield, priorities, constraints)
Generate PRD (10 sections)
Create plan.md with task breakdown
Show orchestration plan

Execute

/swarm-iosm implement

Watch the magic:

Parallel agents launch automatically
Progress tracked in iosm_state.md
Reports appear in reports/ directory

Integrate

/swarm-iosm integrate <track-id>

Quality gates run automatically. You get iosm_report.md with pass/fail.

Commands Reference

| Command | Description |
|---------|-------------|
| `/swarm-iosm setup` | Initialize project context |
| `/swarm-iosm new-track "<desc>"` | Create feature track |
| `/swarm-iosm implement` | Execute plan (auto mode) |
| `/swarm-iosm status` | Check progress |
| `/swarm-iosm watch` | Live monitoring (v1.3) |
| `/swarm-iosm simulate` | Dry-run with timeline (v1.3) |
| `/swarm-iosm resume` | Resume after crash (v1.3) |
| `/swarm-iosm retry <task-id>` | Retry failed task (v1.2) |
| `/swarm-iosm integrate <id>` | Merge and run gates |

What Swarm-IOSM is NOT

To set clear expectations:

❌ Not a general-purpose workflow engine — Designed specifically for Claude Code agent orchestration
❌ Not a replacement for CI/CD — Complements your pipeline, doesn't replace it
❌ Not a code generator "autopilot" — Requires human oversight and decision-making
❌ Not safe to run unattended on production repos — Always review changes before merge

Architecture Overview

┌──────────────────────────────────────────────────────────────────────┐
│                    ORCHESTRATOR (Main Claude Agent)                  │
│  ┌─────────────────────────────────────────────────────────────────┐ │
│  │              Continuous Dispatch Loop (v1.1+)                   │ │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────────┐ │ │
│  │  │ Collect  │→ │ Classify │→ │ Conflict │→ │ Dispatch Batch   │ │ │
│  │  │  Ready   │  │  Modes   │  │  Check   │  │ (max 3-6 tasks)  │ │ │
│  │  └──────────┘  └──────────┘  └──────────┘  └──────────────────┘ │ │
│  │       ↑                                           │             │ │
│  │       │        ┌──────────┐  ┌──────────┐         ↓             │ │
│  │       └────────│  IOSM    │←─│ Auto-    │←────────┘             │ │
│  │                │  Gates   │  │ Spawn    │                       │ │
│  │                └──────────┘  └──────────┘                       │ │
│  └─────────────────────────────────────────────────────────────────┘ │
│                                   │                                  │
│               ┌───────────────────┼───────────────────┐              │
│               ↓                   ↓                   ↓              │
│  ┌────────────────────┐ ┌────────────────────┐ ┌─────────────────┐   │
│  │   Subagent (BG)    │ │   Subagent (BG)    │ │  Subagent (FG)  │   │
│  │   Explorer         │ │   Implementer-A    │ │  Architect      │   │
│  │   read-only        │ │   write-local      │ │  needs_user     │   │
│  └────────────────────┘ └────────────────────┘ └─────────────────┘   │
│               │                   │                   │              │
│               ↓                   ↓                   ↓              │
│         reports/T01.md      reports/T02.md      reports/T03.md       │
│         + SpawnCandidates   + SpawnCandidates   + Escalations        │
└──────────────────────────────────────────────────────────────────────┘

IOSM Framework Integration

Swarm-IOSM implements the IOSM methodology (Improve → Optimize → Shrink → Modularize) as an executable system:

┌────────────────────────────────────────────────────────────────────────────┐
│                           IOSM FRAMEWORK                                   │
│                   https://github.com/rokoss21/IOSM                         │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────────────┐    │
│    │ IMPROVE  │ →  │ OPTIMIZE │ →  │  SHRINK  │ →  │   MODULARIZE     │    │
│    │          │    │          │    │          │    │                  │    │
│    │ Clarity  │    │ Speed    │    │ Simplify │    │ Decompose        │    │
│    │ No dups  │    │ Resil.   │    │ Surface  │    │ Contracts        │    │
│    │ Invars   │    │ Chaos    │    │ Deps     │    │ Coupling         │    │
│    └────┬─────┘    └────┬─────┘    └────┬─────┘    └────────┬─────────┘    │
│         │               │               │                   │              │
│    ┌────▼─────┐    ┌────▼─────┐    ┌────▼─────┐    ┌────────▼─────────┐    │
│    │ Gate-I   │    │ Gate-O   │    │ Gate-S   │    │     Gate-M       │    │
│    │ ≥0.85    │    │ ≥0.75    │    │ ≥0.80    │    │     ≥0.80        │    │
│    └──────────┘    └──────────┘    └──────────┘    └──────────────────┘    │
│                                                                            │
│    IOSM-Index = (Gate-I + Gate-O + Gate-S + Gate-M) / 4                    │
│    Production threshold: ≥ 0.80                                            │
└────────────────────────────────────────────────────────────────────────────┘

Version History

v2.1 (2026-01-19) — Current

Automated State Management (iosm_state.md auto-generated)
Status Sync CLI (--update-task)
Improved Report Conflict Detection

v2.0 (2026-01-18)

Inter-Agent Communication (shared_context.md)
Task Dependency Visualization (--graph)
Anti-Pattern Detection
Template Customization

v1.3 (2026-01-17)

Simulation Mode (/swarm-iosm simulate) with ASCII Timeline
Live Monitoring (/swarm-iosm watch)
Checkpointing & Resume (/swarm-iosm resume)

v1.2 (2026-01-16)

Concurrency Limits (Resource Budgets)
Cost Tracking & Model Selection (Haiku/Sonnet/Opus)
Intelligent Error Diagnosis & Retry (/swarm-iosm retry)

v1.1 (2026-01-15)

Continuous Dispatch Loop (no wave barriers)
Gate-Driven Continuation
Auto-Spawn from SpawnCandidates
Touches Lock Manager

Contributing

We welcome contributions! Key areas:

Gate Automation Scripts — Measure IOSM criteria automatically
CI/CD Integration — GitHub Actions, GitLab CI examples
Language-Specific Checkers — Python, TypeScript, Rust evaluators
More Examples — Real-world track demonstrations
IDE Integration — VS Code extension

See CONTRIBUTING.md for guidelines.

Conclusion

Swarm-IOSM proves that AI agent orchestration can be both fast (3-8x speedup through parallelism) and safe (quality gates before merge).

The continuous dispatch model eliminates artificial wave barriers, file lock management prevents conflicts, and IOSM gates enforce production-grade standards.

Key takeaway: Don't choose between speed and quality. With proper orchestration, you get both.

Try it today:

git clone https://github.com/rokoss21/swarm-iosm.git .claude/skills/swarm-iosm
/swarm-iosm new-track "Your next feature"

Top comments (1)

Art light • Feb 8

This is a really solid write-up — I like how you don’t just talk about parallel agents in theory, but actually show where coordination breaks down and how you fix it. The continuous dispatch idea + file locks feels like a very practical answer to the “AI chaos” problem a lot of teams will hit soon. I’m especially into the IOSM quality gates part, because speed without enforced standards usually turns into debt fast. The Redis example made it click for me — discovering new work mid-execution and adapting the plan is exactly how real engineering goes. Definitely curious how this behaves on larger teams or mixed human+AI workflows; feels like there’s a lot of potential here.
