TL;DR
I built 140 AI agents on top of Claude Code, organized into 4 boards, 18 organizations, with a constitution, security halt authority, and autonomous decision-making. 367 tests, 135K lines of Python, and it actually runs. This article covers the design philosophy, technical choices, and spectacular failures — nothing held back.
Why 140 Agents?
Using ChatGPT or Claude as a single assistant, I noticed something: AI performs dramatically better as a team of specialists than as one generalist.
Security reviews, article writing, video production, code reviews, tax processing — cramming all of this into one prompt makes everything mediocre. Just like human organizations, specialization + governance is the key to quality.
```
Operator (human) — final authority
        |
Secretary (/ask) — intent parsing → routing
        |
4 Domain Boards — each Chairman owns strategic decisions
        |
├── App Board ────→ Product(10), Design(6), Operations(6), Security(6), LLM(8)
├── Game Board ───→ Game Design(4), Engineering(3), Creative(3)
├── Content Board → Content(6), Revenue(11), Marketing(6), Creative(13), Education(7)
└── Shared Board ─→ Backoffice(7), Research(7), Oracle(3), Autonomous(8), User Testing(21)
```
Total: 135 org agents + 5 Holdings = 140 agents
This structure wasn't built to look impressive. Specialized agents consistently outperformed generalist ones — that's the evidence that led to this design.
Architecture: Protocol-Driven Composition
The AEGIS engine layer (98K LOC) uses Protocol + Composition + DI — a strategic choice to avoid inheritance hell.
Why Protocols?
```python
# DON'T: Inheritance-based (leads to pain)
from abc import ABC, abstractmethod

class BaseAgent(ABC):
    @abstractmethod
    def execute(self): ...

class SecurityAgent(BaseAgent):
    def execute(self): ...

class SecurityPentester(SecurityAgent):  # Multi-level inheritance → hell
    def execute(self): ...
```

```python
# DO: Protocol + Composition (AEGIS actual pattern)
from typing import Protocol

class WorkflowEngineProtocol(Protocol):
    """Interface definition only — implementation is free"""
    def execute_workflow(self, workflow_id: str, context: dict) -> dict: ...
    def get_status(self, execution_id: str) -> dict: ...

class LangGraphEngine:
    """Protocol-compliant implementation A"""
    def execute_workflow(self, workflow_id, context):
        return self._langgraph_execute(workflow_id, context)

class NativeEngine:
    """Protocol-compliant implementation B — no LangGraph needed"""
    def execute_workflow(self, workflow_id, context):
        return self._native_execute(workflow_id, context)

# Injected via DI — switchable at runtime
container.register(WorkflowEngineProtocol, NativeEngine, Lifetime.SINGLETON)
```
The advantage: mocking is trivial in tests. External dependencies are abstracted behind Protocols, so you can test the entire pipeline without an LLM.
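As a sketch of what that looks like in a test (the `WorkflowEngineProtocol` here mirrors the example above; `MockEngine` and `run_stage` are my illustration, not AEGIS code):

```python
from typing import Protocol

class WorkflowEngineProtocol(Protocol):
    def execute_workflow(self, workflow_id: str, context: dict) -> dict: ...

class MockEngine:
    """Test double that satisfies the Protocol: no LLM, no network."""
    def __init__(self):
        self.calls: list[tuple] = []

    def execute_workflow(self, workflow_id: str, context: dict) -> dict:
        self.calls.append((workflow_id, context))
        return {"workflow_id": workflow_id, "status": "completed"}

def run_stage(engine: WorkflowEngineProtocol) -> dict:
    # Structural typing: any object with matching methods is accepted
    return engine.execute_workflow("daily_report", {"org": "content"})
```

In a pytest test you inject `MockEngine`, run the pipeline, and assert on `engine.calls` to verify what was dispatched.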
Three Layers, One Design Principle Each
| Layer | LOC | Pattern | Design Principle |
|---|---|---|---|
| Engine | 98K | Protocol + Composition + DI | Extensibility through abstraction |
| API | 20K | FastAPI pragmatic monolith | Thin Router → Service → ORM |
| UI | 12K | Vanilla ES6 functional modules | Named exports, module closures |
One design principle per layer. This is critical. Initially, I tried applying DDD to every layer and failed. In practice, the API layer didn't need DDD — Pragmatic Layered Architecture was sufficient.
Composition Over Inheritance in Practice
Here's what composition looks like with 140 agents:
```python
import dataclasses

# Mixins for cross-cutting concerns
class SerializableMixin:
    def to_dict(self) -> dict:
        return dataclasses.asdict(self)

    @classmethod
    def from_dict(cls, data: dict):
        return cls(**data)

class CallbackMixin:
    def __init__(self):
        self._callbacks: dict[str, list] = {}

    def on(self, event: str, callback):
        self._callbacks.setdefault(event, []).append(callback)

    def emit(self, event: str, *args):
        for cb in self._callbacks.get(event, []):
            cb(*args)

# Components compose these behaviors
class BaseComponent(SerializableMixin, CallbackMixin):
    """Base for all pipeline components — no inheritance chain"""
    pass
```
No abstract base classes. No 5-level inheritance trees. Just composition of small, focused behaviors.
Governance: The Missing Layer in Every AI Framework
The 4-Level Decision Model
```
# Decision authority levels
OPERATIONAL: Agent decides automatically (status checks, log entries)
TACTICAL:    Org CEO decides (feature implementation, content publishing)
STRATEGIC:   Chairman + human CONFIRM (architecture changes, org restructuring)
CRITICAL:    Human HALT (security breach, data loss, credential exposure)
```
The most important rule: Security HALT — the Security org can stop every other org instantly. This overrides everything, including revenue priorities.
Auto-Approval System
Not every decision needs human input. The system classifies risk automatically:
```python
# Auto-approval routing
approval_rules = {
    "status_check": "AUTO",        # Low risk, reversible → execute silently
    "content_publish": "NOTIFY",   # Medium risk → execute, summarize daily
    "pricing_change": "CONFIRM",   # High risk → require human approval
    "security_breach": "HALT",     # Critical → block everything immediately
}
```
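A minimal sketch of how such a routing table might be applied (the rule names come from the table above; the dispatch function and its fail-safe default are my illustration, not AEGIS internals):

```python
APPROVAL_RULES = {
    "status_check": "AUTO",
    "content_publish": "NOTIFY",
    "pricing_change": "CONFIRM",
    "security_breach": "HALT",
}

def route_decision(action: str) -> str:
    """Map an action to its approval level; unknown actions escalate to a human."""
    level = APPROVAL_RULES.get(action, "CONFIRM")  # safe default: require approval
    if level == "HALT":
        # HALT is not a normal return path: everything stops
        raise RuntimeError(f"HALT: {action} blocks all orgs, escalate to human")
    return level
```

The key design choice is the default: an action the table has never seen should require confirmation, not execute silently.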
This is what's missing from CrewAI, LangGraph, and AutoGen. When you have 140 agents, you need governance. Without it, agents propose conflicting strategies, make contradictory decisions, and nobody knows who has authority.
The Pipeline: 6-Stage Relay Processing
Every query passes through 6 organizations in sequence:
[Market Intelligence] → [Strategy (Go/No-Go)] → [Product] → [Technology] → [Execution] → [Validation]
Each stage is protected by an independent circuit breaker. If one breaks, the others keep running.
```python
import time

class CircuitOpenError(Exception):
    pass

class StageCircuitBreaker:
    """Independent circuit breaker per pipeline stage"""
    def __init__(self, stage: str = "", failure_threshold=3, cooldown=60):
        self.stage = stage
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.state = "closed"  # closed → open → half_open
        self.failure_count = 0
        self.last_failure = 0.0

    def call(self, func, *args):
        if self.state == "open":
            if time.time() - self.last_failure > self.cooldown:
                self.state = "half_open"  # Allow retry
            else:
                raise CircuitOpenError(f"Circuit open for {self.stage}")
        try:
            result = func(*args)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        self.failure_count = 0
        self.state = "closed"

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = "open"
```
If the pipeline fails mid-way, you can resume from the last completed stage:
```shell
# First run (fails at Stage 3)
make org-pipeline QUERY="AI marketplace"
# → Stage 1 PASS, Stage 2 PASS, Stage 3 FAIL

# Resume from checkpoint
python3 orchestrator.py --resume run_20260327_143022
# → Stage 1 (skip), Stage 2 (skip), Stage 3 → 6 (re-run)
```
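The resume behavior above can be sketched roughly like this (the file layout, stage names, and function signatures are my assumptions, not the actual orchestrator):

```python
import json
from pathlib import Path

STAGES = ["market", "strategy", "product", "technology", "execution", "validation"]

def load_checkpoint(run_id: str, base: Path = Path("runs")) -> dict:
    """Return previously completed stage results, or {} on a first run."""
    path = base / run_id / "checkpoint.json"
    return json.loads(path.read_text()) if path.exists() else {}

def run_pipeline(run_id: str, stage_fns: dict, base: Path = Path("runs")) -> dict:
    done = load_checkpoint(run_id, base)
    out = base / run_id
    out.mkdir(parents=True, exist_ok=True)
    for stage in STAGES:
        if stage in done:
            continue  # skip stages that already passed
        done[stage] = stage_fns[stage]()  # may raise; checkpoint keeps progress
        (out / "checkpoint.json").write_text(json.dumps(done))
    return done
```

Because the checkpoint is written after every stage, a crash at Stage 3 leaves Stages 1 and 2 on disk, and the next run starts from Stage 3.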
Writing 367 Tests in One Day with Parallel Agents
Nobody enjoys writing tests. But with 140 agents, running without tests is suicide.
The Problem
Early on, we had only 24 tests. A bug in pipeline_resilience.py would only surface when production broke.
The Solution: Parallel Agent Test Generation
Claude Code has an Agent tool for spawning parallel workers. I used it to write tests with 5 agents simultaneously:
```
Agent 1: Schema validation tests    → 82 tests (JSON parsing, all schemas)
Agent 2: Pipeline resilience tests  → 22 tests (circuit breaker, retry, health)
Agent 3: Stage 3-6 unit tests       → 45 tests (input/output per stage)
Agent 4: MCP tool tests             → 34 tests (all 8 MCP tools covered)
Agent 5: E2E integration tests      → 41 tests (full pipeline integration)
```
Result: 338 tests in one day. Subsequent features brought the total to 367.
Why Parallelism Works for Tests
Test files have minimal interdependencies. Agent 1 writing schema tests doesn't conflict with Agent 4 writing MCP tests. If I asked 5 agents to write the same article, they'd collide (same file). Match task structure to parallelism — that's the key insight.
Test execution results:
```
test_schemas.py             — 82 tests
test_pipeline_resilience    — 22 tests
test_stages.py              — 45 tests
test_mcp_tools.py           — 34 tests
test_parallel_pipeline      — 14 tests
test_e2e_pipeline.py        — 41 tests
test_browser_agent          — 68 tests (SSRF defense, URL validation)
+ additional tests          — 61 tests
─────────────────────────────────────
Total                         367 test methods
```
Full test run time: ~9 seconds
Adding Revenue Specialist Agents
AEGIS started with 136 agents but zero revenue. A technically beautiful system that can't sustain itself is a hobby, not a business.
The Root Problem
The generic revenue_ceo could talk strategy but didn't know platform-specific tactics. "Sell on Coconala" is useless advice without understanding the search algorithm, pricing norms, or review mechanics.
4 Revenue Specialist Agents
```yaml
# Revenue org additions
coconala_specialist:
  focus: "Coconala listing optimization, pricing, search algorithm, review acquisition"
  # Coconala search ranks by: favorites × sales × reviews
  # Strategy: dump pricing initially to build track record

gumroad_specialist:
  focus: "Gumroad product design, 3-tier pricing, external traffic, email optimization"
  # Getting on Gumroad Discover = organic traffic
  # 3-tier pricing: $9.99 / $24.99 / $49.99

# Education org additions
menta_specialist:
  focus: "MENTA plan design, niche positioning, retention optimization"
  # Position: "Claude Code × solopreneur" — ultra-niche
  # Free consultation → monthly subscription conversion

udemy_specialist:
  focus: "Udemy course design, bestseller strategy, self-promotion 97% revenue"
  # Self-referral links keep 97% of revenue
  # Target niche keywords for search visibility
```
Key decision: don't add knowledge to a generalist — create separate specialists. Coconala's pricing strategy and Gumroad's pricing strategy are fundamentally different. Each platform has its own rules.
Agent Prompt Design: The 3-Layer Architecture
Shared Protocols (Injected into All 140 Agents)
```
# _shared_protocols.md (every agent gets this)
- Constitutional compliance: manifesto violation = HALT
- Confidence disclosure: 0.9+ → proceed, 0.7-0.89 → note uncertainty, <0.5 → don't present as fact
- Anti-hallucination: verify file existence before reference, cite sources for metrics
- Security: hardcoded secret = HALT
```
Individual Agent Specialization
```
# pentester.prompt (example)
## ETHICAL GUARDRAILS (absolute)
- Test only authorized systems
- No destructive actions — stop at vulnerability confirmation
- No DoS, no data exfiltration
- Include remediation for every finding
```
The 3-Layer Structure
```
Layer 1: _shared_protocols.md (all agents — constitution)
Layer 2: org_agents.yaml (org level — authority, constraints, KPIs)
Layer 3: <agent_name>.prompt (individual — expertise, prohibitions)
```
Why 3 layers? I started with 6. Result: LLMs ignored 30% of the rules. The deeper the layer, the weaker the enforcement. 3 layers is the sweet spot.
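One way to picture the 3-layer assembly at prompt-build time (the paths follow the repository layout described later in this article; the loader function itself is my sketch, not the actual AEGIS loader):

```python
from pathlib import Path

def build_system_prompt(agent: str, org: str, base: Path = Path(".")) -> str:
    """Concatenate constitution → org constraints → individual prompt.

    Earlier layers carry more authority; later layers are more specific.
    """
    layers = [
        base / "prompts" / "_shared_protocols.md",  # Layer 1: constitution
        base / "orgs" / org / "org_agents.yaml",    # Layer 2: org authority/KPIs
        base / "orgs" / org / f"{agent}.prompt",    # Layer 3: individual expertise
    ]
    return "\n\n".join(p.read_text() for p in layers if p.exists())
```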
CrewAI / LangGraph / AutoGen vs. AEGIS
"Why not just use CrewAI or AutoGen?" I tried all of them. Here's the honest comparison:
| Aspect | CrewAI | LangGraph | AutoGen | AEGIS |
|---|---|---|---|---|
| Agent definition | Python class | Graph node | ConversableAgent | Markdown prompts |
| Governance | None | None | None | 4-level decisions + constitution |
| Safety stop | None | None | None | Security HALT (immediate) |
| Scale ceiling | ~10 agents | ~20 nodes | ~10 agents | 140 (on-demand loading) |
| LLM cost | All cloud | All cloud | All cloud | 90% local ($0) |
| Testability | Low | Medium | Low | High (Protocol abstraction) |
| Learning curve | Low | High | Medium | High |
The Governance Gap
With 10 CrewAI agents, everyone speaks equally. There's no mechanism to stop a proposal with security risks. In AEGIS, security_ceo can halt all orgs instantly.
```python
# CrewAI approach
crew = Crew(agents=[dev, reviewer, deployer])
crew.kickoff()  # → Who makes the final call? Security?

# AEGIS approach
# security_ceo issues HALT → all orgs stop → escalate to human
# 14-Day Revenue Rule < Security HALT (explicit priority ordering)
```
The Prompt Management Problem
In CrewAI and AutoGen, agent prompts live inside Python code. Managing 140 prompts inside Python files is hell. In AEGIS, every prompt is an independent .prompt file. Non-engineers can edit prompts too.
The Cost Problem
Other frameworks assume cloud LLM APIs. AEGIS defaults to Ollama (local LLM) and processes 90% of daily work at $0.
```
# LLM routing
OPERATIONAL: qwen2.5:14b (Ollama, local, $0)
TACTICAL:    Claude Sonnet 4.6 (cloud)
STRATEGIC:   Claude Opus 4.6 (cloud)
```
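A minimal sketch of that decision-level routing (only the tier mapping comes from the table above; the function, model identifiers, and the "default to local" fallback are my assumptions):

```python
# Decision level → (provider, model); hypothetical identifiers
ROUTING = {
    "OPERATIONAL": ("ollama", "qwen2.5:14b"),    # local, $0
    "TACTICAL": ("anthropic", "claude-sonnet"),  # cloud
    "STRATEGIC": ("anthropic", "claude-opus"),   # cloud
}

def pick_model(level: str) -> tuple[str, str]:
    """Route a decision level to a provider/model pair.

    Unknown levels fall back to the free local tier, so a typo in a
    config file can never silently escalate to a paid cloud call.
    """
    return ROUTING.get(level, ROUTING["OPERATIONAL"])
```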
Honest Conclusion
For 5 or fewer agents, CrewAI is enough. Low learning curve, quick results.
For 10+ agents with governance needs — existing frameworks fall short. You need custom design like AEGIS. This isn't "AEGIS is better" — it's "the problem scale is different."
Cost Optimization: Running at $0/month
```
# Adapter Pattern for gradual migration
LLM:     Ollama ($0)     → Claude API (paid)  — switch via env var
DB:      SQLite ($0)     → PostgreSQL (paid)  — swap adapter
Storage: Filesystem ($0) → S3/R2 (paid)       — swap adapter
Cache:   In-memory ($0)  → Redis (paid)       — swap adapter
TTS:     Edge TTS ($0)   → ElevenLabs (paid)  — swap adapter
```
90% of daily operations run on local LLM. M1 Max 64GB handles it comfortably. Cloud is reserved for important decisions only.
```shell
# Minimal startup (no Docker required)
make dev-minimal
# → AEGIS OS pipeline runs on Ollama alone
```
Why This Matters for Solo Developers
If you're a solopreneur building AI tools, cloud API costs compound fast. At 140 agents making decisions throughout the day, even cheap models add up. The Adapter Pattern lets you start at $0 and upgrade selectively:
```python
# Environment-based LLM switching
import os

LLM_PROVIDER = os.environ.get("LLM_PROVIDER", "ollama")

if LLM_PROVIDER == "ollama":
    client = OllamaClient(base_url="http://localhost:11434")
elif LLM_PROVIDER == "anthropic":
    client = AnthropicClient(api_key=os.environ.get("ANTHROPIC_API_KEY", ""))
```
Failures and Lessons
Failure 1: "Write Rules and They'll Follow" Is a Fantasy
I wrote a 200-page rulebook. Result: LLMs ignored 30% of the rules.
Fix: Reduced from 6 layers to 3. Emphasized only critical rules. "Don't do X" is more effective than "Do Y" for LLMs.
Failure 2: Too Many Agents
Initially, I believed "more agents = higher quality." Reality:
- Communication overhead exploded
- Context windows consumed just by loading configurations
- 40% of agents were never used
Fix: Instead of deleting unused agents, switched to on-demand loading. Only 20-30 agents run constantly. The rest sleep until needed.
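On-demand loading can be sketched roughly like this (the registry class is my illustration; only the "register all, load on first use" idea comes from the fix above):

```python
class AgentRegistry:
    """Register all agent prompt paths up front; read a prompt only on first use.

    Startup cost is a dict of paths, not 140 prompts in the context window.
    """
    def __init__(self, prompt_paths: dict[str, str]):
        self._paths = prompt_paths        # all agents known, none loaded
        self._loaded: dict[str, str] = {}  # lazily populated cache

    def get(self, name: str) -> str:
        if name not in self._loaded:
            # Only now pay the cost of reading (and later injecting) the prompt
            with open(self._paths[name]) as f:
                self._loaded[name] = f.read()
        return self._loaded[name]

    @property
    def resident(self) -> int:
        """How many agents are actually loaded right now."""
        return len(self._loaded)
```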
Failure 3: Zero Revenue
30+ articles published. Gumroad products created. Zero sales.
Root cause: zero traffic. Great content that nobody reads doesn't sell.
Lesson: Content generation AI automates "writing" but can't automate "getting read." Distribution and marketing are the real bottlenecks.
Failure 4: Token Waste
Every conversation burned ~6,000 tokens just loading rules. Usable context was severely limited.
Fix: Compressed all config files by 69% (1,609 → 494 lines). Prompts compressed by 29% (22,262 → 15,814 lines). Zero information loss.
Failure 5: Docs vs. Reality Drift
Documentation said "146 agents" but reality was 136. User Testing org changed from 24 to 21 without updating docs.
Fix: Script that auto-counts from agents.yaml. Documentation-implementation mismatches are now auto-detected. Manual tracking always breaks.
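A sketch of such a consistency check (the agents.yaml schema and the "N agents" phrasing it scans for are my assumptions; the real script may differ):

```python
import re

def count_agents(yaml_text: str) -> int:
    """Count agent entries by their '- name:' keys (assumed schema)."""
    return len(re.findall(r"^\s*-\s*name:", yaml_text, flags=re.MULTILINE))

def docs_match(doc_text: str, actual: int) -> bool:
    """True only if every 'N agents' claim in the docs equals the YAML count."""
    claims = [int(n) for n in re.findall(r"(\d+)\s+agents", doc_text)]
    return all(n == actual for n in claims)
```

Run it in CI and the "docs say 146, reality is 136" class of drift fails the build instead of surviving for months.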
By the Numbers
| Metric | Value |
|---|---|
| Total agents | 140 (135 org + 5 holdings) |
| Organizations | 18 orgs across 4 boards |
| Python LOC | 135,000+ |
| Tests | 367 methods |
| Test runtime | ~9 seconds |
| Prompt files | 139+ files |
| Context consumption | ~2,000 tokens/conversation (after compression) |
| LLM cost (local) | $0/month |
| Revenue specialist agents | 4 (coconala, gumroad, menta, udemy) |
Tech Stack
- Engine design: Protocol + Composition + DI (Python)
- Agent definitions: Markdown prompts (.prompt) × 140
- Pipeline: Python (asyncio + ThreadPoolExecutor)
- LLM: Ollama (qwen2.5:14b) + Claude API (fallback)
- Config: YAML (ai_config.yaml, agents.yaml)
- Tests: pytest (367 tests, ~9s)
- Search: SearXNG (self-hosted)
- MCP: FastMCP (8 tools)
- API: FastAPI (20K LOC, 12 routers)
- UI: Vanilla ES6 modules (12K LOC)
- CI: GitHub Actions + detect-secrets + pip audit
How to Reproduce This
You don't need to build 140 agents from day one. Start with the pattern:
Step 1: Define Your First 3 Agents as Prompts
```
# prompts/code_reviewer.prompt
You are a senior code reviewer. Focus on:
- Security vulnerabilities (OWASP Top 10)
- Performance anti-patterns
- Maintainability concerns
Never approve code with hardcoded secrets.
```

```
# prompts/content_writer.prompt
You are a technical content writer. Focus on:
- Developer audience (practical, code-heavy)
- SEO-optimized titles and structure
- Include working code examples in every section
Never publish without proofreading for factual accuracy.
```

```
# prompts/security_auditor.prompt
You are a security auditor with HALT authority.
You can stop any deployment if you find:
- Hardcoded credentials
- SQL injection vectors
- Missing authentication checks
Your HALT overrides all other priorities.
```
Step 2: Add Protocol-Based Routing
```python
from typing import Protocol

class AgentProtocol(Protocol):
    def process(self, query: str, context: dict) -> dict: ...

class PromptAgent:
    def __init__(self, name: str, prompt_path: str, llm_client):
        self.name = name
        with open(prompt_path) as f:
            self.prompt = f.read()
        self.llm = llm_client

    def process(self, query: str, context: dict) -> dict:
        response = self.llm.complete(
            system=self.prompt,
            user=query,
            context=context,
        )
        return {"agent": self.name, "result": response}

# Route by intent
router = {
    "review": PromptAgent("code_reviewer", "prompts/code_reviewer.prompt", llm),
    "write": PromptAgent("content_writer", "prompts/content_writer.prompt", llm),
    "audit": PromptAgent("security_auditor", "prompts/security_auditor.prompt", llm),
}
```
Step 3: Add Governance When You Hit 10+ Agents
That's when you need decision levels, security halt, and auto-approval. Not before.
Repository Structure
```
aegis/
├── engine/                 # Engine (98K LOC)
│   ├── organizations/      # AEGIS OS pipeline (10K LOC)
│   ├── common/             # DI, circuit breaker, protocols
│   └── external/           # External integrations
├── orgs/                   # 18 organizations × agent prompts
│   ├── revenue/            # Revenue org (11 agents)
│   ├── security/           # Security org (6 agents)
│   └── ...
├── holdings/               # Governance (Secretary, 4 Chairmen)
├── prompts/                # Shared protocols
├── config/organizations/   # LLM configuration
├── services/api/           # FastAPI (20K LOC)
└── ui/public/              # Vanilla ES6 UI (12K LOC)
```
```shell
# Minimal startup
make dev-minimal  # Ollama only, no Docker required

# Run the pipeline
make org-pipeline QUERY="AI agent marketplace"

# Run tests
python3 -m pytest engine/organizations/tests/ -v
# → 367 passed in ~9s
```
Key Takeaways
- Specialization works for LLMs — A team of specialist AIs outperforms one generalist
- Governance is mandatory — Uncontrolled AI agents will conflict and hallucinate without authority structures
- Tests are everything — 367 tests are the only reason I dare touch the codebase
- Cost can be zero — Local LLMs handle 90% of daily operations
- Framework choice < Design principles — If CrewAI/AutoGen isn't enough, build custom. But for 5 agents, CrewAI is fine
- Building isn't enough — The biggest failure was assuming "build it and they will come"
- Platform specialists beat generalists — Coconala tactics and Udemy tactics are completely different
Get the Full Blueprints
If you want to implement a similar system, I've published comprehensive resources:
- 56 AI Agent Prompts Pack ($24.99) — Copy-paste ready prompts for code review, security audit, content writing, revenue optimization, and more. All battle-tested across 140 agents.
- Full Design Blueprints (note.com) — Complete governance rules, prompt templates, cost optimization configs, and a reusable starter kit.
Questions or feedback? Drop a comment. I read every single one.
This article was routed through AEGIS Secretary to the Content Board for writing. The article itself is AEGIS output.