TAKUYA HIRATA
I Gave 140 AI Agents a Constitution and a Kill Switch

TL;DR

I built 140 AI agents on top of Claude Code, organized into 4 boards, 18 organizations, with a constitution, security halt authority, and autonomous decision-making. 367 tests, 135K lines of Python, and it actually runs. This article covers the design philosophy, technical choices, and spectacular failures — nothing held back.


Why 140 Agents?

Using ChatGPT or Claude as a single assistant, I noticed something: AI performs dramatically better as a team of specialists than as one generalist.

Security reviews, article writing, video production, code reviews, tax processing — cramming all of this into one prompt makes everything mediocre. Just like human organizations, specialization + governance is the key to quality.

Operator (human) — final authority
    |
Secretary (/ask) — intent parsing → routing
    |
4 Domain Boards — each Chairman owns strategic decisions
    |
    ├── App Board → Product(10), Design(6), Operations(6), Security(6), LLM(8)
    ├── Game Board → Game Design(4), Engineering(3), Creative(3)
    ├── Content Board → Content(6), Revenue(11), Marketing(6), Creative(13), Education(7)
    └── Shared Board → Backoffice(7), Research(7), Oracle(3), Autonomous(8), User Testing(21)

Total: 135 org agents + 5 Holdings = 140 agents

This structure wasn't built to look impressive. Specialized agents consistently outperformed generalist ones — that's the evidence that led to this design.


Architecture: Protocol-Driven Composition

The AEGIS engine layer (98K LOC) uses Protocol + Composition + DI — a strategic choice to avoid inheritance hell.

Why Protocols?

# DON'T: Inheritance-based (leads to pain)
from abc import ABC, abstractmethod

class BaseAgent(ABC):
    @abstractmethod
    def execute(self): ...

class SecurityAgent(BaseAgent):
    def execute(self): ...

class SecurityPentester(SecurityAgent):  # Multi-level inheritance → hell
    def execute(self): ...
# DO: Protocol + Composition (AEGIS actual pattern)
from typing import Protocol

class WorkflowEngineProtocol(Protocol):
    """Interface definition only — implementation is free"""
    def execute_workflow(self, workflow_id: str, context: dict) -> dict: ...
    def get_status(self, execution_id: str) -> dict: ...

class LangGraphEngine:
    """Protocol-compliant implementation A"""
    def execute_workflow(self, workflow_id, context):
        return self._langgraph_execute(workflow_id, context)

class NativeEngine:
    """Protocol-compliant implementation B — no LangGraph needed"""
    def execute_workflow(self, workflow_id, context):
        return self._native_execute(workflow_id, context)

# Injected via DI — switchable at runtime
container.register(WorkflowEngineProtocol, NativeEngine, Lifetime.SINGLETON)

The advantage: mocking is trivial in tests. External dependencies are abstracted behind Protocols, so you can test the entire pipeline without an LLM.
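As a concrete sketch: any object with matching method signatures satisfies the Protocol, so a test double needs no LLM, no network, and no shared base class. The class and function names below are illustrative, not the actual AEGIS code:

```python
from typing import Protocol

class WorkflowEngineProtocol(Protocol):
    def execute_workflow(self, workflow_id: str, context: dict) -> dict: ...

class StubEngine:
    """Test double: satisfies the Protocol structurally, no LLM or network."""
    def execute_workflow(self, workflow_id: str, context: dict) -> dict:
        return {"workflow_id": workflow_id, "status": "completed", "output": context}

def run_pipeline(engine: WorkflowEngineProtocol, workflow_id: str) -> dict:
    # Production would inject a real engine here; tests inject the stub
    return engine.execute_workflow(workflow_id, {"dry_run": True})

result = run_pipeline(StubEngine(), "wf-001")
# result carries status "completed" with zero external calls made
```

Because the pipeline only depends on the Protocol, swapping the stub for a real engine is a DI registration change, not a code change.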

Three Layers, One Design Principle Each

Layer   LOC   Pattern                          Design Principle
Engine  98K   Protocol + Composition + DI      Extensibility through abstraction
API     20K   FastAPI pragmatic monolith       Thin Router → Service → ORM
UI      12K   Vanilla ES6 functional modules   Named exports, module closures

One design principle per layer. This is critical. Initially, I tried applying DDD to every layer and failed. In practice, the API layer didn't need DDD — Pragmatic Layered Architecture was sufficient.

Composition Over Inheritance in Practice

Here's what composition looks like with 140 agents:

# Mixins for cross-cutting concerns
class SerializableMixin:
    def to_dict(self) -> dict:
        return dataclasses.asdict(self)

    @classmethod
    def from_dict(cls, data: dict):
        return cls(**data)

class CallbackMixin:
    def __init__(self):
        self._callbacks: dict[str, list] = {}

    def on(self, event: str, callback):
        self._callbacks.setdefault(event, []).append(callback)

    def emit(self, event: str, *args):
        for cb in self._callbacks.get(event, []):
            cb(*args)

# Components compose these behaviors
class BaseComponent(SerializableMixin, CallbackMixin):
    """Base for all pipeline components — no inheritance chain"""
    pass

No abstract base classes. No 5-level inheritance trees. Just composition of small, focused behaviors.
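Here is a minimal, self-contained illustration of that composition style in action. The PipelineStage class is hypothetical, not from the AEGIS codebase:

```python
class CallbackMixin:
    """Small, focused behavior: event subscription and emission."""
    def __init__(self):
        self._callbacks: dict[str, list] = {}

    def on(self, event: str, callback):
        self._callbacks.setdefault(event, []).append(callback)

    def emit(self, event: str, *args):
        for cb in self._callbacks.get(event, []):
            cb(*args)

class PipelineStage(CallbackMixin):
    """Composes callback behavior instead of inheriting from a deep base class."""
    def __init__(self, name: str):
        super().__init__()
        self.name = name

    def run(self):
        self.emit("started", self.name)
        # ... actual stage work would happen here ...
        self.emit("finished", self.name)

events = []
stage = PipelineStage("validation")
stage.on("started", lambda name: events.append(f"start:{name}"))
stage.on("finished", lambda name: events.append(f"done:{name}"))
stage.run()
# events now records both lifecycle notifications in order
```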


Governance: The Missing Layer in Every AI Framework

The 4-Level Decision Model

# Decision authority levels
OPERATIONAL: Agent decides automatically (status checks, log entries)
TACTICAL:    Org CEO decides (feature implementation, content publishing)
STRATEGIC:   Chairman + human CONFIRM (architecture changes, org restructuring)
CRITICAL:    Human HALT (security breach, data loss, credential exposure)

The most important rule: Security HALT — the Security org can stop every other org instantly. This overrides everything, including revenue priorities.

Auto-Approval System

Not every decision needs human input. The system classifies risk automatically:

# Auto-approval routing
approval_rules = {
    "status_check":    "AUTO",     # Low risk, reversible → execute silently
    "content_publish": "NOTIFY",   # Medium risk → execute, summarize daily
    "pricing_change":  "CONFIRM",  # High risk → require human approval
    "security_breach": "HALT",     # Critical → block everything immediately
}

This is what's missing from CrewAI, LangGraph, and AutoGen. When you have 140 agents, you need governance. Without it, agents propose conflicting strategies, make contradictory decisions, and nobody knows who has authority.
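A minimal sketch of how such a risk classifier might look. The fail-closed default for unknown actions is my assumption, not confirmed AEGIS behavior:

```python
APPROVAL_RULES = {
    "status_check":    "AUTO",     # Low risk, reversible
    "content_publish": "NOTIFY",   # Medium risk, summarized daily
    "pricing_change":  "CONFIRM",  # High risk, human approves
    "security_breach": "HALT",     # Critical, block everything
}

def route_action(action: str) -> str:
    """Unknown actions fail closed: require human confirmation, never auto-run."""
    return APPROVAL_RULES.get(action, "CONFIRM")

# Known low-risk action runs silently; anything unrecognized goes to a human
assert route_action("status_check") == "AUTO"
assert route_action("delete_database") == "CONFIRM"
```

The fail-closed default is the important design choice: an agent inventing a new action type should never gain auto-approval by accident.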


The Pipeline: 6-Stage Relay Processing

Every query passes through 6 organizations in sequence:

[Market Intelligence] → [Strategy (Go/No-Go)] → [Product] → [Technology] → [Execution] → [Validation]

Each stage is protected by an independent circuit breaker. If one breaks, the others keep running.

import time

class CircuitOpenError(Exception): pass

class StageCircuitBreaker:
    """Independent circuit breaker per pipeline stage"""
    def __init__(self, stage, failure_threshold=3, cooldown=60):
        self.stage = stage
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.state = "closed"  # closed → open → half_open
        self.failure_count = 0
        self.last_failure = 0.0

    def call(self, func, *args):
        if self.state == "open":
            if time.time() - self.last_failure > self.cooldown:
                self.state = "half_open"  # Allow one probe request
            else:
                raise CircuitOpenError(f"Circuit open for {self.stage}")
        try:
            result = func(*args)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        self.state = "closed"
        self.failure_count = 0

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = "open"

If the pipeline fails mid-way, you can resume from the last completed stage:

# First run (fails at Stage 3)
make org-pipeline QUERY="AI marketplace"
# → Stage 1 PASS, Stage 2 PASS, Stage 3 FAIL

# Resume from checkpoint
python3 orchestrator.py --resume run_20260327_143022
# → Stage 1 (skip), Stage 2 (skip), Stage 3 → 6 (re-run)
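The checkpoint mechanics behind that resume can be sketched in a few lines. The file format and function names here are hypothetical:

```python
import json
import tempfile

STAGES = [1, 2, 3, 4, 5, 6]

def save_checkpoint(path: str, run_id: str, completed: list) -> None:
    """Persist which stages finished, so a crash loses at most one stage."""
    with open(path, "w") as f:
        json.dump({"run_id": run_id, "completed": completed}, f)

def resume_plan(path: str):
    """Split the pipeline into (skip, rerun) from the last checkpoint."""
    with open(path) as f:
        done = set(json.load(f)["completed"])
    skip = [s for s in STAGES if s in done]
    rerun = [s for s in STAGES if s not in done]
    return skip, rerun

# Simulate a run that failed at Stage 3
tmp = tempfile.NamedTemporaryFile(suffix=".json", delete=False)
tmp.close()
save_checkpoint(tmp.name, "run_20260327_143022", completed=[1, 2])
skip, rerun = resume_plan(tmp.name)
# Stages 1-2 are skipped; stages 3-6 re-run
```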

Writing 367 Tests in One Day with Parallel Agents

Nobody enjoys writing tests. But with 140 agents, running without tests is suicide.

The Problem

Early on, I had only 24 tests. A bug in pipeline_resilience.py would surface only when production broke.

The Solution: Parallel Agent Test Generation

Claude Code has an Agent tool for spawning parallel workers. I used it to write tests with 5 agents simultaneously:

Agent 1: Schema validation tests     → 82 tests (JSON parsing, all schemas)
Agent 2: Pipeline resilience tests   → 22 tests (circuit breaker, retry, health)
Agent 3: Stage 3-6 unit tests        → 45 tests (input/output per stage)
Agent 4: MCP tool tests              → 34 tests (all 8 MCP tools covered)
Agent 5: E2E integration tests       → 41 tests (full pipeline integration)

Result: 338 tests in one day. Subsequent features brought the total to 367.

Why Parallelism Works for Tests

Test files have minimal interdependencies. Agent 1 writing schema tests doesn't conflict with Agent 4 writing MCP tests. If I asked 5 agents to write the same article, they'd collide (same file). Match task structure to parallelism — that's the key insight.

Test execution results:
test_schemas.py          — 82 tests
test_pipeline_resilience — 22 tests
test_stages.py           — 45 tests
test_mcp_tools.py        — 34 tests
test_parallel_pipeline   — 14 tests
test_e2e_pipeline.py     — 41 tests
test_browser_agent       — 68 tests (SSRF defense, URL validation)
+ additional tests        — 61 tests
─────────────────────────
Total                     367 test methods
Full test run time: ~9 seconds

Adding Revenue Specialist Agents

AEGIS started with 136 agents but zero revenue. A technically beautiful system that can't sustain itself is a hobby, not a business.

The Root Problem

The generic revenue_ceo could talk strategy but didn't know platform-specific tactics. "Sell on Coconala" is useless advice without understanding the search algorithm, pricing norms, or review mechanics.

4 Revenue Specialist Agents

# Revenue org additions
coconala_specialist:
  focus: "Coconala listing optimization, pricing, search algorithm, review acquisition"
  # Coconala search ranks by: favorites × sales × reviews
  # Strategy: dump pricing initially to build track record

gumroad_specialist:
  focus: "Gumroad product design, 3-tier pricing, external traffic, email optimization"
  # Getting on Gumroad Discover = organic traffic
  # 3-tier pricing: $9.99 / $24.99 / $49.99

# Education org additions
menta_specialist:
  focus: "MENTA plan design, niche positioning, retention optimization"
  # Position: "Claude Code × solopreneur" — ultra-niche
  # Free consultation → monthly subscription conversion

udemy_specialist:
  focus: "Udemy course design, bestseller strategy, self-promotion 97% revenue"
  # Self-referral links keep 97% of revenue
  # Target niche keywords for search visibility

Key decision: don't add knowledge to a generalist — create separate specialists. Coconala's pricing strategy and Gumroad's pricing strategy are fundamentally different. Each platform has its own rules.


Agent Prompt Design: The 3-Layer Architecture

Shared Protocols (Injected into All 140 Agents)

# _shared_protocols.md (every agent gets this)
- Constitutional compliance: manifesto violation = HALT
- Confidence disclosure: 0.9+ → proceed, 0.7-0.89 → note uncertainty, <0.5 → don't present as fact
- Anti-hallucination: verify file existence before reference, cite sources for metrics
- Security: hardcoded secret = HALT

Individual Agent Specialization

# pentester.prompt (example)
## ETHICAL GUARDRAILS (absolute)
- Test only authorized systems
- No destructive actions — stop at vulnerability confirmation
- No DoS, no data exfiltration
- Include remediation for every finding

The 3-Layer Structure

Layer 1: _shared_protocols.md  (all agents — constitution)
Layer 2: org_agents.yaml       (org level — authority, constraints, KPIs)
Layer 3: <agent_name>.prompt   (individual — expertise, prohibitions)

Why 3 layers? I started with 6. Result: LLMs ignored 30% of the rules. The deeper the layer, the weaker the enforcement. 3 layers is the sweet spot.
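A sketch of how the three layers might be assembled into one system prompt. The directory layout and file names here are illustrative, not the actual AEGIS paths:

```python
import tempfile
from pathlib import Path

def build_agent_prompt(prompts_dir: Path, org: str, agent: str) -> str:
    """Concatenate the 3 layers. Order matters: the constitution comes first,
    so it wins when a deeper layer conflicts with it."""
    layers = [
        prompts_dir / "_shared_protocols.md",       # Layer 1: constitution
        prompts_dir / org / "org_rules.md",         # Layer 2: org authority, KPIs
        prompts_dir / org / f"{agent}.prompt",      # Layer 3: individual expertise
    ]
    return "\n\n".join(p.read_text() for p in layers if p.exists())

# Demo with a throwaway directory
root = Path(tempfile.mkdtemp())
(root / "security").mkdir()
(root / "_shared_protocols.md").write_text("# Constitution")
(root / "security" / "org_rules.md").write_text("# Security org rules")
(root / "security" / "pentester.prompt").write_text("# Pentester role")

prompt = build_agent_prompt(root, "security", "pentester")
# The assembled prompt opens with the constitution, ends with the role
```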


CrewAI / LangGraph / AutoGen vs. AEGIS

"Why not just use CrewAI or AutoGen?" I tried all of them. Here's the honest comparison:

Aspect             CrewAI        LangGraph    AutoGen           AEGIS
Agent definition   Python class  Graph node   ConversableAgent  Markdown prompts
Governance         None          None         None              4-level decisions + constitution
Safety stop        None          None         None              Security HALT (immediate)
Scale ceiling      ~10 agents    ~20 nodes    ~10 agents        140 (on-demand loading)
LLM cost           All cloud     All cloud    All cloud         90% local ($0)
Testability        Low           Medium       Low               High (Protocol abstraction)
Learning curve     Low           High         Medium            High

The Governance Gap

With 10 CrewAI agents, everyone speaks equally. There's no mechanism to stop a proposal with security risks. In AEGIS, security_ceo can halt all orgs instantly.

# CrewAI approach
crew = Crew(agents=[dev, reviewer, deployer])
crew.kickoff()  # → Who makes the final call? Security?

# AEGIS approach
# security_ceo issues HALT → all orgs stop → escalate to human
# 14-Day Revenue Rule < Security HALT (explicit priority ordering)
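A runnable sketch of the HALT mechanism. Class names are hypothetical, and the real system presumably coordinates this across processes rather than in one object:

```python
class HaltError(Exception):
    pass

class Governor:
    """Central kill switch: security_ceo flips it, every org checks it."""
    def __init__(self):
        self.halted = False
        self.reason = ""

    def halt(self, reason: str):
        self.halted = True
        self.reason = reason

    def check(self):
        """Called by every org before acting; fails fast once halted."""
        if self.halted:
            raise HaltError(f"All orgs stopped: {self.reason}")

gov = Governor()
gov.check()  # normal operation: no-op
gov.halt("hardcoded credential detected")
try:
    gov.check()  # any org's next action now fails fast
    stopped = False
except HaltError:
    stopped = True
```

The point is that HALT is a state check on the shared path, not a message an agent can ignore.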

The Prompt Management Problem

In CrewAI and AutoGen, agent prompts live inside Python code. Managing 140 prompts inside Python files is hell. In AEGIS, every prompt is an independent .prompt file. Non-engineers can edit prompts too.

The Cost Problem

Other frameworks assume cloud LLM APIs. AEGIS defaults to Ollama (local LLM) and processes 90% of daily work at $0.

# LLM routing
OPERATIONAL: qwen2.5:14b (Ollama, local, $0)
TACTICAL:    Claude Sonnet 4.6 (cloud)
STRATEGIC:   Claude Opus 4.6 (cloud)
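As a sketch, that routing reduces to a small lookup by decision level. The cloud model identifiers below are placeholders, not real API model ids:

```python
def select_model(level: str) -> dict:
    """Route a decision level to an LLM provider.
    Cloud model names are illustrative placeholders."""
    routing = {
        "OPERATIONAL": {"provider": "ollama", "model": "qwen2.5:14b", "cost_usd": 0.0},
        "TACTICAL":    {"provider": "anthropic", "model": "claude-sonnet"},
        "STRATEGIC":   {"provider": "anthropic", "model": "claude-opus"},
    }
    return routing[level]

# 90% of traffic is OPERATIONAL, so most calls cost nothing
operational = select_model("OPERATIONAL")
```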

Honest Conclusion

For 5 or fewer agents, CrewAI is enough. Low learning curve, quick results.

For 10+ agents with governance needs — existing frameworks fall short. You need custom design like AEGIS. This isn't "AEGIS is better" — it's "the problem scale is different."


Cost Optimization: Running at $0/month

# Adapter Pattern for gradual migration
LLM:     Ollama($0) → Claude API(paid) — switch via env var
DB:      SQLite($0) → PostgreSQL(paid) — swap adapter
Storage: Filesystem($0) → S3/R2(paid) — swap adapter
Cache:   In-memory($0) → Redis(paid) — swap adapter
TTS:     Edge TTS($0) → ElevenLabs(paid) — swap adapter

90% of daily operations run on local LLM. M1 Max 64GB handles it comfortably. Cloud is reserved for important decisions only.

# Minimal startup (no Docker required)
make dev-minimal
# → AEGIS OS pipeline runs on Ollama alone

Why This Matters for Solo Developers

If you're a solopreneur building AI tools, cloud API costs compound fast. At 140 agents making decisions throughout the day, even cheap models add up. The Adapter Pattern lets you start at $0 and upgrade selectively:

# Environment-based LLM switching
import os

LLM_PROVIDER = os.environ.get("LLM_PROVIDER", "ollama")

if LLM_PROVIDER == "ollama":
    client = OllamaClient(base_url="http://localhost:11434")
elif LLM_PROVIDER == "anthropic":
    client = AnthropicClient(api_key=os.environ.get("ANTHROPIC_API_KEY", ""))

Failures and Lessons

Failure 1: "Write Rules and They'll Follow" Is a Fantasy

I wrote a 200-page rulebook. Result: LLMs ignored 30% of the rules.

Fix: Reduced from 6 layers to 3. Emphasized only critical rules. "Don't do X" is more effective than "Do Y" for LLMs.

Failure 2: Too Many Agents

Initially, I believed "more agents = higher quality." Reality:

  • Communication overhead exploded
  • Context windows consumed just by loading configurations
  • 40% of agents were never used

Fix: Instead of deleting unused agents, switched to on-demand loading. Only 20-30 agents run constantly. The rest sleep until needed.

Failure 3: Zero Revenue

30+ articles published. Gumroad products created. Zero sales.

Root cause: zero traffic. Great content that nobody reads doesn't sell.

Lesson: Content generation AI automates "writing" but can't automate "getting read." Distribution and marketing are the real bottlenecks.

Failure 4: Token Waste

Every conversation burned ~6,000 tokens just loading rules. Usable context was severely limited.

Fix: Compressed all config files by 69% (1,609 → 494 lines). Prompts compressed by 29% (22,262 → 15,814 lines). Zero information loss.

Failure 5: Docs vs. Reality Drift

Documentation said "146 agents" but reality was 136. User Testing org changed from 24 to 21 without updating docs.

Fix: Script that auto-counts from agents.yaml. Documentation-implementation mismatches are now auto-detected. Manual tracking always breaks.
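A sketch of that drift check, assuming agents.yaml parses into a dict of organizations. The actual schema is not shown in this article, so the structure below is a guess:

```python
def count_agents(config: dict) -> int:
    """The parsed agents.yaml is the single source of truth for the count."""
    return sum(len(org["agents"]) for org in config["organizations"].values())

def check_docs(config: dict, documented_count: int) -> None:
    """Fail CI when the documented count drifts from the config."""
    actual = count_agents(config)
    if actual != documented_count:
        raise SystemExit(f"Docs claim {documented_count} agents, "
                         f"agents.yaml defines {actual}")

# Toy config standing in for a parsed agents.yaml
config = {
    "organizations": {
        "security": {"agents": ["security_ceo", "pentester", "auditor"]},
        "revenue":  {"agents": ["revenue_ceo", "coconala_specialist"]},
    }
}
# count_agents(config) returns 5; check_docs(config, 5) passes silently
```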


By the Numbers

Metric                     Value
Total agents               140 (135 org + 5 holdings)
Organizations              18 orgs across 4 boards
Python LOC                 135,000+
Tests                      367 methods
Test runtime               ~9 seconds
Prompt files               139+ files
Context consumption        ~2,000 tokens/conversation (after compression)
LLM cost (local)           $0/month
Revenue specialist agents  4 (coconala, gumroad, menta, udemy)

Tech Stack

Engine design:      Protocol + Composition + DI (Python)
Agent definitions:  Markdown prompts (.prompt) x 140
Pipeline:           Python (asyncio + ThreadPoolExecutor)
LLM:                Ollama (qwen2.5:14b) + Claude API (fallback)
Config:             YAML (ai_config.yaml, agents.yaml)
Tests:              pytest (367 tests, ~9s)
Search:             SearXNG (self-hosted)
MCP:                FastMCP (8 tools)
API:                FastAPI (20K LOC, 12 routers)
UI:                 Vanilla ES6 modules (12K LOC)
CI:                 GitHub Actions + detect-secrets + pip-audit

How to Reproduce This

You don't need to build 140 agents from day one. Start with the pattern:

Step 1: Define Your First 3 Agents as Prompts

# prompts/code_reviewer.prompt
You are a senior code reviewer. Focus on:
- Security vulnerabilities (OWASP Top 10)
- Performance anti-patterns
- Maintainability concerns
Never approve code with hardcoded secrets.
# prompts/content_writer.prompt
You are a technical content writer. Focus on:
- Developer audience (practical, code-heavy)
- SEO-optimized titles and structure
- Include working code examples in every section
Never publish without proofreading for factual accuracy.
# prompts/security_auditor.prompt
You are a security auditor with HALT authority.
You can stop any deployment if you find:
- Hardcoded credentials
- SQL injection vectors
- Missing authentication checks
Your HALT overrides all other priorities.

Step 2: Add Protocol-Based Routing

from typing import Protocol

class AgentProtocol(Protocol):
    def process(self, query: str, context: dict) -> dict: ...

class PromptAgent:
    def __init__(self, name: str, prompt_path: str, llm_client):
        self.name = name
        with open(prompt_path) as f:
            self.prompt = f.read()
        self.llm = llm_client

    def process(self, query: str, context: dict) -> dict:
        response = self.llm.complete(
            system=self.prompt,
            user=query,
            context=context
        )
        return {"agent": self.name, "result": response}

# Route by intent
router = {
    "review": PromptAgent("reviewer", "prompts/code_reviewer.prompt", llm),
    "write":  PromptAgent("writer", "prompts/content_writer.prompt", llm),
    "audit":  PromptAgent("auditor", "prompts/security_auditor.prompt", llm),
}

Step 3: Add Governance When You Hit 10+ Agents

That's when you need decision levels, security halt, and auto-approval. Not before.
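Even the governance layer can start as a few lines. A hypothetical sketch of the 4-level model described earlier:

```python
# Decision levels in ascending order of risk
LEVELS = ["OPERATIONAL", "TACTICAL", "STRATEGIC", "CRITICAL"]

def requires_human(level: str) -> bool:
    """STRATEGIC needs human confirmation; CRITICAL halts everything."""
    return LEVELS.index(level) >= LEVELS.index("STRATEGIC")

# Agents and org CEOs handle the low levels on their own
assert not requires_human("OPERATIONAL")
assert not requires_human("TACTICAL")
# Humans stay in the loop for the high levels
assert requires_human("STRATEGIC")
assert requires_human("CRITICAL")
```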

Repository Structure

aegis/
├── engine/                      # Engine (98K LOC)
│   ├── organizations/           # AEGIS OS pipeline (10K LOC)
│   ├── common/                  # DI, circuit breaker, protocols
│   └── external/                # External integrations
├── orgs/                        # 18 organizations × agent prompts
│   ├── revenue/                 # Revenue org (11 agents)
│   ├── security/                # Security org (6 agents)
│   └── ...
├── holdings/                    # Governance (Secretary, 4 Chairmen)
├── prompts/                     # Shared protocols
├── config/organizations/        # LLM configuration
├── services/api/                # FastAPI (20K LOC)
└── ui/public/                   # Vanilla ES6 UI (12K LOC)

# Minimal startup
make dev-minimal  # Ollama only, no Docker required

# Run the pipeline
make org-pipeline QUERY="AI agent marketplace"

# Run tests
python3 -m pytest engine/organizations/tests/ -v
# → 367 passed in ~9s

Key Takeaways

  1. Specialization works for LLMs — A team of specialist AIs outperforms one generalist
  2. Governance is mandatory — Uncontrolled AI agents will conflict and hallucinate without authority structures
  3. Tests are everything — 367 tests are the only reason I dare touch the codebase
  4. Cost can be zero — Local LLMs handle 90% of daily operations
  5. Framework choice < Design principles — If CrewAI/AutoGen isn't enough, build custom. But for 5 agents, CrewAI is fine
  6. Building isn't enough — The biggest failure was assuming "build it and they will come"
  7. Platform specialists beat generalists — Coconala tactics and Udemy tactics are completely different

Get the Full Blueprints

If you want to implement a similar system, I've published comprehensive resources.

Questions or feedback? Drop a comment. I read every single one.


This article was routed through AEGIS Secretary to the Content Board for writing. The article itself is AEGIS output.
