TAKUYA HIRATA
I Gave 140 AI Agents a Constitution and a Kill Switch

TL;DR

I built 140 AI agents on top of Claude Code, organized into 4 boards, 18 organizations, with a constitution, security halt authority, and autonomous decision-making. 367 tests, 135K lines of Python, and it actually runs. This article covers the design philosophy, technical choices, and spectacular failures — nothing held back.


Why 140 Agents?

Using ChatGPT or Claude as a single assistant, I noticed something: AI performs dramatically better as a team of specialists than as one generalist.

Security reviews, article writing, video production, code reviews, tax processing — cramming all of this into one prompt makes everything mediocre. Just like human organizations, specialization + governance is the key to quality.

Operator (human) — final authority
    |
Secretary (/ask) — intent parsing → routing
    |
4 Domain Boards — each Chairman owns strategic decisions
    |
    ├── App Board → Product(10), Design(6), Operations(6), Security(6), LLM(8)
    ├── Game Board → Game Design(4), Engineering(3), Creative(3)
    ├── Content Board → Content(6), Revenue(11), Marketing(6), Creative(13), Education(7)
    └── Shared Board → Backoffice(7), Research(7), Oracle(3), Autonomous(8), User Testing(21)

Total: 135 org agents + 5 Holdings = 140 agents

This structure wasn't built to look impressive. Specialized agents consistently outperformed generalist ones — that's the evidence that led to this design.


Architecture: Protocol-Driven Composition

The AEGIS engine layer (98K LOC) uses Protocol + Composition + DI — a strategic choice to avoid inheritance hell.

Why Protocols?

# DON'T: Inheritance-based (leads to pain)
from abc import ABC, abstractmethod

class BaseAgent(ABC):
    @abstractmethod
    def execute(self): ...

class SecurityAgent(BaseAgent):
    def execute(self): ...

class SecurityPentester(SecurityAgent):  # Multi-level inheritance → hell
    def execute(self): ...
# DO: Protocol + Composition (AEGIS actual pattern)
from typing import Protocol

class WorkflowEngineProtocol(Protocol):
    """Interface definition only — implementation is free"""
    def execute_workflow(self, workflow_id: str, context: dict) -> dict: ...
    def get_status(self, execution_id: str) -> dict: ...

class LangGraphEngine:
    """Protocol-compliant implementation A"""
    def execute_workflow(self, workflow_id, context):
        return self._langgraph_execute(workflow_id, context)

class NativeEngine:
    """Protocol-compliant implementation B — no LangGraph needed"""
    def execute_workflow(self, workflow_id, context):
        return self._native_execute(workflow_id, context)

# Injected via DI — switchable at runtime
container.register(WorkflowEngineProtocol, NativeEngine, Lifetime.SINGLETON)

The advantage: mocking is trivial in tests. External dependencies are abstracted behind Protocols, so you can test the entire pipeline without an LLM.
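As a concrete sketch: any object with matching method signatures satisfies the Protocol, so a test double needs no LLM, no network, and no shared base class. The class and function names below are illustrative, not the actual AEGIS code:

```python
from typing import Protocol

class WorkflowEngineProtocol(Protocol):
    def execute_workflow(self, workflow_id: str, context: dict) -> dict: ...

class StubEngine:
    """Test double: satisfies the Protocol structurally, no LLM or network."""
    def execute_workflow(self, workflow_id: str, context: dict) -> dict:
        return {"workflow_id": workflow_id, "status": "completed", "output": context}

def run_pipeline(engine: WorkflowEngineProtocol, workflow_id: str) -> dict:
    # Production would inject a real engine here; tests inject the stub
    return engine.execute_workflow(workflow_id, {"dry_run": True})

result = run_pipeline(StubEngine(), "wf-001")
# result carries status "completed" with zero external calls made
```

Because the pipeline only depends on the Protocol, swapping the stub for a real engine is a DI registration change, not a code change.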

Three Layers, One Design Principle Each

Layer   LOC   Pattern                          Design Principle
Engine  98K   Protocol + Composition + DI      Extensibility through abstraction
API     20K   FastAPI pragmatic monolith       Thin Router → Service → ORM
UI      12K   Vanilla ES6 functional modules   Named exports, module closures

One design principle per layer. This is critical. Initially, I tried applying DDD to every layer and failed. In practice, the API layer didn't need DDD — Pragmatic Layered Architecture was sufficient.

Composition Over Inheritance in Practice

Here's what composition looks like with 140 agents:

# Mixins for cross-cutting concerns
class SerializableMixin:
    def to_dict(self) -> dict:
        return dataclasses.asdict(self)

    @classmethod
    def from_dict(cls, data: dict):
        return cls(**data)

class CallbackMixin:
    def __init__(self):
        self._callbacks: dict[str, list] = {}

    def on(self, event: str, callback):
        self._callbacks.setdefault(event, []).append(callback)

    def emit(self, event: str, *args):
        for cb in self._callbacks.get(event, []):
            cb(*args)

# Components compose these behaviors
class BaseComponent(SerializableMixin, CallbackMixin):
    """Base for all pipeline components — no inheritance chain"""
    pass

No abstract base classes. No 5-level inheritance trees. Just composition of small, focused behaviors.
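Here is a minimal, self-contained illustration of that composition style in action. The PipelineStage class is hypothetical, not from the AEGIS codebase:

```python
class CallbackMixin:
    """Small, focused behavior: event subscription and emission."""
    def __init__(self):
        self._callbacks: dict[str, list] = {}

    def on(self, event: str, callback):
        self._callbacks.setdefault(event, []).append(callback)

    def emit(self, event: str, *args):
        for cb in self._callbacks.get(event, []):
            cb(*args)

class PipelineStage(CallbackMixin):
    """Composes callback behavior instead of inheriting from a deep base class."""
    def __init__(self, name: str):
        super().__init__()
        self.name = name

    def run(self):
        self.emit("started", self.name)
        # ... actual stage work would happen here ...
        self.emit("finished", self.name)

events = []
stage = PipelineStage("validation")
stage.on("started", lambda name: events.append(f"start:{name}"))
stage.on("finished", lambda name: events.append(f"done:{name}"))
stage.run()
# events now records both lifecycle notifications in order
```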


Governance: The Missing Layer in Every AI Framework

The 4-Level Decision Model

# Decision authority levels
OPERATIONAL: Agent decides automatically (status checks, log entries)
TACTICAL:    Org CEO decides (feature implementation, content publishing)
STRATEGIC:   Chairman + human CONFIRM (architecture changes, org restructuring)
CRITICAL:    Human HALT (security breach, data loss, credential exposure)

The most important rule: Security HALT — the Security org can stop every other org instantly. This overrides everything, including revenue priorities.

Auto-Approval System

Not every decision needs human input. The system classifies risk automatically:

# Auto-approval routing
approval_rules = {
    "status_check":    "AUTO",     # Low risk, reversible → execute silently
    "content_publish": "NOTIFY",   # Medium risk → execute, summarize daily
    "pricing_change":  "CONFIRM",  # High risk → require human approval
    "security_breach": "HALT",     # Critical → block everything immediately
}

This is what's missing from CrewAI, LangGraph, and AutoGen. When you have 140 agents, you need governance. Without it, agents propose conflicting strategies, make contradictory decisions, and nobody knows who has authority.
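A minimal sketch of how such a risk classifier might look. The fail-closed default for unknown actions is my assumption, not confirmed AEGIS behavior:

```python
APPROVAL_RULES = {
    "status_check":    "AUTO",     # Low risk, reversible
    "content_publish": "NOTIFY",   # Medium risk, summarized daily
    "pricing_change":  "CONFIRM",  # High risk, human approves
    "security_breach": "HALT",     # Critical, block everything
}

def route_action(action: str) -> str:
    """Unknown actions fail closed: require human confirmation, never auto-run."""
    return APPROVAL_RULES.get(action, "CONFIRM")

# Known low-risk action runs silently; anything unrecognized goes to a human
assert route_action("status_check") == "AUTO"
assert route_action("delete_database") == "CONFIRM"
```

The fail-closed default is the important design choice: an agent inventing a new action type should never gain auto-approval by accident.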


The Pipeline: 6-Stage Relay Processing

Every query passes through 6 organizations in sequence:

[Market Intelligence] → [Strategy (Go/No-Go)] → [Product] → [Technology] → [Execution] → [Validation]

Each stage is protected by an independent circuit breaker. If one breaks, the others keep running.

import time

class CircuitOpenError(Exception): pass

class StageCircuitBreaker:
    """Independent circuit breaker per pipeline stage"""
    def __init__(self, stage, failure_threshold=3, cooldown=60):
        self.stage = stage
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.state = "closed"  # closed → open → half_open
        self.failure_count = 0
        self.last_failure = 0.0

    def call(self, func, *args):
        if self.state == "open":
            if time.time() - self.last_failure > self.cooldown:
                self.state = "half_open"  # Allow one probe request
            else:
                raise CircuitOpenError(f"Circuit open for {self.stage}")
        try:
            result = func(*args)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        self.state = "closed"
        self.failure_count = 0

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = "open"

If the pipeline fails mid-way, you can resume from the last completed stage:

# First run (fails at Stage 3)
make org-pipeline QUERY="AI marketplace"
# → Stage 1 PASS, Stage 2 PASS, Stage 3 FAIL

# Resume from checkpoint
python3 orchestrator.py --resume run_20260327_143022
# → Stage 1 (skip), Stage 2 (skip), Stage 3 → 6 (re-run)
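The checkpoint mechanics behind that resume can be sketched in a few lines. The file format and function names here are hypothetical:

```python
import json
import tempfile

STAGES = [1, 2, 3, 4, 5, 6]

def save_checkpoint(path: str, run_id: str, completed: list) -> None:
    """Persist which stages finished, so a crash loses at most one stage."""
    with open(path, "w") as f:
        json.dump({"run_id": run_id, "completed": completed}, f)

def resume_plan(path: str):
    """Split the pipeline into (skip, rerun) from the last checkpoint."""
    with open(path) as f:
        done = set(json.load(f)["completed"])
    skip = [s for s in STAGES if s in done]
    rerun = [s for s in STAGES if s not in done]
    return skip, rerun

# Simulate a run that failed at Stage 3
tmp = tempfile.NamedTemporaryFile(suffix=".json", delete=False)
tmp.close()
save_checkpoint(tmp.name, "run_20260327_143022", completed=[1, 2])
skip, rerun = resume_plan(tmp.name)
# Stages 1-2 are skipped; stages 3-6 re-run
```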

Writing 367 Tests in One Day with Parallel Agents

Nobody enjoys writing tests. But with 140 agents, running without tests is suicide.

The Problem

Early on, I had only 24 tests. A bug in pipeline_resilience.py would surface only when production broke.

The Solution: Parallel Agent Test Generation

Claude Code has an Agent tool for spawning parallel workers. I used it to write tests with 5 agents simultaneously:

Agent 1: Schema validation tests     → 82 tests (JSON parsing, all schemas)
Agent 2: Pipeline resilience tests   → 22 tests (circuit breaker, retry, health)
Agent 3: Stage 3-6 unit tests        → 45 tests (input/output per stage)
Agent 4: MCP tool tests              → 34 tests (all 8 MCP tools covered)
Agent 5: E2E integration tests       → 41 tests (full pipeline integration)

Result: 338 tests in one day. Subsequent features brought the total to 367.

Why Parallelism Works for Tests

Test files have minimal interdependencies. Agent 1 writing schema tests doesn't conflict with Agent 4 writing MCP tests. If I asked 5 agents to write the same article, they'd collide (same file). Match task structure to parallelism — that's the key insight.

Test execution results:
test_schemas.py          — 82 tests
test_pipeline_resilience — 22 tests
test_stages.py           — 45 tests
test_mcp_tools.py        — 34 tests
test_parallel_pipeline   — 14 tests
test_e2e_pipeline.py     — 41 tests
test_browser_agent       — 68 tests (SSRF defense, URL validation)
+ additional tests        — 61 tests
─────────────────────────
Total                     367 test methods
Full test run time: ~9 seconds

Adding Revenue Specialist Agents

AEGIS started with 136 agents but zero revenue. A technically beautiful system that can't sustain itself is a hobby, not a business.

The Root Problem

The generic revenue_ceo could talk strategy but didn't know platform-specific tactics. "Sell on Coconala" is useless advice without understanding the search algorithm, pricing norms, or review mechanics.

4 Revenue Specialist Agents

# Revenue org additions
coconala_specialist:
  focus: "Coconala listing optimization, pricing, search algorithm, review acquisition"
  # Coconala search ranks by: favorites × sales × reviews
  # Strategy: dump pricing initially to build track record

gumroad_specialist:
  focus: "Gumroad product design, 3-tier pricing, external traffic, email optimization"
  # Getting on Gumroad Discover = organic traffic
  # 3-tier pricing: $9.99 / $24.99 / $49.99

# Education org additions
menta_specialist:
  focus: "MENTA plan design, niche positioning, retention optimization"
  # Position: "Claude Code × solopreneur" — ultra-niche
  # Free consultation → monthly subscription conversion

udemy_specialist:
  focus: "Udemy course design, bestseller strategy, self-promotion 97% revenue"
  # Self-referral links keep 97% of revenue
  # Target niche keywords for search visibility

Key decision: don't add knowledge to a generalist — create separate specialists. Coconala's pricing strategy and Gumroad's pricing strategy are fundamentally different. Each platform has its own rules.


Agent Prompt Design: The 3-Layer Architecture

Shared Protocols (Injected into All 140 Agents)

# _shared_protocols.md (every agent gets this)
- Constitutional compliance: manifesto violation = HALT
- Confidence disclosure: 0.9+ → proceed, 0.7-0.89 → note uncertainty, <0.5 → don't present as fact
- Anti-hallucination: verify file existence before reference, cite sources for metrics
- Security: hardcoded secret = HALT

Individual Agent Specialization

# pentester.prompt (example)
## ETHICAL GUARDRAILS (absolute)
- Test only authorized systems
- No destructive actions — stop at vulnerability confirmation
- No DoS, no data exfiltration
- Include remediation for every finding

The 3-Layer Structure

Layer 1: _shared_protocols.md  (all agents — constitution)
Layer 2: org_agents.yaml       (org level — authority, constraints, KPIs)
Layer 3: <agent_name>.prompt   (individual — expertise, prohibitions)

Why 3 layers? I started with 6. Result: LLMs ignored 30% of the rules. The deeper the layer, the weaker the enforcement. 3 layers is the sweet spot.
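A sketch of how the three layers might be assembled into one system prompt. The directory layout and file names here are illustrative, not the actual AEGIS paths:

```python
import tempfile
from pathlib import Path

def build_agent_prompt(prompts_dir: Path, org: str, agent: str) -> str:
    """Concatenate the 3 layers. Order matters: the constitution comes first,
    so it wins when a deeper layer conflicts with it."""
    layers = [
        prompts_dir / "_shared_protocols.md",       # Layer 1: constitution
        prompts_dir / org / "org_rules.md",         # Layer 2: org authority, KPIs
        prompts_dir / org / f"{agent}.prompt",      # Layer 3: individual expertise
    ]
    return "\n\n".join(p.read_text() for p in layers if p.exists())

# Demo with a throwaway directory
root = Path(tempfile.mkdtemp())
(root / "security").mkdir()
(root / "_shared_protocols.md").write_text("# Constitution")
(root / "security" / "org_rules.md").write_text("# Security org rules")
(root / "security" / "pentester.prompt").write_text("# Pentester role")

prompt = build_agent_prompt(root, "security", "pentester")
# The assembled prompt opens with the constitution, ends with the role
```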


CrewAI / LangGraph / AutoGen vs. AEGIS

"Why not just use CrewAI or AutoGen?" I tried all of them. Here's the honest comparison:

Aspect             CrewAI        LangGraph    AutoGen           AEGIS
Agent definition   Python class  Graph node   ConversableAgent  Markdown prompts
Governance         None          None         None              4-level decisions + constitution
Safety stop        None          None         None              Security HALT (immediate)
Scale ceiling      ~10 agents    ~20 nodes    ~10 agents        140 (on-demand loading)
LLM cost           All cloud     All cloud    All cloud         90% local ($0)
Testability        Low           Medium       Low               High (Protocol abstraction)
Learning curve     Low           High         Medium            High

The Governance Gap

With 10 CrewAI agents, everyone speaks equally. There's no mechanism to stop a proposal with security risks. In AEGIS, security_ceo can halt all orgs instantly.

# CrewAI approach
crew = Crew(agents=[dev, reviewer, deployer])
crew.kickoff()  # → Who makes the final call? Security?

# AEGIS approach
# security_ceo issues HALT → all orgs stop → escalate to human
# 14-Day Revenue Rule < Security HALT (explicit priority ordering)
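A runnable sketch of the HALT mechanism. Class names are hypothetical, and the real system presumably coordinates this across processes rather than in one object:

```python
class HaltError(Exception):
    pass

class Governor:
    """Central kill switch: security_ceo flips it, every org checks it."""
    def __init__(self):
        self.halted = False
        self.reason = ""

    def halt(self, reason: str):
        self.halted = True
        self.reason = reason

    def check(self):
        """Called by every org before acting; fails fast once halted."""
        if self.halted:
            raise HaltError(f"All orgs stopped: {self.reason}")

gov = Governor()
gov.check()  # normal operation: no-op
gov.halt("hardcoded credential detected")
try:
    gov.check()  # any org's next action now fails fast
    stopped = False
except HaltError:
    stopped = True
```

The point is that HALT is a state check on the shared path, not a message an agent can ignore.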

The Prompt Management Problem

In CrewAI and AutoGen, agent prompts live inside Python code. Managing 140 prompts inside Python files is hell. In AEGIS, every prompt is an independent .prompt file. Non-engineers can edit prompts too.

The Cost Problem

Other frameworks assume cloud LLM APIs. AEGIS defaults to Ollama (local LLM) and processes 90% of daily work at $0.

# LLM routing
OPERATIONAL: qwen2.5:14b (Ollama, local, $0)
TACTICAL:    Claude Sonnet 4.6 (cloud)
STRATEGIC:   Claude Opus 4.6 (cloud)
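As a sketch, that routing reduces to a small lookup by decision level. The cloud model identifiers below are placeholders, not real API model ids:

```python
def select_model(level: str) -> dict:
    """Route a decision level to an LLM provider.
    Cloud model names are illustrative placeholders."""
    routing = {
        "OPERATIONAL": {"provider": "ollama", "model": "qwen2.5:14b", "cost_usd": 0.0},
        "TACTICAL":    {"provider": "anthropic", "model": "claude-sonnet"},
        "STRATEGIC":   {"provider": "anthropic", "model": "claude-opus"},
    }
    return routing[level]

# 90% of traffic is OPERATIONAL, so most calls cost nothing
operational = select_model("OPERATIONAL")
```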

Honest Conclusion

For 5 or fewer agents, CrewAI is enough. Low learning curve, quick results.

For 10+ agents with governance needs — existing frameworks fall short. You need custom design like AEGIS. This isn't "AEGIS is better" — it's "the problem scale is different."


Cost Optimization: Running at $0/month

# Adapter Pattern for gradual migration
LLM:     Ollama($0) → Claude API(paid) — switch via env var
DB:      SQLite($0) → PostgreSQL(paid) — swap adapter
Storage: Filesystem($0) → S3/R2(paid) — swap adapter
Cache:   In-memory($0) → Redis(paid) — swap adapter
TTS:     Edge TTS($0) → ElevenLabs(paid) — swap adapter

90% of daily operations run on local LLM. M1 Max 64GB handles it comfortably. Cloud is reserved for important decisions only.

# Minimal startup (no Docker required)
make dev-minimal
# → AEGIS OS pipeline runs on Ollama alone

Why This Matters for Solo Developers

If you're a solopreneur building AI tools, cloud API costs compound fast. At 140 agents making decisions throughout the day, even cheap models add up. The Adapter Pattern lets you start at $0 and upgrade selectively:

# Environment-based LLM switching
import os

LLM_PROVIDER = os.environ.get("LLM_PROVIDER", "ollama")

if LLM_PROVIDER == "ollama":
    client = OllamaClient(base_url="http://localhost:11434")
elif LLM_PROVIDER == "anthropic":
    client = AnthropicClient(api_key=os.environ.get("ANTHROPIC_API_KEY", ""))

Failures and Lessons

Failure 1: "Write Rules and They'll Follow" Is a Fantasy

I wrote a 200-page rulebook. Result: LLMs ignored 30% of the rules.

Fix: Reduced from 6 layers to 3. Emphasized only critical rules. "Don't do X" is more effective than "Do Y" for LLMs.

Failure 2: Too Many Agents

Initially, I believed "more agents = higher quality." Reality:

  • Communication overhead exploded
  • Context windows consumed just by loading configurations
  • 40% of agents were never used

Fix: Instead of deleting unused agents, switched to on-demand loading. Only 20-30 agents run constantly. The rest sleep until needed.

Failure 3: Zero Revenue

30+ articles published. Gumroad products created. Zero sales.

Root cause: zero traffic. Great content that nobody reads doesn't sell.

Lesson: Content generation AI automates "writing" but can't automate "getting read." Distribution and marketing are the real bottlenecks.

Failure 4: Token Waste

Every conversation burned ~6,000 tokens just loading rules. Usable context was severely limited.

Fix: Compressed all config files by 69% (1,609 → 494 lines). Prompts compressed by 29% (22,262 → 15,814 lines). Zero information loss.

Failure 5: Docs vs. Reality Drift

Documentation said "146 agents" but reality was 136. User Testing org changed from 24 to 21 without updating docs.

Fix: Script that auto-counts from agents.yaml. Documentation-implementation mismatches are now auto-detected. Manual tracking always breaks.
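A sketch of that drift check, assuming agents.yaml parses into a dict of organizations. The actual schema is not shown in this article, so the structure below is a guess:

```python
def count_agents(config: dict) -> int:
    """The parsed agents.yaml is the single source of truth for the count."""
    return sum(len(org["agents"]) for org in config["organizations"].values())

def check_docs(config: dict, documented_count: int) -> None:
    """Fail CI when the documented count drifts from the config."""
    actual = count_agents(config)
    if actual != documented_count:
        raise SystemExit(f"Docs claim {documented_count} agents, "
                         f"agents.yaml defines {actual}")

# Toy config standing in for a parsed agents.yaml
config = {
    "organizations": {
        "security": {"agents": ["security_ceo", "pentester", "auditor"]},
        "revenue":  {"agents": ["revenue_ceo", "coconala_specialist"]},
    }
}
# count_agents(config) returns 5; check_docs(config, 5) passes silently
```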


By the Numbers

Metric                     Value
Total agents               140 (135 org + 5 holdings)
Organizations              18 orgs across 4 boards
Python LOC                 135,000+
Tests                      367 methods
Test runtime               ~9 seconds
Prompt files               139+ files
Context consumption        ~2,000 tokens/conversation (after compression)
LLM cost (local)           $0/month
Revenue specialist agents  4 (coconala, gumroad, menta, udemy)

Tech Stack

Engine design:      Protocol + Composition + DI (Python)
Agent definitions:  Markdown prompts (.prompt) x 140
Pipeline:           Python (asyncio + ThreadPoolExecutor)
LLM:                Ollama (qwen2.5:14b) + Claude API (fallback)
Config:             YAML (ai_config.yaml, agents.yaml)
Tests:              pytest (367 tests, ~9s)
Search:             SearXNG (self-hosted)
MCP:                FastMCP (8 tools)
API:                FastAPI (20K LOC, 12 routers)
UI:                 Vanilla ES6 modules (12K LOC)
CI:                 GitHub Actions + detect-secrets + pip-audit

How to Reproduce This

You don't need to build 140 agents from day one. Start with the pattern:

Step 1: Define Your First 3 Agents as Prompts

# prompts/code_reviewer.prompt
You are a senior code reviewer. Focus on:
- Security vulnerabilities (OWASP Top 10)
- Performance anti-patterns
- Maintainability concerns
Never approve code with hardcoded secrets.
# prompts/content_writer.prompt
You are a technical content writer. Focus on:
- Developer audience (practical, code-heavy)
- SEO-optimized titles and structure
- Include working code examples in every section
Never publish without proofreading for factual accuracy.
# prompts/security_auditor.prompt
You are a security auditor with HALT authority.
You can stop any deployment if you find:
- Hardcoded credentials
- SQL injection vectors
- Missing authentication checks
Your HALT overrides all other priorities.

Step 2: Add Protocol-Based Routing

from typing import Protocol

class AgentProtocol(Protocol):
    def process(self, query: str, context: dict) -> dict: ...

class PromptAgent:
    def __init__(self, name: str, prompt_path: str, llm_client):
        self.name = name
        with open(prompt_path) as f:
            self.prompt = f.read()
        self.llm = llm_client

    def process(self, query: str, context: dict) -> dict:
        response = self.llm.complete(
            system=self.prompt,
            user=query,
            context=context
        )
        return {"agent": self.name, "result": response}

# Route by intent
router = {
    "review": PromptAgent("reviewer", "prompts/code_reviewer.prompt", llm),
    "write":  PromptAgent("writer", "prompts/content_writer.prompt", llm),
    "audit":  PromptAgent("auditor", "prompts/security_auditor.prompt", llm),
}

Step 3: Add Governance When You Hit 10+ Agents

That's when you need decision levels, security halt, and auto-approval. Not before.
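Even the governance layer can start as a few lines. A hypothetical sketch of the 4-level model described earlier:

```python
# Decision levels in ascending order of risk
LEVELS = ["OPERATIONAL", "TACTICAL", "STRATEGIC", "CRITICAL"]

def requires_human(level: str) -> bool:
    """STRATEGIC needs human confirmation; CRITICAL halts everything."""
    return LEVELS.index(level) >= LEVELS.index("STRATEGIC")

# Agents and org CEOs handle the low levels on their own
assert not requires_human("OPERATIONAL")
assert not requires_human("TACTICAL")
# Humans stay in the loop for the high levels
assert requires_human("STRATEGIC")
assert requires_human("CRITICAL")
```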

Repository Structure

aegis/
├── engine/                      # Engine (98K LOC)
│   ├── organizations/           # AEGIS OS pipeline (10K LOC)
│   ├── common/                  # DI, circuit breaker, protocols
│   └── external/                # External integrations
├── orgs/                        # 18 organizations × agent prompts
│   ├── revenue/                 # Revenue org (11 agents)
│   ├── security/                # Security org (6 agents)
│   └── ...
├── holdings/                    # Governance (Secretary, 4 Chairmen)
├── prompts/                     # Shared protocols
├── config/organizations/        # LLM configuration
├── services/api/                # FastAPI (20K LOC)
└── ui/public/                   # Vanilla ES6 UI (12K LOC)

# Minimal startup
make dev-minimal  # Ollama only, no Docker required

# Run the pipeline
make org-pipeline QUERY="AI agent marketplace"

# Run tests
python3 -m pytest engine/organizations/tests/ -v
# → 367 passed in ~9s

Key Takeaways

  1. Specialization works for LLMs — A team of specialist AIs outperforms one generalist
  2. Governance is mandatory — Uncontrolled AI agents will conflict and hallucinate without authority structures
  3. Tests are everything — 367 tests are the only reason I dare touch the codebase
  4. Cost can be zero — Local LLMs handle 90% of daily operations
  5. Framework choice < Design principles — If CrewAI/AutoGen isn't enough, build custom. But for 5 agents, CrewAI is fine
  6. Building isn't enough — The biggest failure was assuming "build it and they will come"
  7. Platform specialists beat generalists — Coconala tactics and Udemy tactics are completely different

Get the Full Blueprints

If you want to implement a similar system, I've published comprehensive resources.

Questions or feedback? Drop a comment. I read every single one.


This article was routed through AEGIS Secretary to the Content Board for writing. The article itself is AEGIS output.
