<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Khaled Elazab</title>
    <description>The latest articles on DEV Community by Khaled Elazab (@ikhaled_elazab).</description>
    <link>https://dev.to/ikhaled_elazab</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3893070%2F0ec9503e-c182-4385-8c68-762e3b3853f6.jpg</url>
      <title>DEV Community: Khaled Elazab</title>
      <link>https://dev.to/ikhaled_elazab</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ikhaled_elazab"/>
    <language>en</language>
    <item>
      <title>claude-nexus-hyper-agent-team: We Built a 31-Agent AI Team That Hires Itself, Critiques Itself, and Dreams</title>
      <dc:creator>Khaled Elazab</dc:creator>
      <pubDate>Wed, 22 Apr 2026 23:01:44 +0000</pubDate>
      <link>https://dev.to/ikhaled_elazab/we-built-a-31-agent-ai-team-that-hires-itself-critiques-itself-and-dreams-35dj</link>
      <guid>https://dev.to/ikhaled_elazab/we-built-a-31-agent-ai-team-that-hires-itself-critiques-itself-and-dreams-35dj</guid>
      <description>&lt;p&gt;&lt;em&gt;An honest engineering writeup of a self-evolving multi-agent system we built on top of Claude Code — complete with a parallel cognitive layer, dynamic hiring pipeline, and 341 passing structural contract tests. Source code is open — come break it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;[Architecture Overview — from user request through CTO orchestration, 6-tier specialization, non-skippable verification gates, Pattern F compounding loop, and Shadow Mind parallel cognition]&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fasiflow%2Fclaude-nexus-hyper-agent-team%2Frefs%2Fheads%2Fmain%2Fdocs%2Fdiagrams%2Fdiagram-1-hero.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fasiflow%2Fclaude-nexus-hyper-agent-team%2Frefs%2Fheads%2Fmain%2Fdocs%2Fdiagrams%2Fdiagram-1-hero.png" width="800" height="1353"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://raw.githubusercontent.com/asiflow/claude-nexus-hyper-agent-team/refs/heads/main/docs/diagrams/diagram-1-hero.png" rel="noopener noreferrer"&gt;Diagram 1&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem With Single-Agent LLMs
&lt;/h2&gt;

&lt;p&gt;Every week, another "AI agent" framework ships with breathless claims about autonomous reasoning. Most of them share the same shape: one LLM, a system prompt, maybe a tool-calling loop, and a marketing page that uses the word &lt;em&gt;agentic&lt;/em&gt; four times.&lt;/p&gt;

&lt;p&gt;The deeper you go, the more you realize what's missing. There's no &lt;strong&gt;specialization&lt;/strong&gt; — one agent pretending to be five. No &lt;strong&gt;cross-verification&lt;/strong&gt; — findings go unchallenged. No &lt;strong&gt;memory calibration&lt;/strong&gt; — the system treats every agent's output as equally trustworthy. No &lt;strong&gt;self-improvement&lt;/strong&gt; — prompts stay static until a human rewrites them. And above all, no &lt;strong&gt;team&lt;/strong&gt;. Just a lone reasoner pretending otherwise.&lt;/p&gt;

&lt;p&gt;We wanted something different. Not a bigger model. Not a more elaborate chain. A real &lt;strong&gt;team&lt;/strong&gt; — specialists with distinct domains, trust calibrated by outcomes, a meta-cognitive layer that lets the system improve itself, and the ability to &lt;em&gt;grow&lt;/em&gt; by hiring new specialists when it detects gaps in its own coverage.&lt;/p&gt;

&lt;p&gt;After several months of iteration, I shipped a 31-agent system built on top of Claude Code — and in one recent session, taught it how to grow an &lt;strong&gt;unconscious mind&lt;/strong&gt; that runs in parallel to the conscious team.&lt;/p&gt;

&lt;p&gt;This post is my honest writeup of what I built, what works, what's still unproven, and what I learned about engineering cognition at this scale. Not a marketing piece. Not "look what my agent wrote for me." Real engineering discipline applied to LLM agents — and all the sharp edges that came with it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/asiflow/claude-nexus-hyper-agent-team" rel="noopener noreferrer"&gt;https://github.com/asiflow/claude-nexus-hyper-agent-team&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Platform:&lt;/strong&gt; Built using &lt;a href="https://claude.com/claude-code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; + Claude Opus 4.7 (the 1M-token-context variant)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Status:&lt;/strong&gt; Open-source, 341/341 contract tests passing, publication-ready&lt;/p&gt;
&lt;h2&gt;
  
  
  The Architecture: 31 Agents Across 8 Tiers
&lt;/h2&gt;

&lt;p&gt;On the surface, the team looks like a table of names. But the table is carefully structured, and the structure matters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TIER 1 — BUILDERS (6):
  elite-engineer         Full-stack Go/Python/TS implementation
  ai-platform-architect  AI/ML systems, agent architecture, LLM infrastructure
  frontend-platform-eng  Frontend React/Next.js, streaming UX
  beam-architect         BEAM kernel — OTP/Horde/Ra, Rust NIFs via Rustler
  elixir-engineer        Elixir/Phoenix/LiveView on BEAM (pair-dispatched ee-1/ee-2)
  go-hybrid-engineer     Plane 2 Go edge + gRPC boundary

TIER 2 — GUARDIANS (11):
  go-expert, python-expert, typescript-expert — Language authorities
  deep-qa                Code quality, architecture drift
  deep-reviewer          Security, debugging, deployment safety
  infra-expert           K8s/GKE/Terraform/Istio
  database-expert        PostgreSQL/Redis/Firestore
  observability-expert   Logs/traces/metrics/SLO
  test-engineer          Test architecture + writes test code
  api-expert             GraphQL Federation, API design
  beam-sre               BEAM cluster operations on Kubernetes

TIER 3 — STRATEGISTS (2):
  deep-planner           Task decomposition, acceptance criteria
  orchestrator           Workflow supervision, gate enforcement

TIER 4 — INTELLIGENCE (6):
  memory-coordinator     Cross-agent memory synthesis
  cluster-awareness      Live GKE state via kubectl
  benchmark-agent        Competitive intelligence
  erlang-solutions-consultant  External BEAM advisory retainer
  talent-scout           Continuous team-coverage gap detection
  intuition-oracle       Shadow Mind query surface

TIER 5 — META-COGNITIVE (2):
  meta-agent             Prompt evolution, single-writer authority
  recruiter              8-phase hiring pipeline

TIER 6 — GOVERNANCE (1):
  session-sentinel       Protocol compliance enforcement

TIER 7 — CTO (1):
  cto                    Supreme technical authority

TIER 8 — VERIFICATION (2):
  evidence-validator     Claim verification against source
  challenger             Adversarial review of synthesis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each agent has a substantial prompt, ranging from 27 KB for focused consultants to 67 KB for the most senior architects. Each one has a dedicated memory directory. Each one is &lt;strong&gt;structurally contract-tested&lt;/strong&gt; on every commit.&lt;/p&gt;

&lt;p&gt;That last part matters more than you might think.&lt;/p&gt;




&lt;h2&gt;
  
  
  Innovation 1: Contract-Tested Agent Prompts
&lt;/h2&gt;

&lt;p&gt;Most agent systems claim "we test our agents." Push on that and you'll usually find they mean "we ran the agents once and they didn't crash."&lt;/p&gt;

&lt;p&gt;We wanted something harder.&lt;/p&gt;

&lt;p&gt;Our contract test suite enforces &lt;strong&gt;11 structural invariants&lt;/strong&gt; on every agent prompt, run on every commit via a pre-commit hook:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Valid YAML frontmatter (&lt;code&gt;name&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;, &lt;code&gt;model&lt;/code&gt;, &lt;code&gt;color&lt;/code&gt;, &lt;code&gt;memory&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Description length ≥ 100 chars with usage examples&lt;/li&gt;
&lt;li&gt;Body length ≥ 500 chars&lt;/li&gt;
&lt;li&gt;4-section closing protocol present (MEMORY HANDOFF, EVOLUTION SIGNAL, CROSS-AGENT FLAG, DISPATCH RECOMMENDATION)&lt;/li&gt;
&lt;li&gt;NEXUS Protocol section documented&lt;/li&gt;
&lt;li&gt;Team Coordination Discipline block present&lt;/li&gt;
&lt;li&gt;AGENT TEAM INTELLIGENCE PROTOCOL v2 roster table&lt;/li&gt;
&lt;li&gt;Persistent Agent Memory footer&lt;/li&gt;
&lt;li&gt;Working Process or Output Protocol section&lt;/li&gt;
&lt;li&gt;Self-Awareness &amp;amp; Learning Protocol section&lt;/li&gt;
&lt;li&gt;Dispatch Mode Detection block&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Across 31 agents × 11 contracts = &lt;strong&gt;341 assertions&lt;/strong&gt;, all passing on every merge.&lt;/p&gt;
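To make the structural floor concrete, here is a minimal Python sketch of what one such contract check can look like. This is a hypothetical re-implementation of invariants 1 and 3 only, not the repo's actual `tests/agents/run_contract_tests.py`:

```python
import re

# Hypothetical sketch of invariants 1 and 3 from the list above; the real
# suite lives in tests/agents/run_contract_tests.py and enforces all 11.
REQUIRED_FRONTMATTER = {"name", "description", "model", "color", "memory"}
MIN_BODY_CHARS = 500

def check_frontmatter(prompt_text: str) -> list[str]:
    """Return the list of contract violations for one agent prompt."""
    violations = []
    match = re.match(r"^---\n(.*?)\n---\n(.*)$", prompt_text, re.DOTALL)
    if not match:
        return ["missing YAML frontmatter"]
    frontmatter, body = match.groups()
    keys = {line.split(":", 1)[0].strip()
            for line in frontmatter.splitlines() if ":" in line}
    for key in sorted(REQUIRED_FRONTMATTER - keys):
        violations.append(f"frontmatter missing '{key}'")
    if len(body) < MIN_BODY_CHARS:
        violations.append(f"body shorter than {MIN_BODY_CHARS} chars")
    return violations

good = "---\nname: x\ndescription: d\nmodel: m\ncolor: c\nmemory: y\n---\n" + "x" * 600
print(check_frontmatter(good))              # → []
print(check_frontmatter("no frontmatter"))  # → ['missing YAML frontmatter']
```

The point is that every check is a pure function of the prompt file, so the whole suite runs in a pre-commit hook in well under a second.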

&lt;p&gt;This isn't "testing behavior" — we'll get to that limitation shortly. But it &lt;em&gt;is&lt;/em&gt; a structural floor. A new agent can't join the team without passing the same shape tests as the incumbents. When someone adds a new capability, the contract tests catch accidental protocol skips before they merge.&lt;/p&gt;

&lt;p&gt;This single discipline — structural contracts on prompts — does more to keep the system coherent than any amount of code review. Skip it and you get the usual agent-framework sprawl where every prompt drifts in a slightly different direction until nothing talks to anything.&lt;/p&gt;




&lt;h2&gt;
  
  
  Innovation 2: The NEXUS Syscall Protocol
&lt;/h2&gt;

&lt;p&gt;[Dispatch lifecycle — full trace from user input through NEXUS syscalls, hook enforcement, evidence validation, challenger gating, to Pattern F drain]&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fasiflow%2Fclaude-nexus-hyper-agent-team%2Frefs%2Fheads%2Fmain%2Fdocs%2Fdiagrams%2Fdiagram-2-dispatch-lifecycle.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fasiflow%2Fclaude-nexus-hyper-agent-team%2Frefs%2Fheads%2Fmain%2Fdocs%2Fdiagrams%2Fdiagram-2-dispatch-lifecycle.png" width="800" height="532"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://raw.githubusercontent.com/asiflow/claude-nexus-hyper-agent-team/refs/heads/main/docs/diagrams/diagram-2-dispatch-lifecycle.png" rel="noopener noreferrer"&gt;Diagram 2&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The biggest coordination problem in multi-agent systems is: &lt;strong&gt;how do specialists request privileged operations without becoming a security hole?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If every agent can spawn other agents, install tools, run background jobs, ask the user questions — you've created 31 independent actors that can invoke arbitrary capabilities. That's not a team; that's chaos.&lt;/p&gt;

&lt;p&gt;Our answer is NEXUS: a syscall-style protocol where teammates emit structured requests via &lt;code&gt;SendMessage&lt;/code&gt;, and the main thread (which we call &lt;em&gt;the kernel&lt;/em&gt;) processes them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;From&lt;/span&gt; &lt;span class="nx"&gt;inside&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;running&lt;/span&gt; &lt;span class="nx"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="nc"&gt;SendMessage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;lead&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;[NEXUS:SPAWN] elite-engineer | name=ee-sse-fix | prompt=Fix SSE bug&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Main&lt;/span&gt; &lt;span class="nx"&gt;thread&lt;/span&gt; &lt;span class="nx"&gt;sees&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;NEXUS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nx"&gt;SPAWN&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="nx"&gt;prefix&lt;/span&gt; &lt;span class="nx"&gt;and&lt;/span&gt; &lt;span class="nx"&gt;routes&lt;/span&gt; &lt;span class="nx"&gt;it&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;subagent_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;elite-engineer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ee-sse-fix&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;team_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;current-session&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Fix SSE bug&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Response&lt;/span&gt; &lt;span class="nx"&gt;back&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;requesting&lt;/span&gt; &lt;span class="nx"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="nc"&gt;SendMessage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;original-agent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;[NEXUS:OK] ee-sse-fix spawned&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The syscall vocabulary is small and auditable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SPAWN&lt;/code&gt; — create new teammate&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SCALE&lt;/code&gt; — spawn N parallel instances&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;RELOAD&lt;/code&gt; — respawn agent with fresh prompt&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MCP&lt;/code&gt; — install external capability&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ASK&lt;/code&gt; — request user input&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CRON&lt;/code&gt; — schedule recurring task&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;WORKTREE&lt;/code&gt; — create isolated git worktree&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;INTUIT&lt;/code&gt; — query the Shadow Mind (more on this below)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PERSIST&lt;/code&gt; — store durable cross-session data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every call is auto-logged. Every high-risk call requires user confirmation. And crucially: &lt;strong&gt;agents have role-specific allowlists&lt;/strong&gt;. A consulting-advisor agent can only use &lt;code&gt;PERSIST&lt;/code&gt; and &lt;code&gt;CAPABILITIES?&lt;/code&gt;. The oracle can only use those too. Builders get the full set. This role-specific syscall discipline isn't something I've seen in any other agent architecture — and it prevents a lot of the silent-capability-creep problems that plague flat agent systems.&lt;/p&gt;
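A minimal sketch of that routing-plus-allowlist discipline follows. The verb set comes from the list above; the allowlist contents and function names are illustrative, not the repo's actual kernel code:

```python
import re

# Hedged sketch of per-role syscall allowlists; the verb set comes from the
# NEXUS vocabulary above, but these allowlist entries are illustrative.
FULL_SET = {"SPAWN", "SCALE", "RELOAD", "MCP", "ASK", "CRON",
            "WORKTREE", "INTUIT", "PERSIST", "CAPABILITIES?"}

ALLOWLISTS = {
    "elite-engineer": FULL_SET,                                   # builders: full set
    "erlang-solutions-consultant": {"PERSIST", "CAPABILITIES?"},  # consultant
    "intuition-oracle": {"PERSIST", "CAPABILITIES?"},             # oracle
}

def route_syscall(agent: str, message: str) -> str:
    """Parse a [NEXUS:VERB] prefix and enforce the caller's allowlist."""
    m = re.match(r"\[NEXUS:([A-Z?]+)\]", message)
    if not m:
        return "not a syscall"
    verb = m.group(1)
    if verb not in ALLOWLISTS.get(agent, set()):
        return f"[NEXUS:DENIED] {agent} may not use {verb}"
    return f"[NEXUS:OK] {verb} accepted"

print(route_syscall("intuition-oracle", "[NEXUS:SPAWN] elite-engineer | prompt=x"))
# → [NEXUS:DENIED] intuition-oracle may not use SPAWN
```

Because denial happens at the routing layer rather than inside each agent's prompt, a prompt regression can never silently widen an agent's capabilities.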




&lt;h2&gt;
  
  
  Innovation 3: Dynamic Hiring — The Team That Grows Itself
&lt;/h2&gt;

&lt;p&gt;Here's the problem this solves: you're working on an AWS migration, and the team doesn't have an AWS specialist. What happens?&lt;/p&gt;

&lt;p&gt;In most agent frameworks, you get a generic engineer who vaguely knows AWS. Findings are shallow. Errors compound. You eventually realize you need a specialist, pause the work, manually research AWS best practices, write a new prompt, register it — hours of infrastructure work before you can continue.&lt;/p&gt;

&lt;p&gt;We automated this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;talent-scout&lt;/code&gt;&lt;/strong&gt; continuously watches five signals for coverage gaps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Repo signature analysis&lt;/strong&gt; — scans file extensions, Dockerfiles, deps, Terraform providers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dispatch pattern analysis&lt;/strong&gt; — counts fallbacks to generic agents in each domain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trust-ledger anomalies&lt;/strong&gt; — detects where existing agents produce low-confidence findings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External trend sensing&lt;/strong&gt; — job postings, framework adoption, CVE clusters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User behavior patterns&lt;/strong&gt; — repeated domain mentions without specialist engagement&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each signal has a weight. Confidence ≥ 0.90 AND &lt;code&gt;session-sentinel&lt;/code&gt; co-sign? → Auto-initiate requisition. Below 90%? → Ask the user. Below 70%? → Watchlist.&lt;/p&gt;
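The threshold logic above can be sketched as a weighted sum. The thresholds come from the text; the per-signal weights below are invented for illustration and are not the repo's calibrated values:

```python
# Illustrative 5-signal weighted gap score; the weights here are invented
# for the example, only the threshold policy follows the description above.
SIGNAL_WEIGHTS = {
    "repo_signature": 0.30,
    "dispatch_fallbacks": 0.25,
    "trust_anomalies": 0.20,
    "external_trends": 0.15,
    "user_patterns": 0.10,
}

def gap_confidence(signals: dict[str, float]) -> float:
    """Weighted sum of per-signal scores, each in [0, 1]."""
    return sum(SIGNAL_WEIGHTS[name] * score for name, score in signals.items())

def next_action(confidence: float, sentinel_cosign: bool) -> str:
    if confidence >= 0.90 and sentinel_cosign:
        return "auto-initiate requisition"
    if confidence >= 0.70:
        return "ask the user"
    return "watchlist"

conf = gap_confidence({"repo_signature": 1.0, "dispatch_fallbacks": 0.9,
                       "trust_anomalies": 0.8, "external_trends": 0.7,
                       "user_patterns": 0.6})
print(round(conf, 3), next_action(conf, sentinel_cosign=True))
```

Note that the co-sign requirement means a high score alone never auto-initiates a hire; without `session-sentinel` agreement the decision falls through to the user.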

&lt;p&gt;&lt;strong&gt;&lt;code&gt;recruiter&lt;/code&gt;&lt;/strong&gt; takes the requisition and runs an 8-phase pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parse requisition&lt;/li&gt;
&lt;li&gt;Deep-research the domain (WebSearch + WebFetch with citation trail)&lt;/li&gt;
&lt;li&gt;Mine scar-tissue from adjacent existing agents' memory&lt;/li&gt;
&lt;li&gt;Synthesize a prompt matching &lt;code&gt;AGENT_TEMPLATE.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run contract tests (3 iteration cap, then abort)&lt;/li&gt;
&lt;li&gt;Route through &lt;code&gt;challenger&lt;/code&gt; for adversarial review&lt;/li&gt;
&lt;li&gt;Hand off to &lt;code&gt;meta-agent&lt;/code&gt; for atomic registration&lt;/li&gt;
&lt;li&gt;Track probation for 5 dispatches&lt;/li&gt;
&lt;/ol&gt;
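As a toy model, the pipeline reads as a gated sequence with phases 5 and 6 as hard gates. Every helper below is a stand-in; only the gate ordering, the 3-iteration cap, and the probation outcome come from the description above:

```python
# Toy, self-contained sketch of the 8-phase hiring pipeline as a gated
# sequence; every callable here is a stand-in, not the repo's real code.
def run_pipeline(requisition, synthesize, contract_ok, challenger_ok, register,
                 max_iterations=3):
    for attempt in range(1, max_iterations + 1):
        prompt = synthesize(requisition, attempt)        # phases 1-4 collapsed
        if contract_ok(prompt):                          # 5. structural gate
            break
    else:
        return "aborted: contract tests failed 3 times"  # hard cap, then abort
    if not challenger_ok(prompt):                        # 6. adversarial review
        return "rejected by challenger"
    register(prompt)                                     # 7. meta-agent single writer
    return "registered: probationary, dispatch cap 5"    # 8. probation begins

registry = []
result = run_pipeline(
    {"domain": "aws"},
    synthesize=lambda req, n: f"prompt-v{n}",
    contract_ok=lambda p: p.endswith("v2"),   # passes on the 2nd iteration
    challenger_ok=lambda p: True,
    register=registry.append,
)
print(result, registry)  # → registered: probationary, dispatch cap 5 ['prompt-v2']
```

The shape matters more than the stubs: synthesis can retry, but registration is reachable only through both gates, and only the registrar touches the registry.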

&lt;p&gt;At every step, discipline is enforced. &lt;code&gt;recruiter&lt;/code&gt; never writes to the agent files directly — that's &lt;code&gt;meta-agent&lt;/code&gt;'s single-writer authority, preserved so we never have concurrent prompt edits. &lt;code&gt;challenger&lt;/code&gt; attacks the proposal before it ships. The contract tests gate structural quality. And the probationary status in the trust ledger means the new hire has to &lt;em&gt;earn&lt;/em&gt; its weight through real outcomes before it's treated as fully trusted.&lt;/p&gt;

&lt;p&gt;The pipeline ran end-to-end for its first real hire on &lt;strong&gt;2026-04-19&lt;/strong&gt;: &lt;code&gt;elixir-kernel-engineer&lt;/code&gt;, a third Plane 1 BEAM builder to absorb throughput during a platform Foundation window. The CTO agent adjudicated between two paths — &lt;strong&gt;Path A&lt;/strong&gt; (scaling the existing &lt;code&gt;elixir-engineer&lt;/code&gt; dyadic pair to count=3) vs &lt;strong&gt;Path B&lt;/strong&gt; (a separate agent file with post-merge review) — using ten citations from the existing agent file to argue that Path A would break the dyadic pair-protocol's hardcoded assumptions. Path B was selected.&lt;/p&gt;

&lt;p&gt;Composite confidence was &lt;strong&gt;0.365&lt;/strong&gt;, below the standard 0.40 auto-initiate threshold — which correctly triggered the user-override path with a documented waiver rather than a silent bypass. The new agent is currently in probation: bootstrap trust 0.9, a dispatch cap of 5, and its first 3 dispatches gated pre-merge by &lt;code&gt;beam-architect&lt;/code&gt; for trust calibration. Probation dispatches will populate real verdicts as the Foundation window opens. &lt;strong&gt;Pipeline validation: complete. Operational trust calibration: in progress.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Innovation 4: The Shadow Mind — Parallel Non-Invasive Cognition
&lt;/h2&gt;

&lt;p&gt;This is the piece I'm most proud of. I originally shipped it labeled "most experimental" — but after 5 weeks of continuous operation, the telemetry has been strong enough that I'm updating the framing to match what the data actually shows.&lt;/p&gt;

&lt;p&gt;[Shadow Mind data flow — from live sessions through Observer, Pattern Computer, Speculator, Dreamer, to intuition-oracle INTUIT_RESPONSE v1]&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fasiflow%2Fclaude-nexus-hyper-agent-team%2Frefs%2Fheads%2Fmain%2Fdocs%2Fdiagrams%2Fdiagram-3-shadow-mind.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fasiflow%2Fclaude-nexus-hyper-agent-team%2Frefs%2Fheads%2Fmain%2Fdocs%2Fdiagrams%2Fdiagram-3-shadow-mind.png" width="800" height="206"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://raw.githubusercontent.com/asiflow/claude-nexus-hyper-agent-team/refs/heads/main/docs/diagrams/diagram-3-shadow-mind.png" rel="noopener noreferrer"&gt;Diagram 3&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem: our team, for all its structure, still behaves like a &lt;em&gt;cognitive compiler&lt;/em&gt;. Input comes in, agents reason, output goes out. Between dispatches, the team is effectively dead. There's no continuous thinking, no background pattern-matching, no "sleeping on it" the way human cognition works.&lt;/p&gt;

&lt;p&gt;In biological systems, this is solved by having two cognitive layers that run in parallel. The conscious mind deliberates sequentially — slow, precise, explicit. The unconscious mind runs continuously in the background — fast, associative, pattern-matching, dream-generating. Either can be disabled without destroying the other. The unconscious whispers to the conscious; it never interrupts.&lt;/p&gt;

&lt;p&gt;We built that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shadow Mind&lt;/strong&gt; is a parallel cognitive layer with six components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────────────────────────────────────┐
│           CONSCIOUS MIND (31-agent team, UNCHANGED)                 │
│   CTO → specialists → synthesis → output. Protocol-driven.          │
└────────────────────────────────────────────────────────────────────┘
                              ▲           │
                              │ whispers  │ observations
                              │           ▼
┌────────────────────────────────────────────────────────────────────┐
│           UNCONSCIOUS MIND (Shadow Mind, read-only)                 │
│                                                                     │
│   1. Observer Daemon     — tails signal bus, writes JSON logs       │
│   2. Pattern Computer    — derives n-grams, co-occurrences, temporal│
│   3. Pattern Library     — read-only substrate (populated by #2)    │
│   4. Speculator          — generates counterfactual variants        │
│   5. Dreamer             — proposes insights during long-idle       │
│   6. Intuition Oracle    — queryable surface, INTUIT_RESPONSE v1    │
└────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The critical property is disable-ability&lt;/strong&gt;: the conscious layer has &lt;em&gt;zero dependency&lt;/em&gt; on the unconscious layer. We verify this with a single test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mv&lt;/span&gt; .claude/agent-memory/shadow-mind/ /tmp/
python3 tests/agents/run_contract_tests.py
&lt;span class="c"&gt;# → 341 passed, 0 failed&lt;/span&gt;
&lt;span class="nb"&gt;mv&lt;/span&gt; /tmp/shadow-mind .claude/agent-memory/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If that test ever fails, we've accidentally coupled the layers and violated the architecture. It doesn't fail. The Shadow Mind can be removed entirely and the team keeps operating exactly as before.&lt;/p&gt;

&lt;h3&gt;
  
  
  How an agent consults the Shadow Mind
&lt;/h3&gt;

&lt;p&gt;Any existing agent can emit an optional &lt;code&gt;[NEXUS:INTUIT]&lt;/code&gt; syscall:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nc"&gt;SendMessage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;lead&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;[NEXUS:INTUIT] Has this auth middleware bug pattern appeared before?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The oracle reads observations, patterns, and dreams, and responds within ~2 seconds with a structured envelope:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;INTUIT_RESPONSE v1&lt;/span&gt;
&lt;span class="na"&gt;intent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pattern-lookup&lt;/span&gt;
&lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MEDIUM_CONFIDENCE&lt;/span&gt;
&lt;span class="na"&gt;sample_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;54&lt;/span&gt;
&lt;span class="na"&gt;temporal_structure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;last_7_days&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;54&lt;/span&gt;
  &lt;span class="na"&gt;last_30_days&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;54&lt;/span&gt;
&lt;span class="na"&gt;answer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
  &lt;span class="s"&gt;Pattern matched 3 times in last 90 days. Each resolved by&lt;/span&gt;
  &lt;span class="s"&gt;dispatching go-expert before elite-engineer (P=1.0, count=3).&lt;/span&gt;
  &lt;span class="s"&gt;The meta-recruiter → meta-talent sequence is the most&lt;/span&gt;
  &lt;span class="s"&gt;deterministic transition in the current corpus.&lt;/span&gt;
&lt;span class="na"&gt;top_matches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;case_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dreams/2026-04-18-collaboration-gap-c4e378.yaml&lt;/span&gt;
    &lt;span class="na"&gt;similarity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.88&lt;/span&gt;
    &lt;span class="na"&gt;outcome&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;UNKNOWN (Dreamer proposal, review_status&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pending)&lt;/span&gt;
&lt;span class="na"&gt;caveats&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;n-gram corpus is small&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;54 observations, 8 sessions&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Transitions with count=2 are at the floor&lt;/span&gt;
&lt;span class="na"&gt;shadow_mind_freshness&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;observer_last_heartbeat&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0h ago&lt;/span&gt;
  &lt;span class="na"&gt;staleness_flag&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FRESH&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things worth noting about this envelope:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Confidence is always explicit&lt;/strong&gt; — HIGH / MEDIUM / LOW / INSUFFICIENT_DATA. The oracle never fabricates certainty from sparse data. If there's no match, it returns INSUFFICIENT_DATA honestly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caveats are structural&lt;/strong&gt; — sample size, temporal structure, data-source lineage. A downstream parser can programmatically reason about reliability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Staleness is first-class&lt;/strong&gt; — if the Observer Daemon hasn't run in 24+ hours, the oracle returns &lt;code&gt;SHADOW_MIND_STALE&lt;/code&gt; rather than serving stale patterns. Agents can gracefully fall back.&lt;/li&gt;
&lt;/ol&gt;
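A caller might branch on exactly those three properties. The field names follow the envelope example above; the fallback policy itself is illustrative, not prescribed by the protocol:

```python
# Hedged sketch of a consumer of the INTUIT_RESPONSE envelope; field names
# follow the example above, the fallback policy is my own illustration.
def act_on_intuition(envelope: dict) -> str:
    if envelope.get("staleness_flag") == "SHADOW_MIND_STALE":
        return "fall back: proceed without intuition"      # property 3
    confidence = envelope.get("confidence", "INSUFFICIENT_DATA")
    if confidence == "INSUFFICIENT_DATA":
        return "fall back: no usable pattern"              # property 1
    if envelope.get("sample_size", 0) < 10:
        return "treat as hint only (tiny corpus)"          # property 2
    return f"use pattern at {confidence}"

envelope = {"confidence": "MEDIUM_CONFIDENCE", "sample_size": 54,
            "staleness_flag": "FRESH"}
print(act_on_intuition(envelope))  # → use pattern at MEDIUM_CONFIDENCE
```

Because every branch terminates in a defined behavior, an agent consulting the oracle degrades gracefully instead of blocking on the unconscious layer.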

&lt;h3&gt;
  
  
  What the Shadow Mind already surfaced
&lt;/h3&gt;

&lt;p&gt;On its first live activation, the Dreamer produced 27 insight candidates from just 54 observations. Three of them were genuinely useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A debug-loop detection: &lt;code&gt;cto-1&lt;/code&gt; emitted 5 evolution signals + 9 memory handoffs in one session with no resolution markers — a real cognitive loop that would have kept expanding without intervention.&lt;/li&gt;
&lt;li&gt;A collaboration-gap detection: two agents that appear frequently but are never co-dispatched. Worth evaluating whether a joint-dispatch pattern would help.&lt;/li&gt;
&lt;li&gt;A trust-drift flag: an agent that received 3 cross-agent flags in a short window. Below action threshold, but worth watching.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these patterns were coded explicitly. They emerged from the Observer's structured logs and the Dreamer's associative analysis. That's the unconscious layer doing exactly what I designed it to do.&lt;/p&gt;




&lt;h2&gt;
  
  
  Innovation Highlights — What's Genuinely Novel Here
&lt;/h2&gt;

&lt;p&gt;If you're scanning for the "what's actually new?" section, this is it. These are the patterns I haven't seen in other agent frameworks, at least not together:&lt;/p&gt;

&lt;h3&gt;
  
  
  🎯 1. Contract-tested agent prompts
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;341 structural assertions&lt;/strong&gt; (11 invariants × 31 agents) enforced on every commit via pre-commit hook. Prompt drift is a blocked state, not a future problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  🛰️ 2. NEXUS syscall protocol with role-specific allowlists
&lt;/h3&gt;

&lt;p&gt;Agents emit structured syscalls via &lt;code&gt;SendMessage&lt;/code&gt;; the main thread is the kernel. The restrictive part: &lt;strong&gt;each agent has its own allowlist&lt;/strong&gt;. Consultants get &lt;code&gt;PERSIST + CAPABILITIES?&lt;/code&gt; only. The oracle gets the same. Builders get the full set. Per-role syscall discipline is something I haven't seen in any other agent system — and it's the cleanest way to prevent silent capability-creep.&lt;/p&gt;

&lt;h3&gt;
  
  
  📊 3. Trust ledger with Bayesian priors + lifecycle status
&lt;/h3&gt;

&lt;p&gt;New hires start at &lt;code&gt;probationary 0.9&lt;/code&gt;. They earn promotion to &lt;code&gt;active&lt;/code&gt; through 5 successful dispatches with &amp;lt;25% refutation rate. Fail the bar → auto-proposal for retirement. Trust isn't vibes; it's calibrated by outcomes and tracked in a queryable JSON schema.&lt;/p&gt;
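&lt;p&gt;A minimal sketch of that lifecycle rule. The prior strength (10 pseudo-observations) and the exact blend are assumptions; the post specifies only the 0.9 starting weight, the 5-dispatch bar, and the 25% refutation threshold.&lt;/p&gt;

```python
# Assumed prior: 0.9 starting trust backed by 10 pseudo-observations.
PRIOR_TRUST = 0.9
PRIOR_STRENGTH = 10

def blended_trust(confirmed, refuted):
    """Bayesian-style blend of the prior with observed verdicts."""
    total = confirmed + refuted
    return (PRIOR_TRUST * PRIOR_STRENGTH + confirmed) / (PRIOR_STRENGTH + total)

def lifecycle_status(dispatches, refuted):
    """Apply the probation rule: 5 dispatches, 25% refutation bar."""
    rate = refuted / dispatches if dispatches else 0.0
    if dispatches >= 5:
        # promote under the bar, propose retirement over it
        return "active" if 0.25 > rate else "retirement-proposed"
    return "probationary"

assert lifecycle_status(5, 1) == "active"                # 20% refutation: promoted
assert lifecycle_status(5, 2) == "retirement-proposed"   # 40%: fails the bar
assert lifecycle_status(3, 0) == "probationary"          # not enough dispatches yet
```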

&lt;h3&gt;
  
  
  👥 4. Pair Protocol for paired dispatch
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;elixir-engineer&lt;/code&gt; agent scales to &lt;code&gt;ee-1 / ee-2&lt;/code&gt; via &lt;code&gt;[NEXUS:SCALE count=2]&lt;/code&gt; — and both instances peer-review each other's diffs before merge. It's pair programming as a dispatch pattern. Same prompt file, two runtime instances, mandatory mutual review gate.&lt;/p&gt;

&lt;h3&gt;
  
  
  🧠 5. Shadow Mind — parallel non-invasive cognition
&lt;/h3&gt;

&lt;p&gt;Six-component unconscious layer that observes, learns patterns, speculates, and dreams — all without modifying any conscious-layer agent. &lt;strong&gt;Disable-ability is verified&lt;/strong&gt;: &lt;code&gt;mv shadow-mind/ /tmp/&lt;/code&gt; → tests still pass 341/341. This is the architectural discipline most "extensible" systems can't prove.&lt;/p&gt;

&lt;h3&gt;
  
  
  🎓 6. Dynamic hiring pipeline
&lt;/h3&gt;

&lt;p&gt;Team can detect its own coverage gaps (5-signal weighted scoring) and initiate hiring of new specialist agents through an 8-phase pipeline (requisition → research → synthesis → contract validation → adversarial review → atomic registration → probation → retirement). The team grows its own headcount.&lt;/p&gt;

&lt;h3&gt;
  
  
  ⚔️ 7. Adversarial self-review
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;challenger&lt;/code&gt; agent doesn't just attack other agents' outputs — it attacks the reviewer's reasoning. When I wrote an initial "honest caveats" section, &lt;code&gt;challenger&lt;/code&gt; caught me in self-serving humility bias, corrected my evidence errors (my byte counts were off by 2.6×), and made me regrade the system from B+ to A-. The team stress-tests its own creator.&lt;/p&gt;

&lt;h3&gt;
  
  
  📜 8. Canonical signal-bus entry format as a contract
&lt;/h3&gt;

&lt;p&gt;Every cross-agent finding, memory handoff, and evolution signal uses the same regex-parseable format (&lt;code&gt;- (YYYY-MM-DD, agent=X, session=Y) content&lt;/code&gt;). Downstream parsers (Observer Daemon, Pattern Computer, oracle) depend on this format being stable. Drift is silent failure — so the format is codified as a &lt;em&gt;contract&lt;/em&gt; in &lt;code&gt;AGENT_TEMPLATE.md&lt;/code&gt;, not just a convention.&lt;/p&gt;
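&lt;p&gt;For illustration, a minimal parser for that format. The regex is derived from the format string quoted above, not copied from the repo:&lt;/p&gt;

```python
import re

# One regex, shared by every downstream consumer of the signal bus.
ENTRY_RE = re.compile(
    r"^- \((\d{4}-\d{2}-\d{2}), agent=([^,]+), session=([^)]+)\) (.+)$"
)

def parse_entry(line):
    """Return (date, agent, session, content), or None if the line drifts."""
    m = ENTRY_RE.match(line)
    return m.groups() if m else None

entry = "- (2026-04-19, agent=cto-1, session=s42) Flagged unresolved topology choice"
date, agent, session, content = parse_entry(entry)
assert agent == "cto-1"
assert parse_entry("- malformed entry") is None   # drift is detectable, not silent
```

&lt;p&gt;Because every consumer shares one regex, a drifted entry fails loudly at parse time instead of silently corrupting the corpus.&lt;/p&gt;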

&lt;h3&gt;
  
  
  🔐 9. Single-writer invariant over agent prompts
&lt;/h3&gt;

&lt;p&gt;Only &lt;code&gt;meta-agent&lt;/code&gt; can write to &lt;code&gt;.claude/agents/*.md&lt;/code&gt;. Not &lt;code&gt;recruiter&lt;/code&gt;, not &lt;code&gt;cto&lt;/code&gt;, not &lt;code&gt;elite-engineer&lt;/code&gt;. This prevents concurrent prompt edits and makes prompt changes atomic + auditable. &lt;code&gt;recruiter&lt;/code&gt; drafts new agents into a scratch directory and hands off to &lt;code&gt;meta-agent&lt;/code&gt; for the actual registration.&lt;/p&gt;
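&lt;p&gt;A sketch of how such a single-writer check might look inside a hook, assuming the hook can attribute each change to an agent. The function names and the attribution mechanism are hypothetical:&lt;/p&gt;

```python
import fnmatch

# Hypothetical single-writer check: only meta-agent may touch agent prompts.
PROTECTED_GLOB = ".claude/agents/*.md"
AUTHORIZED_WRITER = "meta-agent"

def writes_allowed(changed_files, author_agent):
    """Block any prompt edit not authored by the authorized writer."""
    for path in changed_files:
        if fnmatch.fnmatch(path, PROTECTED_GLOB) and author_agent != AUTHORIZED_WRITER:
            return False
    return True

assert writes_allowed([".claude/agents/go-expert.md"], "meta-agent")
assert not writes_allowed([".claude/agents/go-expert.md"], "recruiter")
assert writes_allowed(["src/app.py"], "recruiter")   # non-prompt files unaffected
```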

&lt;h3&gt;
  
  
  🌐 10. Delete-to-disable architecture
&lt;/h3&gt;

&lt;p&gt;Every advanced capability is independently removable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shadow Mind: &lt;code&gt;rm -rf shadow-mind/&lt;/code&gt; → team still works&lt;/li&gt;
&lt;li&gt;Dynamic hiring: delete &lt;code&gt;talent-scout.md&lt;/code&gt; + &lt;code&gt;recruiter.md&lt;/code&gt; → team still works&lt;/li&gt;
&lt;li&gt;Trust ledger: delete &lt;code&gt;ledger.py&lt;/code&gt; → team still works (at reduced calibration)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Users adopt only what they want. Complexity is opt-in.&lt;/p&gt;




&lt;h2&gt;
  
  
  Use Cases — What This Team Actually Does
&lt;/h2&gt;

&lt;p&gt;Concrete scenarios where the team is useful, written out as real dispatch patterns:&lt;/p&gt;
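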

&lt;h3&gt;
  
  
  🔍 Use Case 1: Parallel multi-expert code review
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;user: "Review this PR — it touches Go backend + React frontend + Postgres migration"

→ cto dispatches in parallel:
    • go-expert    (reviews Go idioms, concurrency patterns)
    • typescript-expert (reviews React component tree, type safety)
    • database-expert  (reviews migration safety, rollback compatibility)
    • deep-reviewer    (security + deployment safety cross-cutting)

→ Each returns findings via signal bus
→ evidence-validator verifies HIGH-severity claims against source
→ challenger reviews cto's synthesis before surfacing to user
→ user receives consolidated review with per-agent trust-weighted findings
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Time to full review: ~3 minutes parallel vs ~15 minutes serial with a single-agent approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  🏗️ Use Case 2: BEAM architecture design (Living Platform)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;user: "Design the Plane 1 OTP supervision topology for our per-session agent kernel"

→ cto routes to beam-architect (Tier 1 Builder)
→ beam-architect references apa-1 Wave 1 Option B topology from team memory
→ Produces 4-process SessionRoot design with Horde/Ra/pg cluster topology
→ Emits CROSS-AGENT FLAG to beam-sre for K8s deployment implications
→ Emits DISPATCH RECOMMENDATION: scale elixir-engineer to ee-1/ee-2 for implementation
→ Main thread dispatches [NEXUS:SCALE] elixir-engineer count=2 automatically
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the kind of multi-specialist architectural dance that fails in single-agent frameworks.&lt;/p&gt;

&lt;h3&gt;
  
  
  🎯 Use Case 3: Automated specialist detection + hiring proposal
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Over 5 sessions&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;user has mentioned "AWS CDK" 8 times&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;and team has&lt;/span&gt;
 &lt;span class="nv"&gt;fallen back to generic elite-engineer each time because no AWS specialist exists&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="s"&gt;→ talent-scout scans 5 signals (repo signature, dispatch patterns, trust-ledger&lt;/span&gt; 
  &lt;span class="s"&gt;anomalies on AWS claims, external trends, user behavior patterns)&lt;/span&gt;
&lt;span class="na"&gt;→ Computes confidence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.92&lt;/span&gt;
&lt;span class="s"&gt;→ Emits [NEXUS:ASK session-sentinel] for co-sign&lt;/span&gt;
&lt;span class="s"&gt;→ session-sentinel APPROVES&lt;/span&gt;
&lt;span class="na"&gt;→ Drafts requisition YAML&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws-cloud-engineer"&lt;/span&gt; &lt;span class="s"&gt;role spec&lt;/span&gt;
&lt;span class="s"&gt;→ Hands off to recruiter&lt;/span&gt;
&lt;span class="na"&gt;→ recruiter runs 8-phase pipeline&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="s"&gt;researches AWS Well-Architected Framework (15+ source citations)&lt;/span&gt;
    &lt;span class="s"&gt;synthesizes prompt matching AGENT_TEMPLATE.md&lt;/span&gt;
    &lt;span class="s"&gt;runs contract tests (11/11 pass)&lt;/span&gt;
    &lt;span class="s"&gt;routes through challenger (domain-overlap check with infra-expert)&lt;/span&gt;
    &lt;span class="s"&gt;hands off to meta-agent&lt;/span&gt;
&lt;span class="s"&gt;→ meta-agent atomically registers aws-cloud-engineer&lt;/span&gt;
&lt;span class="s"&gt;→ New agent enters probation (0.9 trust weight, status=probationary)&lt;/span&gt;
&lt;span class="s"&gt;→ After 5 dispatches at &amp;lt;25% refutation → auto-promoted to active&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The team hired its own specialist. The user signed off on the gap; the rest was automated.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔮 Use Case 4: Shadow Mind pattern-lookup for fast-path decisions
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;user: "Refactor the SSE buffering logic in the Go streaming service"

→ cto (inside teammate session) considers full Pattern A (plan→build→review→test→QA)
→ Before committing, emits [NEXUS:INTUIT] "Have we refactored SSE buffering before? 
   Which agents co-dispatched? What was the finding count?"

→ intuition-oracle (Shadow Mind) reads observations + patterns + dreams
→ Returns INTUIT_RESPONSE v1:
     confidence: MEDIUM
     sample_size: 12 similar refactors in corpus
     top_matches: 
       - go-expert → elite-engineer → test-engineer (P=0.83, past outcomes clean)
       - Average finding count: 2.1 HIGH + 4.5 MEDIUM
     caveats: corpus &amp;lt; 3 months old

→ cto adjusts Pattern A:
     dispatches go-expert first (pattern says they surface the key finding)
     skips deep-qa initially (adds noise without signal on SSE work)
     pre-warms test-engineer with SSE test matrix
→ Execution time: 40% faster than blind Pattern A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Shadow Mind whispered. CTO listened. The team shipped faster.&lt;/p&gt;
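&lt;p&gt;The oracle's lookup can be pictured as a frequency query over past dispatch chains. A toy sketch, with the data shape and task tags invented for illustration:&lt;/p&gt;

```python
from collections import Counter

# Toy corpus of (task_tag, dispatch_chain) observations -- shape assumed,
# not taken from the repo.
corpus = [
    ("sse-refactor", ("go-expert", "elite-engineer", "test-engineer")),
    ("sse-refactor", ("go-expert", "elite-engineer", "test-engineer")),
    ("sse-refactor", ("go-expert", "deep-qa", "test-engineer")),
]

def top_match(task_tag):
    """Most frequent past chain for this tag, with its empirical probability."""
    chains = Counter(chain for tag, chain in corpus if tag == task_tag)
    if not chains:
        return ("INSUFFICIENT_DATA", 0.0)
    chain, count = chains.most_common(1)[0]
    return (chain, count / sum(chains.values()))

chain, p = top_match("sse-refactor")
assert chain == ("go-expert", "elite-engineer", "test-engineer")
assert round(p, 2) == 0.67
```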

&lt;h3&gt;
  
  
  🚨 Use Case 5: Debug-loop detection (catching stuck cognitive patterns)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;During a long session&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cto-1 has emitted 5 evolution signals + 9 memory handoffs&lt;/span&gt;
 &lt;span class="nv"&gt;about the same architectural decision without resolution&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="s"&gt;→ Dreamer (runs during idle windows via CronCreate)&lt;/span&gt;
&lt;span class="s"&gt;→ Scans observations for unresolved signal clusters&lt;/span&gt;
&lt;span class="na"&gt;→ Detects&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;debug-loop&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cto-1&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;×&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;14&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;signals&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;same&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;topic,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;resolution&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;markers"&lt;/span&gt;
&lt;span class="s"&gt;→ Writes dream candidate YAML to dreams/&lt;/span&gt;
&lt;span class="na"&gt;→ proposed_to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;meta-agent, review_status&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pending&lt;/span&gt;
&lt;span class="s"&gt;→ Next session, meta-agent reads dreams/ queue&lt;/span&gt;
&lt;span class="na"&gt;→ Proposes to user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Evidence&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;unresolved&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;synthesis&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;loop&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;consider&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;dispatching&lt;/span&gt; 
   &lt;span class="s"&gt;challenger&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;or&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;verifier&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;break&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;it"&lt;/span&gt;
&lt;span class="s"&gt;→ User approves → challenger runs adversarial review&lt;/span&gt;
&lt;span class="s"&gt;→ Loop broken, decision closes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the unconscious mind catching a cognitive pattern the conscious mind couldn't see. It emerged from the observation data — no explicit rule coded.&lt;/p&gt;

&lt;h3&gt;
  
  
  🧪 Use Case 6: Production-grade prompt engineering without destroying anything
&lt;/h3&gt;

&lt;p&gt;Every change to the team is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reversible&lt;/strong&gt; — disable-ability invariant verified for all optional capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contract-tested&lt;/strong&gt; — 341 structural assertions gate every commit&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trust-tracked&lt;/strong&gt; — new patterns enter probationary status, earn trust through outcomes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Peer-reviewed&lt;/strong&gt; — challenger attacks before shipping, evidence-validator verifies claims&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory-persisted&lt;/strong&gt; — every finding flows through signal bus into agent-specific memory&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Meaning: I can experiment aggressively without fear of breaking the system. The architecture has opinions about what "safe to change" means, and it enforces them mechanically.&lt;/p&gt;




&lt;h2&gt;
  
  
  Validated by Real Outcomes
&lt;/h2&gt;

&lt;p&gt;The system has now been running against a real production codebase for &lt;strong&gt;5 weeks (2026-03-18 through 2026-04-21)&lt;/strong&gt;. What the data actually shows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Observed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Trust-ledger verdicts&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;67 evidence-validator verdicts across 13 agents.&lt;/strong&gt; go-expert: 15 CONFIRMED / 1 PARTIAL / 0 REFUTED (trust 0.952). deep-reviewer: 7 CONFIRMED / 3 PARTIAL / 1 REFUTED. &lt;strong&gt;3 REFUTED verdicts total&lt;/strong&gt; — proof the validator catches real mistakes, not rubber-stamps.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Challenger activity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Challenger received &lt;strong&gt;10 real challenges to synthesize against&lt;/strong&gt;, including 4 challenges to CTO's own recommendations. Adversarial review is genuinely adversarial.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hiring pipeline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ran &lt;strong&gt;end-to-end for its first real hire on 2026-04-19&lt;/strong&gt;: &lt;code&gt;elixir-kernel-engineer&lt;/code&gt;, with CTO Path A/B adjudication, documented-waiver on 0.365 composite confidence, and probation gates configured. See Innovation 3.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Shadow Mind telemetry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Observer daemon active&lt;/strong&gt; (7,228 observations captured, fresh heartbeat). &lt;strong&gt;Pattern Computer&lt;/strong&gt; derived 154 transitions across 35 sessions. &lt;strong&gt;Oracle queries returning structured MEDIUM/HIGH confidence&lt;/strong&gt; on in-domain questions and correctly reporting &lt;code&gt;INSUFFICIENT_DATA&lt;/code&gt; on under-observed domains. Three real oracle consultations shipped actionable findings that shaped challenger-gate scope.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Signal bus throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;506 entries across 5 weeks&lt;/strong&gt; (~100/week): 138 memory-handoffs, 126 NEXUS syscalls, 59 cross-agent flags, 35 evolution signals. Coordination is disciplined, not unbounded.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Domain breadth (within N=1)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;~15 technical domains exercised with real memory depth&lt;/strong&gt;: api-expert 524 KB, database-expert 260 KB, infra-expert 240 KB, ai-platform-architect 172 KB, cluster-awareness 168 KB, go-expert 132 KB, test-engineer 132 KB, devops-greenfield-engineer 124 KB, elite-engineer 84 KB, observability-expert 72 KB, typescript-expert 36 KB, frontend-platform-engineer 32 KB, security-engineer 24 KB, BEAM stack (3 agents, 52 KB combined), python-expert 12 KB.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Contract tests&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;341/341 passing on every commit (structural validation)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The thin spot in real data: &lt;code&gt;python-expert&lt;/code&gt; has only 12 KB / 1 memory file, vs go-expert at 132 KB / 15 files. If you adopt this team for heavy Python work, expect the Python-specific behavioral calibration to be thinner than the Go equivalent until your sessions generate that data.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Team, Running Live
&lt;/h2&gt;

&lt;p&gt;Numbers in a table are one thing. Here's what the team actually looks like mid-session — 25+ teammates dispatched in parallel across domains, each with its own token budget, tool-use count, and live status:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftnq8gm1xa1bdnktfjfnk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftnq8gm1xa1bdnktfjfnk.png" alt="Live team roster during a multi-domain dispatch session — 25+ teammates running in parallel across BEAM, IAM, federation, and meta-cognitive layers. Each row shows live token burn, tool-use count, and idle/editing status." width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see the dispatch taxonomy playing out in real names: &lt;code&gt;@cto-v5-phase0&lt;/code&gt; editing strategy, &lt;code&gt;@challenger-dp-scaffold&lt;/code&gt; adversarially reviewing a scaffold proposal, &lt;code&gt;@ev-iam1&lt;/code&gt; through &lt;code&gt;@ev-iam5&lt;/code&gt; (five parallel evidence-validator instances verifying IAM claims), &lt;code&gt;@memcoord-pattern-f-apr19&lt;/code&gt; draining Pattern F into memory, &lt;code&gt;@meta-5hire-register&lt;/code&gt; atomically registering the fifth hire of the day, &lt;code&gt;@oracle-phase0-preflight&lt;/code&gt; (the Shadow Mind's intuition-oracle) doing a preflight check, &lt;code&gt;@sentinel-session-end-ap1m&lt;/code&gt; closing out the session.&lt;/p&gt;

&lt;p&gt;Every one of those agents is backed by a contract-tested prompt, a per-agent memory directory, and a trust-ledger entry. The naming convention (&lt;code&gt;@&amp;lt;role&amp;gt;-&amp;lt;scope&amp;gt;-&amp;lt;date&amp;gt;&lt;/code&gt;) is how we keep 25 parallel teammates traceable — you can reconstruct which session, which role, and which work-item any finding came from just by reading the instance name.&lt;/p&gt;

&lt;p&gt;And here's the dispatch surface from the main thread — how you actually hand work to the team:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgiw32nf4w7uv4pkg6uzt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgiw32nf4w7uv4pkg6uzt.png" alt="Main-thread dispatch — @-mention syntax to route work to specific teammates or squads. Shown: concurrent dispatch to @main, @dge-day1-audit, @api-dp-federation, @ch-ekr-1, @challenger-dp-scaffold, @cto-v5-phase0, and @dp-wk1-scaffold." width="800" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;@main&lt;/code&gt; prefix addresses the kernel; everything after it is a teammate (or squad) receiving a directed message. This is the user-facing surface of the NEXUS protocol — the &lt;code&gt;SendMessage&lt;/code&gt; layer compiled down to something a human can drive from a single prompt line.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Caveats
&lt;/h2&gt;

&lt;p&gt;Now the part most blog posts skip.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The system is structurally rigorous, and after 5 weeks of production operation we have meaningful outcome data.&lt;/strong&gt; But it's still one codebase. Sustained multi-team, multi-codebase behavior is unproven.&lt;/p&gt;

&lt;p&gt;Specific known weaknesses at time of publication:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost profile is opus-heavy.&lt;/strong&gt; 28/31 agents default to opus. A typical non-trivial session costs $5–20. Cost-sensitive adopters should expect to either negotiate committed-use pricing or fork a cost-conscious variant that runs most agents on sonnet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contract tests are structural, not a regression suite.&lt;/strong&gt; 341/341 passing means every agent has the right sections, not that every agent produces the right findings. The 67 trust-ledger verdicts above are behavioral validation — but not a deterministic regression suite you can run in CI to catch prompt-drift. Closing that gap is a v2 priority.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;N=1 by codebase, N=15 by domain.&lt;/strong&gt; The team was developed and refined against a single production codebase. Multi-codebase, multi-language, multi-team behavior is unproven. The domain-breadth column above shows where the per-agent behavioral record is deep (api-expert, infra-expert) vs thin (python-expert, security-engineer). Adopters should expect sharper edges on thin-data domains until their sessions populate those agents' records.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt tonnage is real.&lt;/strong&gt; ~1.3 MB of agent prompts total, with 10 agents over 50 KB each (CTO at 103 KB). Arguably this is "distributed invariant insurance" (fault-isolation), arguably it's bloat. We lean toward the former, but it's measurable cost per dispatch. Prompt decomposition (core + lazy-loaded reference tables) is a v2 priority.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coordination overhead is bounded but under-documented.&lt;/strong&gt; 506 signal-bus entries over 5 weeks is disciplined, not pathological. But the triviality heuristic ("skip TeamCreate for trivial work") is under-specified. Expect some over-teaming on small tasks until dispatch-class taxonomy lands in v2.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We also have a verified self-review mechanism where the team's &lt;code&gt;challenger&lt;/code&gt; agent attacks the reviewer's own reasoning. When I wrote an initial version of this post's "honest caveats" section, &lt;code&gt;challenger&lt;/code&gt; caught me in a self-serving humility bias — inflating weakness counts to appear calibrated. The revision you're reading benefited from that stress test. Meta-cognition isn't free, but it's occasionally priceless.&lt;/p&gt;

&lt;p&gt;Full architecture diagrams (editable Mermaid source + ASCII fallbacks): &lt;a href="//ARCHITECTURE_DIAGRAMS.md"&gt;&lt;code&gt;ARCHITECTURE_DIAGRAMS.md&lt;/code&gt;&lt;/a&gt; — adapt for your own slide decks or talks.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Under the Hood
&lt;/h2&gt;

&lt;p&gt;Technical stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model&lt;/strong&gt;: Claude (via &lt;a href="https://claude.com/claude-code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration&lt;/strong&gt;: Custom multi-agent coordination layer (&lt;code&gt;NEXUS&lt;/code&gt; protocol) on top of Claude Code's subagent primitives&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Language&lt;/strong&gt;: Markdown for agent prompts, Python 3 for Shadow Mind scripts, Bash for hooks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt;: File-based persistence (&lt;code&gt;.claude/agent-memory/&lt;/code&gt;), ~298 memory files across 31 agents at time of writing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trust Ledger&lt;/strong&gt;: Python CLI with Bayesian-blended trust weighting, &lt;code&gt;status&lt;/code&gt; field (probationary / active / retired), verdict + challenge history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing&lt;/strong&gt;: Python test runner validating 11 structural contracts per agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hooks&lt;/strong&gt;: Pre-commit (contract tests) + SubagentStop (protocol verification) + PostToolUse (NEXUS syscall logging) + optional post-hire-verify&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lines of code/prompts&lt;/strong&gt; (approximate):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent prompts: ~1.3 MB across 31 files&lt;/li&gt;
&lt;li&gt;Infrastructure (hooks, tests, ledger, Shadow Mind scripts): ~100 KB&lt;/li&gt;
&lt;li&gt;Documentation (CLAUDE.md + docs/team/): ~80 KB&lt;/li&gt;
&lt;li&gt;Template files: ~25 KB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Size metrics that matter more&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;341 contract assertions passing&lt;/li&gt;
&lt;li&gt;0 files with stale roster references&lt;/li&gt;
&lt;li&gt;0 coupling between conscious and unconscious layers (verified)&lt;/li&gt;
&lt;li&gt;3 independent write-authority invariants preserved (&lt;code&gt;meta-agent&lt;/code&gt; over agents, &lt;code&gt;memory-coordinator&lt;/code&gt; over cross-agent synthesis, &lt;code&gt;cto&lt;/code&gt; over strategic arbitration)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why This Works on Claude Specifically
&lt;/h2&gt;

&lt;p&gt;Quick shout to the &lt;a href="https://claude.com/claude-code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; team — this system isn't trivially portable to other platforms, and I want to be specific about why.&lt;/p&gt;

&lt;p&gt;Three Claude Code primitives that made this buildable:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multi-agent primitives (subagents + SendMessage + TeamCreate)&lt;/strong&gt; — Not every LLM platform has first-class support for this. Claude Code's team system with message-routing between named instances is what makes NEXUS implementable without bespoke orchestration code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-context windows on Claude Opus 4.7 (1M)&lt;/strong&gt; — Lets agents carry full memory briefs + 30-agent roster tables + capability domains without truncation. On shorter-context models, my entire architecture would collapse. I tried smaller models. They don't hold the role distinctions — 31 specialists become undifferentiated generalist soup inside ~50k tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool sandboxing with hooks&lt;/strong&gt; — Pre-commit, SubagentStop, PostToolUse hooks let me enforce contract-test and signal-persistence invariants &lt;em&gt;mechanically&lt;/em&gt;, not just by prompt instruction. This is the difference between "we hope agents follow the protocol" and "agents that break the protocol get blocked at the commit." It's the single Claude Code primitive I'd lobby harder for other platforms to adopt.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.anthropic.com" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt; team&lt;/strong&gt;, if you're reading: half-joke, half-serious invite — I'd love your engineering eyes on this. Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the NEXUS syscall pattern look sane from your perspective on Claude Code primitives, or am I torturing something?&lt;/li&gt;
&lt;li&gt;Is there a cleaner way to enforce the single-writer invariant for agent prompts, or is my pattern roughly what you'd recommend?&lt;/li&gt;
&lt;li&gt;The Shadow Mind's observer-daemon uses &lt;code&gt;Monitor&lt;/code&gt; with &lt;code&gt;persistent=true&lt;/code&gt; — is that the right primitive for long-lived background processes, or did I miss a better one?&lt;/li&gt;
&lt;li&gt;The session-pinned subagent registry (new agents need restart to be dispatchable) — is that a known limitation or is there a refresh pattern I missed?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Any of those would be gold to hear opinions on. Come poke at the repo. PRs, roasts, "this is a terrible idea because X" — all land equally well.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Priorities for the next phase:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Outcome tracking&lt;/strong&gt; — currently we measure reasoning quality (via &lt;code&gt;evidence-validator&lt;/code&gt; and &lt;code&gt;challenger&lt;/code&gt;), but not downstream production outcomes. Closing that gap means tagging every deploy with a tracking ID and feeding real metrics back into the trust ledger.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More real hires&lt;/strong&gt; — the &lt;code&gt;talent-scout&lt;/code&gt; → &lt;code&gt;recruiter&lt;/code&gt; → &lt;code&gt;meta-agent&lt;/code&gt; pipeline has run end-to-end once (the &lt;code&gt;elixir-kernel-engineer&lt;/code&gt; hire); repeating it on further coverage gaps is what converts the pipeline from proven-once to proven.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive depth router&lt;/strong&gt; — right now dispatching the CTO for a typo fix is overkill. A complexity-scoring router that picks minimum-viable-agent-set before dispatch would make the team genuinely usable in product contexts, not just engineering-internal ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shadow Mind usage measurement&lt;/strong&gt; — track how often &lt;code&gt;[NEXUS:INTUIT]&lt;/code&gt; is consulted organically over the next 20 sessions. If it's zero, we learn something. If it's frequent, we learn something else.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The team can evolve toward all of these using its own infrastructure. &lt;code&gt;meta-agent&lt;/code&gt; has the authority to propose prompt evolutions based on observed patterns. &lt;code&gt;session-sentinel&lt;/code&gt; tracks protocol compliance over time. The trust ledger accumulates calibration data on every dispatch. The Shadow Mind's Dreamer already proposed one of these additions (outcome tracking) during its first run — we just haven't built it yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub repo (Full version for complex engineering)&lt;/strong&gt;: &lt;a href="https://github.com/asiflow/claude-nexus-hyper-agent-team" rel="noopener noreferrer"&gt;https://github.com/asiflow/claude-nexus-hyper-agent-team&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub repo (Light version for cost optimization)&lt;/strong&gt;: &lt;a href="https://github.com/asiflow/claude-nexus-hyper-agent-team-light" rel="noopener noreferrer"&gt;https://github.com/asiflow/claude-nexus-hyper-agent-team-light&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you get if you clone&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;31 agent files (~1.3 MB of production-calibrated prompts)&lt;/li&gt;
&lt;li&gt;Full infrastructure (hooks, tests, trust ledger, Shadow Mind scripts)&lt;/li&gt;
&lt;li&gt;Complete documentation (CLAUDE.md, TEAM_OVERVIEW, TEAM_RUNBOOK, TEAM_SCENARIOS, AGENT_TEMPLATE)&lt;/li&gt;
&lt;li&gt;Passing contract test suite (341/341)&lt;/li&gt;
&lt;li&gt;Verified disable-ability invariants&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What you need&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://claude.com/claude-code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; installed&lt;/li&gt;
&lt;li&gt;A Claude API key (Anthropic)&lt;/li&gt;
&lt;li&gt;Python 3 for scripts&lt;/li&gt;
&lt;li&gt;Interest in multi-agent systems with actual engineering discipline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The repo is structured for direct installation: clone, copy the contents to your own &lt;code&gt;.claude/&lt;/code&gt; directory, run the contract tests to verify, and dispatch the &lt;code&gt;cto&lt;/code&gt; agent to get started.&lt;/p&gt;
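&lt;p&gt;In shell terms, that flow might look like the dry-run sketch below. The &lt;code&gt;.claude/&lt;/code&gt; layout and the contract-test command are assumptions on our part; check the repo README for the exact paths.&lt;/p&gt;

```shell
# Dry-run sketch of the install flow; each step is printed, not executed.
# The .claude/ destination and test command are assumptions, not verified paths.
set -e
REPO="https://github.com/asiflow/claude-nexus-hyper-agent-team"
run() { echo "+ $*"; }   # swap for real execution once paths are confirmed
run git clone "$REPO"
run cp -r claude-nexus-hyper-agent-team/.claude/ "$HOME/.claude/"
run python3 -m pytest claude-nexus-hyper-agent-team/tests/
run claude               # then dispatch the cto agent from the session
```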

&lt;p&gt;We'd love contributions, especially:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New specialist agent templates for domains we don't cover&lt;/li&gt;
&lt;li&gt;Additional Shadow Mind scripts (e.g., a domain-specific Speculator variant)&lt;/li&gt;
&lt;li&gt;Better benchmarks for agent-quality measurement (the gap we're honest about)&lt;/li&gt;
&lt;li&gt;War stories from your own usage&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Built by&lt;/h2&gt;

&lt;h3&gt;Core builders&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Photo&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Background&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fealmdnmuympm95aob190.jpg" width="701" height="701"&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Sherief Attia&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CTO &amp;amp; Co-founder&lt;/td&gt;
&lt;td&gt;Visionary AI entrepreneur and software architect with 20+ years leading and scaling $100M+ ventures in renewable energy, IoT, and telecoms.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsz404p0ms5hg04e827b.png" width="800" height="798"&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Khaled Elazab&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Chief of AI Strategies &amp;amp; Co-founder&lt;/td&gt;
&lt;td&gt;Technical Director and Senior Software Engineer with 5+ years leading teams across healthcare, education, real estate, and AI.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbp7jbkmni4yzhsg4zx7g.png" width="800" height="800"&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Hossam Hegazy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Chief of Engineering &amp;amp; Co-founder&lt;/td&gt;
&lt;td&gt;Skilled AI systems engineer and software architect with a passion for building scalable multi-agent systems.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Sherief: &lt;a href="https://www.linkedin.com/in/sheriefattia/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; / &lt;a href="https://github.com/SheriefAttia" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Khaled: &lt;a href="https://www.linkedin.com/in/ikhaled-elazab/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; / &lt;a href="https://github.com/ikhaled-elazab" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Hossam: &lt;a href="https://www.linkedin.com/in/hossam-hegazy-269745a4/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;We're building &lt;a href="https://asiflow.ai" rel="noopener noreferrer"&gt;&lt;strong&gt;ASIFlow&lt;/strong&gt;&lt;/a&gt; — the future of Artificial General Intelligence with enterprise-grade reliability, security, and scale.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Join the ASIFlow waitlist: &lt;a href="https://asiflow.ai/waitlist" rel="noopener noreferrer"&gt;&lt;strong&gt;asiflow.ai/waitlist&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Source (full version): &lt;a href="https://github.com/asiflow/claude-nexus-hyper-agent-team" rel="noopener noreferrer"&gt;https://github.com/asiflow/claude-nexus-hyper-agent-team&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Source (light version): &lt;a href="https://github.com/asiflow/claude-nexus-hyper-agent-team-light" rel="noopener noreferrer"&gt;https://github.com/asiflow/claude-nexus-hyper-agent-team-light&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Questions, contributions, critiques, hot takes: open an issue on GitHub or DM us on LinkedIn&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Feedback from the &lt;a href="https://www.anthropic.com" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt; team especially welcome.&lt;/strong&gt; This system lives inside Claude Code and pushes several primitives — subagents, &lt;code&gt;SendMessage&lt;/code&gt;, hooks, long context — harder than we've seen elsewhere. If something's architecturally off, we'd rather hear it from you than find out later. PRs and roasts equally welcome.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;strong&gt;Tags&lt;/strong&gt;: #AIAgents #ClaudeAI #MultiAgent #LLMEngineering #OpenSource #AgentArchitecture #ClaudeCode #Anthropic&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>agents</category>
      <category>nexus</category>
    </item>
  </channel>
</rss>
