DEV Community: Jarrad Bermingham

I Mapped the AI Attack Surface Nobody Else Has: Introducing AAISAF

Jarrad Bermingham — Wed, 25 Mar 2026 11:17:38 +0000

Yesterday a supply chain attack hit litellm — 97 million monthly downloads. One pip install. SSH keys, AWS credentials, API tokens, git secrets, crypto wallets — all silently exfiltrated in under an hour.
This is TA05 in AAISAF — a framework I published today.

The Problem
Every company that deployed an AI system in 2023–2025 created an attack surface their security team has never seen.
Prompt injection. RAG pipeline poisoning. Agent-to-agent manipulation. MCP server exploitation. Voice AI bypass. Supply chain attacks on AI dependencies.
Existing frameworks tell you what to worry about. Nobody told you how to actually test for it.

OWASP LLM Top 10 — vulnerability categories, no testing methodology
MITRE ATLAS — adversary mapping, no practitioner guidance
NIST AI RMF — governance structure, no attack techniques

We built the missing layer.
What AAISAF Is
AAISAF (AI Security Assessment Framework) is an open-source, technique-level methodology for assessing AI system security.
Structured like MITRE ATT&CK — tactic → technique → sub-technique — applied to AI systems.

10 tactic categories
87 assessment techniques
9 domain checklists
6 compliance framework mappings
3 assessment types (30min / 1-2 day / 5-10 day)
5-level maturity model

Each technique includes attack description and prerequisites, AISS severity score (0.0–10.0), detection guidance, remediation steps, and mandatory evidence (CVE, documented incident, or peer-reviewed research).

Two Attack Surfaces With Zero Prior Coverage
TA10 — MCP Server & Tool Security
Model Context Protocol is Anthropic's standard for connecting AI to external tools. Released November 2024. Now the de facto integration standard with thousands of production deployments globally.

CVE-2025-6514 (CVSS 9.6). 1,467 exposed servers on the internet. Zero frameworks covering it.

We built 12 techniques:
MCP Attack Surface
├── Tool Poisoning via Malicious Description (AISS 8.1)
├── Rug Pull Attack (AISS 8.4)
├── Tool Shadowing (AISS 8.0)
├── Cross-Origin Injection via MCP Resource (AISS 8.3)
├── Privilege Escalation via Tool Chain (AISS 8.7)
├── SSRF via MCP Tool (AISS 7.2)
├── Data Exfiltration via Tool Output (AISS 7.5)
├── MCP Auth Bypass (AISS 9.1)
├── Malicious Server Registration (AISS 8.5)
├── Tool Argument Injection (AISS 7.0)
├── Transport Layer Exploitation (AISS 7.3)
└── Consent Fatigue Exploitation (AISS 5.8)

I run MCP servers in production as part of a 13-agent AI orchestration system. These techniques came from understanding the architecture from the inside.
TA06 — Voice AI Exploitation
Millions of AI phone agents handle customer calls daily across healthcare, finance, customer service, and sales. Real-time. Autonomous. Trusted by default because it sounds human.
No security framework had mapped the attack techniques against them.

9 techniques:
Voice AI Attack Surface
├── Voice Prompt Injection via Speech (AISS 7.0)
├── Synthetic Voice Spoofing / Deepfake (AISS 8.5)
├── Conversation Flow Bypass (AISS 5.5)
├── Audio Adversarial Examples (AISS 7.2)
├── Credential Harvesting via Voice Agent (AISS 8.3)
├── DTMF Signal Injection (AISS 6.8)
├── Voice Agent Vishing (AISS 8.7)
├── STT Pipeline Exploitation (AISS 5.8)
└── Real-Time Voice Cloning in Active Call (AISS 9.0)

I build and operate production voice agents on Retell AI infrastructure. Every technique here comes from first-hand knowledge of where these systems break.

The AISS Scoring System
Standard CVSS doesn't capture AI-specific risk dimensions.
We built AISS — AI Impact Severity Score — CVSS-compatible 0.0–10.0 with five additional metrics:

Autonomy Impact — can this attack trigger autonomous harmful action?
Cascade Potential — single agent to system-wide propagation risk?
Persistence — ephemeral to permanent compromise?
Data Sensitivity Exposure — what does the attacker actually access?
Financial Impact Potential — direct and indirect loss estimation

Every one of the 87 techniques is scored. Boards understand it. Compliance teams can document against it.
Compliance Mappings
Every technique maps to:

OWASP LLM Top 10 (2025)
MITRE ATLAS
NIST AI RMF + AI 600-1 (GenAI Profile)
ISO/IEC 42001
EU AI Act — high-risk system requirements hit August 2026 (5 months)
Australian Privacy Act, Essential Eight, VAISS/AI6, SOCI Act

Quick Start
bashgit clone https://github.com/Jbermingham1/aaisaf


1. Identify your system type (A–G: chatbot, RAG, agentic, multi-agent, voice, MCP, composite)
2. Choose your assessment type (30-min / standard / deep)
3. Work through the relevant checklists
4. Score findings using AISS
5. Map to compliance requirements
6. Report

## Repository Structure

aaisaf/
├── framework/
│ ├── tactics/ # 10 tactic overviews with attack trees
│ ├── techniques/ # 87 individual technique files
│ ├── compliance/ # 6 compliance mapping documents
│ └── maturity/ # 5-level maturity model
├── assessments/
│ ├── checklists/ # 9 domain checklists
│ └── scoring/ # AISS specification and templates
└── references/
├── glossary.md
├── cve-index.md
└── bibliography.md

What's Next
ares-scanner — open-source tooling that automates the AAISAF methodology. The framework tells you what to test. The scanner runs the tests.
Contributions welcome. If you've encountered AI attack techniques not in the framework — open a PR. The goal is for this to become the living standard the community maintains.
CC BY-SA 4.0. Free forever. No vendor pitch.

GitHub: github.com/Jbermingham1/aaisaf

Anthropic Just Showed Us the Biggest Blind Spot in AI Adoption (2M Conversations Analysed)

Jarrad Bermingham — Fri, 13 Mar 2026 11:36:20 +0000

Last week, Anthropic published their most comprehensive analysis yet of how AI is actually being used in the economy. Not projections. Not hype. Real data from over two million Claude conversations, mapped against the entire US occupational database.
The headline finding stopped me mid-scroll:

AI can theoretically handle 94% of tasks in computer and mathematical roles. People are using it for 33%.

That's not a technology gap. That's an adoption gap. And it's the single biggest efficiency blind spot in business today.
Let me walk you through what the data actually says, what it means for your business, and what I'm seeing on the ground as someone who helps companies close exactly this kind of gap.

The Data: Theoretical Capability vs. Actual Usage
Anthropic's research combines two things that are rarely measured together:

Theoretical capability — what percentage of an occupation's tasks could an LLM theoretically speed up or perform
Observed usage — what people are actually using Claude for in real-world professional settings

The gap between these two numbers tells the real story.
By Occupation Category
CategoryTheoreticalObservedComputer & Mathematical94%33%Office & Administrative~90%A fractionBusiness & Finance~85%Barely scratched
Computer and math tasks make up roughly one-third of Claude.ai conversations and nearly half of API traffic — yet they're barely scratching the surface of what's possible. Office and admin roles, where 90% of tasks are theoretically automatable, are barely registering.

The 10 Most Exposed Roles

These aren't warehouse workers or truck drivers. Every single role on this list sits in a corporate office. They're knowledge workers — many of them your highest-paid employees.

The Numbers That Should Worry Every Business Leader
The earnings gap is inverted. Workers in the most AI-exposed occupations earn 47% more on average than workers with zero exposure. They're nearly 4x as likely to hold graduate degrees. The people most affected by AI aren't the ones businesses usually worry about protecting — they're the ones with the biggest salaries.
Young workers are already feeling it. The research found a 14% drop in job-finding rates for 22-to-25-year-olds entering AI-exposed occupations post-ChatGPT compared to 2022. Entry-level positions in knowledge work are quietly contracting before it shows up in unemployment numbers.
30% of workers have zero AI exposure. Cooks, mechanics, bartenders, lifeguards — roles requiring physical presence remain untouched. The divide between "AI-exposed" and "AI-proof" jobs is becoming a fault line in the labour market.

What This Actually Means for Businesses
Here's what I find most striking: this isn't an AI problem. It's a management problem.
The tools already exist. Claude, GPT-4, Gemini — they can handle the vast majority of tasks in knowledge work today. The 94% theoretical coverage in computer and math roles isn't aspirational. It's current capability.
So why is observed usage stuck at 33%?
From what I see working with businesses, four patterns explain the gap:

No Systematic Deployment Framework Most companies have a ChatGPT subscription and a vague encouragement to "use AI more." That's it. No mapping of which workflows benefit most. No standardised prompts. No integration into existing toolchains. People experiment individually, hit a wall, and go back to doing things the old way.
Individual Experimentation vs. Team-Wide Integration One person on the team discovers that AI can draft their reports in 20 minutes instead of 3 hours. They keep doing it quietly. Nobody else on the team knows. There's no mechanism to share what works, standardise it, or scale it.
The Measurement Problem Nobody is tracking time saved. If you asked most managers "How much time does your team save using AI tools?" they'd shrug. Without measurement, there's no business case for expansion. Without a business case, there's no budget for proper deployment. The gap perpetuates itself.
The "Waiting for Better" Trap I hear this constantly: "We'll invest properly when AI gets better." Meanwhile, the research shows that 97% of observed Claude tasks already fall into categories where AI is theoretically capable. And 68% involve tasks rated as fully feasible for an LLM to handle alone. The capability is here. The deployment isn't.

What I'm Seeing in the Field
I work as a Fractional Head of AI for SMEs — businesses that know they need to move on AI but don't have the in-house expertise to do it systematically. What the Anthropic data confirms matches what I see every engagement.
The pattern is remarkably consistent:
Before a systematic audit, most businesses estimate they're using AI for maybe 40-50% of what's possible. The actual number, once you map their workflows against what AI can handle today, is usually closer to 15-20%.
The biggest gaps are almost never where leaders expect. Everyone thinks about AI for content generation and coding. The real untapped value is in the mundane: data processing, report summarisation, customer communication drafts, internal knowledge retrieval, meeting preparation, and compliance documentation. Tasks nobody thinks of as "AI tasks" because they've always been done manually.
The fastest wins come from the boring stuff. A customer service team that implements AI-assisted response drafting sees measurable time savings in the first week. A finance team that uses AI for initial report drafting cuts their month-end close by days, not hours.
The companies pulling ahead aren't the ones with the fanciest tools. They're the ones with a framework: audit, deploy, measure, iterate.

A Framework for Closing the Gap
If the Anthropic data has you thinking "we're probably on the wrong side of this gap," here's where to start.
Step 1: Audit Your Workflows
Map your team's actual tasks against AI capabilities. For each role, ask: what does this person spend time on every day, and which of those tasks could AI meaningfully accelerate?
Be specific. "Marketing" isn't a task. "Writing first drafts of product descriptions based on feature specs" is.
Step 2: Run a Focused Pilot
Pick the 3 highest-impact workflows from your audit. "Highest impact" means: done frequently, time-consuming, and involving tasks AI handles well — writing, analysis, data processing, summarisation.
Give your team structured prompts and workflows. Not "here's a ChatGPT login, figure it out." Actual documented processes for how AI fits into each workflow. Two weeks is enough to get meaningful data.
Step 3: Measure Relentlessly
Track time-to-completion before and after. Track output quality. Track team adoption rates. Build the business case with real numbers from your own organisation, not vendor promises.

The measurement step is where most companies fail and most pilots die. Don't let it.

Step 4: Scale What Works
Take the workflows that proved out in the pilot, document them properly, train the full team, and integrate them into your standard operating procedures.
Then go back to Step 1 and audit the next layer of workflows. The gap is big enough that most businesses can run this cycle 3-4 times before they even approach the frontier of what's possible.

The Window Is Open
The Anthropic data makes one thing clear: there is an enormous gap between AI capability and AI adoption. That gap represents real, measurable efficiency sitting on the table right now.
But gaps close. As tools get easier, as competitors catch on, as the next generation of workers arrives expecting AI-native workflows — the advantage of being early narrows.
The companies that build their AI deployment framework now are compounding their advantage every month. The ones waiting for AI to "get better" are falling behind at the same rate.
The technology is ready. The data proves it. The question is whether your organisation has the framework to actually use what's already available.

This analysis is based on Anthropic's Labor Market Impacts research paper (March 2026) and the Anthropic Economic Index (January 2026), which together analysed over 2 million Claude conversations mapped against the US Bureau of Labor Statistics occupational database.

Jarrad Bermingham is the founder of Steadwise AI and works as a Fractional Head of AI, helping businesses close the gap between AI capability and actual adoption. Connect on LinkedIn.

I built a free alternative to LangSmith — one decorator, local SQLite, zero infrastructure

Jarrad Bermingham — Wed, 25 Feb 2026 10:03:16 +0000

LangSmith wants $400/month. Helicone needs you to proxy your AI traffic through their servers. Both require accounts, API keys, and sending your data to someone else's cloud.

I just wanted to know what my AI agents were costing me.

So I built bifrost-monitor — a Python decorator that tracks every AI call locally. No accounts. No infrastructure. No data leaving your machine.

Here's the full setup:

from bifrost_monitor import monitor

@monitor(name="support-agent", model="claude-sonnet-4-6")
async def handle_ticket(ticket):
    response = await client.messages.create(...)
    return response

That's it. Every call gets tracked — duration, tokens, cost, errors — stored in a local SQLite file.

Why I Built This

I was running multiple AI agents in production. Some used Claude, some used GPT-4o, one used Gemini. I had zero visibility into what any of them cost.

The existing options felt wrong:

	LangSmith	Helicone	bifrost-monitor
Setup	Account + API key + proxy	Account + API proxy	pip install
Cost	$400/mo+	$50/mo+	Free
Data location	Their cloud	Their cloud	Your machine

I didn't need a dashboard. I needed a decorator and a CLI.

How It Works

The Decorator

The @monitor decorator wraps your function without changing it. Sync or async — it detects automatically:

@monitor(name="classifier", model="gpt-4o")
def classify_email(email: str) -> str:
    response = client.chat.completions.create(...)
    return response.choices[0].message.content

@monitor(name="summarizer", model="claude-sonnet-4-6")
async def summarize_doc(doc: str) -> str:
    response = await anthropic_client.messages.create(...)
    return response.content[0].text

Under the hood it:

Times execution with time.perf_counter()
Auto-extracts token counts from the response object (duck-typed — works with Anthropic and OpenAI responses)
Calculates cost using built-in pricing
Records everything to SQLite
Re-raises any exceptions after recording them

The function behaves identically. Zero code changes to your business logic.

Auto Token Extraction

This was the part I'm most pleased with. The decorator inspects your function's return value and duck-type detects token usage:

# Anthropic responses → extracts:
#   usage.input_tokens
#   usage.output_tokens
#   usage.cache_read_input_tokens    (prompt caching)
#   usage.cache_creation_input_tokens

# OpenAI responses → extracts:
#   usage.prompt_tokens
#   usage.completion_tokens

If your function returns something without a .usage attribute, it still tracks everything else — duration, status, errors. Tokens just show as zero.

Built-in Pricing

13 models ship with current pricing (as of mid-2025):

Anthropic — Claude Opus 4.6, Sonnet 4.6, Haiku 4.5 (including cache token rates)
OpenAI — GPT-4o, GPT-4o-mini, GPT-4.1, GPT-4.1-mini, GPT-4.1-nano
Google — Gemini 2.5 Pro, Gemini 2.5 Flash

Custom models are one call:

from bifrost_monitor import ModelPricing

pricing = ModelPricing()
pricing.add_model("my-fine-tune",
    input_per_m=5.0,
    output_per_m=15.0,
    cache_read_per_m=0.5,       # optional
    cache_creation_per_m=1.0,   # optional
)

Cost calculations use 8-decimal precision — accurate down to fractions of a cent across thousands of calls.

The CLI

Query everything from your terminal:

# What am I spending, broken down by model?
$ bifrost-monitor costs --group-by model

# Which agents are failing?
$ bifrost-monitor errors --last 7d

# Full summary for a specific agent
$ bifrost-monitor summary --name support-agent --last 24h

# Recent runs with status
$ bifrost-monitor runs --last 24h --status error

Output is color-coded Rich tables — green for success, red for errors, yellow for timeouts.

Architecture Decisions

Pluggable Storage

Storage is behind a protocol:

@runtime_checkable
class RunStore(Protocol):
    def save(self, record: RunRecord) -> None: ...
    def query(self, **kwargs: Any) -> list[RunRecord]: ...

SQLite is the default (zero-config, stored at ~/.bifrost-monitor/runs.db). But the protocol means you could plug in PostgreSQL, DynamoDB, or anything else without changing a line of application code.

Pydantic Models, Not Dicts

Every data structure is a Pydantic model:

class TokenUsage(BaseModel):
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0
    cache_creation_tokens: int = 0

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

No untyped dictionaries floating around. Pyright strict mode passes with zero errors.

Three-Index SQLite Schema

The database indexes name, started_at, and status — the three fields you filter on most. Queries stay fast even with thousands of recorded runs.

What I Learned

Prompt caching changes the cost math. Claude's cache tokens are 10x cheaper than standard input tokens. If you're not tracking cache hit rates, you're probably overestimating your costs. bifrost-monitor tracks cache_read_tokens and cache_creation_tokens separately so you can see the real numbers.

The decorator pattern is underrated for observability. Zero changes to the monitored function. No inheritance, no mixins, no context managers wrapping your code. Just @monitor and you're done.

Property-based testing catches edge cases you won't think of. I used Hypothesis to verify that cost calculations are always non-negative, monotonically increasing with token count, and consistent across cache/non-cache scenarios. Three property tests caught two bugs that unit tests missed.

Testing

99 tests. 95% coverage. 0.41 seconds.

The test suite includes:

Unit tests for pricing accuracy (including cache token math)
Decorator tests for both sync and async functions
Token extraction tests against mock Anthropic and OpenAI response objects
Property-based tests (Hypothesis) for cost calculation invariants
Integration tests for the full decorator → storage → query pipeline

Try It

pip install bifrost-monitor

from bifrost_monitor import monitor

@monitor(name="my-agent", model="claude-sonnet-4-6")
async def my_agent(input: str):
    response = await client.messages.create(...)
    return response

# Later:
# $ bifrost-monitor costs --group-by model

Full source on GitHub — MIT licensed, 99 tests, typed with py.typed marker.

If you're running AI agents and don't know what they cost, this is the fastest way to find out. One import, five minutes, full visibility.

This is the 9th open-source package I've shipped under github.com/Jbermingham1 — each one solves a specific pain point I hit building AI systems in production.

The Wrapper Trap: Why Most Enterprise AI Projects Fail Before They Start

Jarrad Bermingham — Thu, 12 Feb 2026 04:10:02 +0000

I've assessed the AI readiness of 4 mid-market enterprises, analyzing 214+ repositories and hundreds of architecture decisions. The same anti-pattern appears in every single one.

I call it the Wrapper Trap.

What Is the Wrapper Trap?

The Wrapper Trap is when a company's "AI initiative" is a thin wrapper around an LLM API — typically OpenAI's chat completions endpoint — with no evaluation, no pipeline architecture, and no data integration.

It looks like this:

# The entire "AI feature"
response = openai.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": user_input}]
)
return response.choices[0].message.content

That's it. The company's AI roadmap is a single API call behind a UI.

Why It's a Trap

The Wrapper Trap feels productive. You ship something fast. The demo looks impressive. Leadership sees "AI" in the product.

But three things happen:

1. No Evaluation = No Improvement

Without measurement, you can't improve. When every response comes from a black box with no scoring, no retrieval metrics, no user feedback loop — you have no idea if your "AI feature" is working.

I've seen companies run wrapper-based AI features for 6+ months with zero measurement of answer quality.

2. No Data Integration = No Moat

A wrapper doesn't use your data. It uses OpenAI's training data. Which means any competitor can build the exact same thing in an afternoon.

The companies that build defensible AI products integrate their proprietary data: customer interactions, domain-specific knowledge bases, internal processes. That requires RAG pipelines, embedding strategies, and evaluation harnesses — not a single API call.

3. Scaling Costs Explode

Wrappers send the entire context every time. No caching, no chunking, no retrieval optimization. When usage scales 10x, costs scale 10x.

Production AI systems use vector retrieval to send only relevant context. A well-built RAG pipeline can reduce token costs by 60–80% while improving answer quality.

The Other Anti-Patterns

The Wrapper Trap is the most common, but it's not alone. Across 214+ repos, I've identified a consistent pattern set:

The Island Problem

AI features built in isolation from each other. Company has 3 teams each building their own OpenAI integration with their own prompt library, their own error handling, and zero shared infrastructure.

Cost: Duplicated engineering effort, inconsistent user experience, no knowledge sharing.

The Prompt-Only Architecture

All intelligence lives in the prompt. No tool use, no retrieval, no structured outputs. When the model changes or the prompt gets too long, everything breaks.

Cost: Fragile systems that degrade unpredictably with model updates.

The Dashboard Trap

Analytics dashboards that report on AI usage (API calls, tokens consumed, cost) but not AI performance (answer quality, user satisfaction, task completion rate).

Cost: Optimizing for the wrong metrics. Cost goes down, value goes down with it.

What Good Looks Like

The enterprises getting value from AI share common traits:

Pipeline architecture, not wrappers. Multiple agents with defined roles, shared context, and fault tolerance.
Evaluation from day one. Precision@K, recall, MRR — measured continuously, not as a one-time benchmark.
Data integration as a first-class concern. Vector stores, chunking strategies, embedding pipelines. Your data is your moat.
Shared AI infrastructure. One team owns the foundation (embedding service, evaluation harness, prompt library). Product teams build on top.
Measurable outcomes. Not "we added AI" but "answer quality improved 23% while token costs decreased 40%."

How to Escape

If you recognize the Wrapper Trap in your organization:

Step 1: Measure. Add evaluation to your existing AI features. Even simple metrics (user thumbs up/down, task completion rate) reveal whether your wrapper is delivering value.

Step 2: Retrieve. Build a retrieval pipeline for your domain data. ChromaDB locally, Pinecone for scale. Ground your AI in your data, not just the base model's training set.

Step 3: Evaluate. Build an evaluation harness. Track Precision@K, Recall@K, MRR. Know whether your retrieval is actually finding the right information.

Step 4: Orchestrate. Replace the single API call with a pipeline. Chunking → Retrieval → Generation → Evaluation. Each step measurable, each step improvable.

The Assessment Framework

At Bifrost Labs, I built the AI Readiness Scanner to automate this assessment across 8 dimensions:

Data readiness
Architecture maturity
Evaluation capability
Pipeline sophistication
Infrastructure (containerization, CI/CD)
Team capability
Integration depth
Governance and monitoring

The methodology behind the scanner identifies these anti-patterns from public signals — repository structure, dependency choices, architecture patterns, and documentation quality.

The 4 assessments delivered so far have identified $50K–$200K in automation opportunities per company. The biggest wins always come from escaping the Wrapper Trap.

I assess enterprise AI readiness at github.com/Jbermingham1. If you want to know where your organization stands, the assessment starts with your code — not a survey.

I Built a Framework for Multi-Agent MCP Servers in Python — Here's How

Jarrad Bermingham — Wed, 11 Feb 2026 13:48:23 +0000

Most MCP servers do one thing: wrap a single API call as a tool. But what if your tool needs multiple AI agents collaborating — analyzing, scoring, and reporting — before returning a result?
That's the problem I solved with agent-mcp-framework, an open-source Python library for building multi-agent MCP servers. Define agents, compose them into pipelines, and expose the whole thing as MCP tools that Claude, VSCode, or any MCP client can call.
Here's how it works and why I built it this way.

The Problem
I was building an internal tool that analyzes codebases — think automated code review with multiple specialized agents: one for quality issues, one for security vulnerabilities, one for architecture patterns, and one that combines everything into a scored report.
The MCP SDK gives you FastMCP for exposing tools, but there's no built-in way to:

Define reusable agent abstractions with lifecycle hooks
Compose agents into sequential, parallel, or conditional workflows
Handle errors gracefully across a multi-step pipeline
Format results consistently

I needed infrastructure. So I built it.

The Architecture
Agent (unit of work)
→ Pipeline (composition pattern)
→ AgentMCPServer (MCP exposure layer)
Three layers, each doing one thing.

Agents — The Building Blocks Every agent subclasses Agent and implements run(): pythonfrom agent_mcp_framework import Agent, AgentContext, AgentResult

class SecurityScanner(Agent):
async def run(self, context: AgentContext) -> AgentResult:
code = context.get("code", "")
findings = []

    dangerous_patterns = ["eval(", "exec(", "os.system("]
    for pattern in dangerous_patterns:
        if pattern in code:
            findings.append(f"Found {pattern} — potential injection risk")

    context.set("security_findings", findings)
    return AgentResult(
        success=True,
        output={"findings": findings, "count": len(findings)},
    )

The AgentContext is a shared data store that agents read from and write to. Each agent is self-contained — it pulls what it needs, does its work, and pushes results back.
There are three agent types:

Agent — subclass and implement run()
LLMAgent — built-in Anthropic client for Claude-powered agents
FunctionAgent — wrap any async function without subclassing

All agents get lifecycle hooks (before_run, after_run, on_error), automatic timing, and error handling for free.

Pipelines — The Composition Layer This is where it gets interesting. Four patterns: Sequential — agents run one after another, each seeing the updated context: pythonfrom agent_mcp_framework import SequentialPipeline

pipeline = SequentialPipeline("review", agents=[
QualityAnalyzer("quality"),
SecurityScanner("security"),
ReportGenerator("reporter"),
])
Parallel — agents run concurrently with isolated context copies (no race conditions), merged back after completion:
pythonfrom agent_mcp_framework import ParallelPipeline

All three analyze simultaneously, results merge

analysis = ParallelPipeline("analysis", agents=[
QualityAnalyzer("quality"),
SecurityScanner("security"),
ArchitectureReviewer("architecture"),
], max_concurrency=3)
Conditional — route to different agents based on context:
pythonfrom agent_mcp_framework import ConditionalPipeline

def router(ctx):
if ctx.get("language") == "python":
return "python-analyzer"
return "generic-analyzer"

pipeline = ConditionalPipeline("route", agents=[
PythonAnalyzer("python-analyzer"),
GenericAnalyzer("generic-analyzer"),
], router=router)
MapReduce — split work across agents, then reduce:
pythonfrom agent_mcp_framework import MapReducePipeline, AgentContext

pipeline = MapReducePipeline("batch",
agents=[FileAnalyzer(f"worker-{i}") for i in range(4)],
splitter=lambda ctx: [
AgentContext(data={"file": f}) for f in ctx.get("files")
],
reducer=lambda results, ctx: ctx.set(
"all_results", [r.output for r in results]
),
)

MCP Server — The Exposure Layer One line to turn any pipeline into an MCP tool: pythonfrom agent_mcp_framework import AgentMCPServer

server = AgentMCPServer("code-review", description="Multi-agent code review")
server.add_pipeline_tool(
pipeline,
name="review_code",
description="Analyze code for quality, security, and architecture issues.",
)

server.run() # Starts MCP server on stdio
Now any MCP client can call review_code and get a multi-agent analysis back.

Design Decision: Context Isolation in Parallel Pipelines
The trickiest part was parallel execution. When multiple agents run concurrently on the same context, you get race conditions — two agents writing to the same key, lost updates, stale reads.
My solution: each parallel agent gets a deep copy of the context. After all agents complete, their contexts merge back into the original. This means:

No locks, no mutexes, no shared mutable state
Each agent writes freely without stepping on others
The merge is deterministic (last-write-wins per key)

python# Inside ParallelPipeline.execute():
snapshots = [ctx.model_copy(deep=True) for _ in self.agents]

results = await asyncio.gather(
*[a.execute(s) for a, s in zip(self.agents, snapshots)]
)

Merge back

for snap in snapshots:
ctx.data.update(snap.data)
Simple, correct, no surprises.

Real-World Use Case: Code Review Server
The repo includes a complete code review server example with four agents:

QualityAnalyzer — checks line length, wildcard imports, missing docstrings
SecurityScanner — detects eval(), exec(), os.system(), pickle.loads()
ArchitectureReviewer — flags too many classes, global state, deep nesting
ReportGenerator — combines findings into a scored report (A through F)

The analysis agents run in parallel (they're independent), then the report generator runs sequentially (it needs all findings).
Here's what a scan of insecure code produces:
json{
"score": 52,
"grade": "C",
"quality": {"count": 3, "issues": ["..."]},
"security": {"count": 1, "findings": ["eval() — potential code injection"]},
"architecture": {"count": 1, "notes": ["Global state detected"]}
}
I've used this same pattern to build internal tools that analyze entire repositories — scanning tech stacks, detecting anti-patterns, and producing readiness assessments. The framework handles the orchestration; the domain logic lives in the agents.

What I'd Build Next
The framework is intentionally minimal right now — agents, pipelines, MCP server. Things I'm considering:

Agent-to-agent messaging — let agents communicate mid-pipeline
Retry policies — configurable retry with backoff for flaky LLM calls
Streaming results — progressive output as agents complete
Pipeline visualization — render the DAG of agent dependencies

Try It
bashpip install agent-mcp-framework
pythonfrom agent_mcp_framework import Agent, AgentContext, AgentResult, SequentialPipeline

class MyAgent(Agent):
async def run(self, context: AgentContext) -> AgentResult:
data = context.get("input", "")
return AgentResult(success=True, output=f"Processed: {data}")

pipeline = SequentialPipeline("demo", agents=[MyAgent("worker")])
80 tests. Zero lint errors. Typed with py.typed marker. MIT licensed.

GitHub: github.com/Jbermingham1/agent-mcp-framework
PyPI: pypi.org/project/agent-mcp-framework

If you're building multi-agent systems with MCP, I'd love to hear how you're approaching composition and orchestration. Drop a comment or open an issue on the repo.

214 Repos, Zero ML: What Public Signals Reveal About Mid-Market SaaS AI Strategy

Jarrad Bermingham — Tue, 10 Feb 2026 08:17:00 +0000

I analyzed the AI capabilities of 4 mid-market SaaS companies using only public data. Three distinct failure patterns emerged.

The Experiment
Over the past week, I ran public-signal analyses on 4 mid-market SaaS companies in the HR tech space. Each company positions AI as a core capability. Each has shipped multiple AI features. Each markets AI as their competitive differentiator.
I wanted to know: does the public engineering signal match the marketing?
The companies (anonymized):
CompanyRevenueEmployeesCustomersAI Features ShippedCompany A~$200M+ ARR900+5,000+3+ (analytics, summaries)Company B~$25-50M~1361,500+6 (copilot, review wizard, meeting coach)Company C~$75-100M~2873,500+6+ (predictive model, AI coach, reviews)Company D~$50-57M~424-5353,000+5+ (predictive analytics, AI scheduling, gen-AI)
Combined: ~$250-300M revenue. ~1,600 employees. 10,000+ customers.
The Signal: GitHub
Public GitHub repositories are an imperfect but meaningful signal for engineering capability. They show what a company's engineering team builds, values, and invests in.
Across all 4 companies:

Total public repos: 214 (178 + 5 + 29 + 2)
Repos involving machine learning: 0
Repos involving data science: 0
Repos involving model training: 0

Company A has 178 repos — all infrastructure, deployment tooling, and frontend libraries. Company C has 29 repos — all Django utilities, CI/CD, and API integrations. Company B has 5 repos — all forks of third-party libraries. Company D has 2 repos — one hackathon project and one webhook example.
Not a single company has a public repository that touches ML, data science, or model training.
Three Failure Patterns
Pattern 1: The Wrapper Trap
Seen in: Companies A, B (most clearly)
The Wrapper Trap is the most common pattern: shipping AI features built on managed LLM APIs (OpenAI, Claude, etc.) and marketing them as competitive differentiation.
The problem: if your AI is a wrapper on someone else's model, your competitor ships the same feature in weeks using the same API. There's no moat. There's no differentiation. There's a press release.
Company B (136 employees) has shipped 6 AI features in a year using this pattern. Which means a competitor with 600 employees can ship the same features in a quarter. The advantage isn't speed — it's the data. But the data is being used for dashboards, not training.
Pattern 2: AI by Acquisition
Seen in: Companies C, D (most clearly)
When companies recognize they need AI capability faster than they can build it, they acquire. Company C acquired an AI coaching startup. Company D acquired an AI-powered scheduling company.
The strategy is understandable. The risk is structural.
Post-acquisition engineering attrition averages 33% within 18 months. If the acquired team leaves and the acquiring company has zero ML repos and no ML hiring, they've paid for capability they can't maintain.
Company C has 29 GitHub repos — zero ML — and recently acquired an AI startup. Their engineering team has contracted ~35% in the past year. Who maintains the AI if the acquired engineers leave?
Pattern 3: The Island Problem
Seen in: Company D (most clearly)
The Island Problem appears in companies that have solved the Wrapper Trap through multiple acquisitions or partnerships. They have genuine AI assets — but those assets can't talk to each other.
Company D has:

A university R&D partnership producing predictive ML models
An acquired startup running AI-powered scheduling algorithms
Core platform gen-AI features via LLM APIs

Three AI engines. Three different tech stacks. Three separate data models. Zero cross-pollination.
The scheduling AI can't learn from engagement data. The predictive models can't improve scheduling. The LLM features can't leverage either model's intelligence.
The whole is less than the sum of parts.
The Data Moat Paradox
Here's the most striking finding: the companies sitting on the richest proprietary data are the worst at using it for AI.
Combined, these 4 companies have:

Billions of proprietary data points (survey responses, performance reviews, scheduling patterns, learning completions)
Decades of domain expertise
Millions of end users generating continuous data

All of it is being used for dashboards, analytics, and reports. None of it is being used for model training, fine-tuning, or domain-specific intelligence.
The data moats exist. The ML engineering to exploit them doesn't.
What Would Fix This
The fix isn't more AI features. It's three things:

One ML engineer. A single ML engineer changes the trajectory. They can evaluate whether "predictive models" are genuine ML or regression-with-marketing. They can fine-tune a model on proprietary data. They can assess whether an acquisition's AI is maintainable.
Data as training asset, not storage problem. Restructuring data pipelines to support model training — not just dashboards — is a one-time investment that compounds indefinitely. Every survey response, every performance review, every scheduling outcome becomes training data for models competitors can't replicate.
Integration over acquisition. Before the next AI acquisition, invest in connecting existing AI assets. Make the scheduling AI learn from the engagement data. Make the predictive model inform the coaching recommendations. Integration creates compound value; isolated acquisition creates diminishing returns. The Gap Is Widening The gap between AI marketing and AI engineering in mid-market SaaS isn't closing. It's widening. Companies are shipping more AI features while building less ML capability. Right now, in February 2026, I'm not seeing mid-market SaaS companies closing this gap on their own. The data moats are there. The engineering to exploit them isn't.

Methodology: All analysis based on public signals — GitHub repositories, job postings, product pages, press coverage, and financial data. No proprietary information was accessed. All companies anonymized.
I'm Jarrad Bermingham — I build production AI agent systems and open-source developer tooling at Bifrost Labs. Find our tools @bifrostlabs on npm.

I Built a 13-Agent AI System That Reviews Its Own Decisions. Here's the Architecture.

Jarrad Bermingham — Mon, 09 Feb 2026 13:15:25 +0000

Most people use Claude Code to write functions. I built a system where 13 specialized AI agents coordinate, challenge each other, and collectively make better decisions than any single agent could alone.
This isn't a weekend prototype. It runs daily. It has 141 tests across two shipped npm packages. It once scored a business opportunity 88/100 — without running a single web search. That failure, and six others like it, shaped every design decision you'll read below.
Here's the full architecture: routing engine, adversarial verification, lifecycle hooks, memory system, and the specific failures that made each one necessary.

Why Multi-Agent?
Single-agent AI has a ceiling, and it's lower than most people think.
Ask one agent to research a market, build the product, AND evaluate whether it's worth building — you get confirmation bias baked into every step. The agent that researched will defend its findings. The agent that built will justify its architecture. The agent that evaluated will anchor on work already done.
I hit this wall repeatedly. My single-agent setup would produce 2,000 words of reasoning explaining why a strategy was brilliant, then I'd spend 10 minutes on Google and find three competitors doing it better. The reasoning was airtight. The premises were wrong.
Multi-agent orchestration solves this through separation of concerns:

Specialists handle what they're good at
Adversaries exist solely to find flaws
A coordinator synthesizes without getting attached to any one perspective

The result: better decisions, caught earlier, with a paper trail of why.

The Architecture
User (CEO) → "What" — business intent
↓
Lead Agent (CTO) → "How" — all technical decisions
↓
┌─────────────┬─────────────┬──────────────┐
│ Routing: │ │ │
│ Skill? │ Solo? │ Subagent? │ Agent Team?
↓ ↓ ↓ ↓
Skills /cmd Solo Execute Static Agents Dynamic Teams
(19 skills) (< 2 min) (1-3 agents) (custom composition
+ mandatory adversary)
The critical design decision: the lead agent never codes directly on complex tasks. It classifies the request, composes the right team, delegates, and synthesizes. The coordinator coordinates. The specialists specialize.
This sounds obvious until you build it. My first instinct was having the lead agent "help out" when it knew the answer. That creates a god-agent that subtly biases team output because it already has an opinion before the specialists even start. Forcing strict delegation eliminated an entire class of coordination bugs.

The Routing Engine
Every request hits a four-level decision tree before any work begins:
→ Existing skill handles it? → DELEGATE (19 skills)
→ Trivial (< 2 min, well-defined)? → SOLO (lead executes directly)
→ Moderate (2-5 steps, single domain)? → SPAWN 1-3 specialist agents
→ Complex (5+ steps OR cross-domain)? → AGENT TEAM (dynamic + mandatory DA)
Routing matters because agent overhead is real. Spawning a 3-agent team to rename a variable wastes tokens, time, and context window. Running solo on a strategic decision is dangerous — you're back to single-agent confirmation bias.
The router's job is proportional response. Match the complexity of the tool to the complexity of the task.
One rule I learned the hard way: when in doubt, route UP, not down. Treating a complex task as moderate is far more costly than treating a moderate task as complex. Over-routing wastes tokens. Under-routing wastes decisions.

The 13 Static Agents
Each agent has a defined role, model tier, toolset, and memory scope:
AgentFunctionModelDevil's AdvocateFind flaws, score proposals, kill bad ideasOpusStrategic AdvisorHigh-level strategy, market positioningOpusOpponent ModelerGame theory, competitive analysisOpusResearcherWeb research, data gathering, market analysisSonnetVerifierQuality gates, completion validationSonnetTest RunnerExecute and validate test suitesSonnetDebuggerRoot cause analysis, error diagnosisSonnetCode ReviewerArchitecture review, anti-patterns, code qualitySonnetSecurity AuditorVulnerability detection, dependency risksSonnetSession DistillerCompress session learnings for future contextSonnetUpgrade AnalystDependency analysis, breaking change detectionSonnetPR ReviewerPull request quality, merge readinessSonnetRLM ProcessorRecursive reasoning, iterative refinementSonnet
Notice the model distribution: only 3 agents run on Opus (the reasoning-heavy ones), and the rest use Sonnet. This wasn't the original design — I'll explain why in the Failures section.
Static agents handle solo specialist delegation. For complex tasks, the system composes dynamic teams — fresh agents with custom system prompts tailored to the specific problem. No two teams look the same, because no two complex problems have the same shape.

The Part That Changed Everything: Adversarial Verification
This is the section I wish someone had written before I learned it the hard way.
Early in the project, I had the system evaluate a business opportunity. The DA agent ran its protocol, synthesized findings, and delivered a score: 88/100. Strong proceed. Compelling reasoning. Specific recommendations.
It hadn't run a single web search.
The 88 was sophisticated-sounding analysis with a confidence number attached. No competitor research. No market validation. No external data of any kind. Just... vibes with decimal points.
I almost shipped a strategy based on that score. That near-miss became the most important design constraint in the entire system.
4 Tiers of Adversarial Depth
L3: MULTI-ADVERSARIAL ──── Devil's Advocate + Contrarian + Pre-Mortem
For: Strategic, irreversible decisions

L2: FULL DA PROTOCOL ───── 5-phase: Claims → Verify → Belief Gap → Pre-Mortem → Score
For: Complex tasks, new builds, significant resource commitments

L1: QUICK CHALLENGE ────── 3 adversarial questions before output delivery
For: Moderate tasks, recommendations, estimates

L0: SELF-CHECK ─────────── 3 assumption checks before ANY output (always active)
For: Everything — no exceptions, no override
L0: The Foundation (Always Active)
L0 is embedded in every agent's system prompt. Before delivering any output, every agent silently runs three checks:

What assumption am I making that I haven't verified?
What's the strongest argument against my conclusion?
What would I be wrong about if challenged by a domain expert?

This costs almost nothing — three questions before each response, no external calls. But it catches unverified assumptions at the source, before they propagate through multi-agent handoffs where they become much harder to trace.
L2: Where It Gets Interesting
The Devil's Advocate agent runs a 5-phase protocol:
Phase 1: Claim Extraction — Identify every falsifiable claim in the proposal. Not opinions, not framing — specific claims that can be tested against reality.
Phase 2: Adversarial Verification — Search for contradicting evidence. Here's the key constraint: 60%+ of searches must seek disconfirmation. The default LLM behavior is to search for supporting evidence. You have to explicitly force the opposite. The search queries aren't "why X is a good idea" but "why X fails," "X competitors better than," "problems with X approach."
Phase 3: Belief Gap Analysis — What does the team wish were true vs. what is true? This catches motivated reasoning — the gap between the conclusion you want and the evidence you have.
Phase 4: Pre-Mortem — "It's 6 months later and this failed. Why?" Generate 5 independent failure scenarios. This reframes evaluation from "will this work?" (optimism bias) to "how could this fail?" (much more productive).
Phase 5: Scoring — 0-100 weighted rubric across evidence quality, market fit, execution feasibility, and risk factors. Score 70+ to proceed, 50-69 to refine, below 50 to kill.
The Double-DA Rule
After the 88/100 incident, I added a safeguard: any score above 80 automatically triggers a second, independent DA run. The second evaluator has no access to the first's findings, reasoning, or score.
The reconciliation logic:

If both scores are within 10 points → higher score stands
If the gap is larger than 10 → the lower score wins

The reasoning: if two independent evaluations diverge by more than 10 points, the optimistic run missed something the skeptical run caught. Defaulting to the lower score builds in a systematic pessimism bias for high-confidence assessments — which is exactly where overconfidence is most dangerous.
This rule has killed three initiatives that would have wasted months of development time.
7 Mandatory Research Gates
The DA can't score a proposal until it passes all seven:

At least 3 web searches executed (no armchair analysis)
At least 1 search explicitly seeking contradicting evidence
Competitor/alternative analysis included (minimum 2 alternatives)
Market size claim backed by external source (not LLM reasoning)
Technical feasibility verified against real constraints
"Who else has tried this?" check completed
First-principles cost estimate included If any gate fails, the DA cannot issue a score. It must report which gates failed and what information is missing. This makes "confident ignorance" structurally impossible.

Lifecycle Hooks: Automated Quality Enforcement
Claude Code supports lifecycle hooks — shell scripts that trigger on specific events. I use 10 of them to enforce quality gates that no agent can bypass:
HookTriggerPurposesession-start.shSession beginsLoad previous context + memorypre-compact.shBefore context compressionSave session state before data losssession-end.shSession endsPersist learnings, distill sessionverify-before-complete.shBefore task completionBlock premature completionteammate-idle-check.shAgent goes idleForce DA verdict deliverytask-completed-gate.shTask marked doneLog metrics, update pipelinevalidate-search-quality.shAfter web searchesEnforce research depth minimums
The two most important hooks solve specific failure modes I hit repeatedly:
verify-before-complete.sh blocks any task from being marked complete until the verifier agent has signed off. Without this, agents declare victory the moment code compiles. "It works" and "it's correct" are different statements — this hook enforces the distinction.
teammate-idle-check.sh catches a subtler problem: the Devil's Advocate going idle before delivering a verdict. In multi-agent teams, the DA reads other agents' outputs and then... does nothing. It "participated" without actually challenging anything. This hook detects when the DA hasn't delivered a written verdict and forces one before the team can proceed.
These hooks are the immune system. They don't make the system smarter — they prevent specific, known failure modes from recurring.

Skills: The Fast Path
19 skills act as composable workflows invoked via slash commands. Each is self-contained with its own execution logic:
/tdd → Test-driven development (failing test → minimal code → refactor)
/auto-orchestrate → Classify task complexity, compose optimal agent team
/devils-advocate → Full L2 adversarial protocol
/research → Web research with verification gates
/council-of-winners → Elite strategy: power plays, asymmetric upside identification
/prospect-scan → Company AI maturity assessment (10-point rubric)
/commit-push-pr → Git workflow: branch → commit → push → PR
/self-audit → Full-spectrum system health check
/produce-deliverable → End-to-end client deliverable pipeline
/distribute → Generate platform-specific content from any session
Skills are the routing engine's fast path. When a request maps cleanly to an existing skill, there's zero routing overhead — the lead agent recognizes the pattern and delegates directly. The router checks for skill matches first, before evaluating whether to spawn agents.
After 3+ successful uses of the same workflow pattern, the system identifies candidates for new skills — turning repeated multi-step processes into one-command invocations.

The Memory System
Multi-agent systems have a context problem. Each agent starts fresh. But institutional knowledge — past decisions, known failure modes, project context — needs to persist without bloating every session's context window.
Three-layer solution:
Layer 1: Auto-loaded (every session, ~4KB budget)
Core behavior rules, priority stack, active operations summary, agent descriptions (for routing decisions). This loads on every session start via the session-start.sh hook.
The budget matters. I enforce hard limits:

Core config: < 80 lines
Memory summary: < 50 lines
Total auto-load: < 4KB

Without these limits, memory files grow unbounded. A 20KB auto-load eats 15% of your context window before you've typed a single prompt. That's not a cleanup task — it's an architectural constraint.
Layer 2: On-demand (loaded when relevant)
46 reference documents covering orchestration patterns, adversarial depth protocols, model selection guides, strategic positioning, and past operation analyses. Only loaded when the current task requires that specific knowledge.
The lead agent's routing decision includes identifying which reference docs the spawned agents will need. A code review task loads the architecture standards. A strategic decision loads the competitive analysis and failure library. A research task loads the verification protocols.
Layer 3: Persistent (across sessions)
Per-agent isolated memory directories plus session distillation files — compressed learnings from previous sessions generated by the Session Distiller agent. The session-end.sh hook triggers distillation automatically: what was decided, what failed, what should inform future sessions.
This creates a feedback loop: sessions produce learnings → learnings load into future sessions → future sessions build on past context without re-deriving it.

What I've Shipped With This System
The architecture runs in production and has produced real, published artifacts:
Cost Guardian (@bifrostlabs/cost-guardian on npm) — Real-time token cost tracking for Claude Code sessions. Tracks spend per agent, per session, with budget alerts and cost breakdowns. 62 tests.
Claude Shield (@bifrostlabs/claude-shield on npm) — Security lifecycle hooks that block destructive commands before they execute. Pattern matching against known dangerous operations with configurable severity levels. 79 tests.
Both packages built using TDD (the /tdd skill), verified by the adversarial system, and published through the /commit-push-pr automated workflow.

The 7 Failures That Shaped the Architecture
I'm sharing these in detail because the architecture only makes sense in light of the problems it solved. Every guardrail exists because something specific broke.
Failure 1: The 88/100 Score With Zero Research
What happened: DA evaluated a business opportunity. Produced 2,000 words of analysis. Scored 88/100. Had not executed a single web search. No competitor data. No market validation.
Root cause: The DA protocol was reasoning-only. It could construct sophisticated arguments entirely from the LLM's training data and pattern matching. No requirement for external evidence.
Fix: 7 mandatory research gates. 60%+ of searches must seek contradicting evidence. No score can be issued until all gates pass.
Failure 2: Panic-Pivoting on New Information
What happened: Competitive intelligence arrived showing a well-funded competitor in the space. The system immediately recommended abandoning the entire strategy — not adjusting the approach, but wholesale pivot.
Root cause: No distinction between "this new info invalidates our thesis" and "this new info invalidates specific tactics." The system treated all threatening information as existential.
Fix: Anti-pivot rule. New intel triggers a thesis-vs-tactics triage: Does this contradict why we're doing this (thesis) or how we're doing it (tactics)? Thesis invalidation requires L3 review. Tactics adjustment requires L1. Most "pivots" are actually tactical adjustments that don't require strategic rethinking.
Failure 3: Building the Zero-Revenue Component
What happened: The system had three workstreams: free tools, content distribution, and a revenue-generating service platform. 100% of execution went to free tools. Weeks of development, zero progress on anything that would generate income.
Root cause: The router treated all "build" tasks equally. It didn't distinguish between building a free open-source tool and building the paid service that sustains the business. Both looked like "coding tasks" to the routing engine.
Fix: Revenue reality gate in the DA protocol. Before any workstream gets resources, the DA asks: "Does this directly lead to revenue within 90 days? If not, what's the explicit theory for how it converts to revenue later?"
Failure 4: Burning 50% of the Weekly Token Budget in One Session
What happened: Spawned 5 Opus-tier agents for research tasks. Each running full context, full reasoning, full analysis. The session cost more than the previous week combined.
Root cause: The model selection guide existed as a reference doc but wasn't enforced. Agents defaulted to the most capable model because nothing stopped them.
Fix: Hard constraints: maximum 3 parallel agents, maximum 1 Opus agent per team, research agents always use Sonnet. These aren't guidelines — they're enforced by the routing engine.
Failure 5: Strategy Oscillation
What happened: Six strategic directions in four months. Each one researched, architected, partially built, then abandoned when the next "better" idea emerged. Zero revenue from any of them.
Root cause: No commitment mechanism. Every new analysis could trigger a full strategic pivot. The system was optimized for evaluating strategies, not for executing them.
Fix: Strategy Lock — a config file that requires explicit CEO override to change strategic direction. The DA can recommend adjustments, but wholesale pivots require human intervention. The lock has held for the current strategy. Override count: 0.
Failure 6: Absence as Evidence
What happened: System searched for competitors on six platforms. Found nothing. Concluded: "zero competition, massive opportunity." Every platform actually had multiple established competitors — the searches just used the wrong queries.
Root cause: Treating "I didn't find it" as "it doesn't exist." The system didn't distinguish between exhaustive search and unsuccessful search.
Fix: DA Failure Library with pattern matching. "Absence ≠ evidence" is now a named pattern. When a search returns zero results, the system flags it for manual verification and tries alternative search queries before drawing conclusions.
Failure 7: Survivor Bias in Success Stories
What happened: System researched solo consulting success stories to validate the business model. Found dozens. Concluded the model was highly viable.
Root cause: It only found success stories because failures don't write blog posts. The actual solo consulting failure rate is ~80% within the first year.
Fix: DA protocol now requires searching for failure rates alongside success stories. Any market validation must include base rate data, not just examples of people who made it.
Each failure became a permanent entry in the DA Failure Library — a pattern-matching system that checks new proposals against past mistakes. The system doesn't just learn from failures in the abstract; it maintains a structured database of exactly how it failed and checks whether new proposals exhibit the same patterns.

Key Lessons
Separate evaluation from execution. The agent that builds something will defend it. A separate adversary catches what the builder won't see. This isn't just good practice — it's the single highest-leverage architectural decision I made.
Enforce verification, don't suggest it. Having a DA protocol didn't help when it was optional. The verify-before-complete hook, the mandatory research gates, the teammate-idle-check — these work because they're structural, not cultural. An agent can't skip them.
Compose teams dynamically. Every complex task is different. Composing fresh agent teams with task-specific system prompts outperforms recycling the same agent template. The overhead of writing a custom prompt is trivial compared to the cost of a misfit team.
Context discipline is architecture, not cleanup. Without size limits on auto-loaded memory, context bloat degrades everything — reasoning quality, response speed, token cost. The 4KB budget and the 80-line config limit are design decisions, not afterthoughts.
Build the failure feedback loop. The Double-DA rule, the verify-before-complete hook, the DA Failure Library — each exists because the system failed in a specific, observable way. The meta-skill isn't building agents. It's building the system that turns agent failures into agent guardrails.

What's Next
I'm open-sourcing components of this architecture and documenting the patterns that transfer to any multi-agent system. The core principles aren't Claude-specific:

Route by complexity, not by habit
Enforce adversarial checking at every tier
Compose teams dynamically, not from templates
Make your failures into your guardrails

If you're building multi-agent systems — on Claude Code, LangChain, CrewAI, or anything else — I'd genuinely like to hear what coordination problems you've hit and how you solved them.

Built with Claude Code. 13 agents, 19 skills, 10 lifecycle hooks, 141 tests, 7 documented failures. Running in production at @bifrostlabs on GitHub and npm.