<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jarrad Bermingham</title>
    <description>The latest articles on DEV Community by Jarrad Bermingham (@jarradbermingham).</description>
    <link>https://dev.to/jarradbermingham</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3762058%2F876ef262-c7ca-46d0-b8f4-af0ad7ecdfe4.jpeg</url>
      <title>DEV Community: Jarrad Bermingham</title>
      <link>https://dev.to/jarradbermingham</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jarradbermingham"/>
    <language>en</language>
    <item>
      <title>The Insider Screamed. The Outsider Whispered. Same Truth, Different Volume.</title>
      <dc:creator>Jarrad Bermingham</dc:creator>
      <pubDate>Mon, 06 Apr 2026 09:10:12 +0000</pubDate>
      <link>https://dev.to/jarradbermingham/the-insider-screamed-the-outsider-whispered-same-truth-different-volume-8no</link>
      <guid>https://dev.to/jarradbermingham/the-insider-screamed-the-outsider-whispered-same-truth-different-volume-8no</guid>
      <description>&lt;p&gt;A technical team spent months warning their leadership about critical security issues in their own infrastructure. Missing security headers. Third-party trackers running without consent on government-connected portals. Configurations that any competent attacker would find in minutes.&lt;/p&gt;

&lt;p&gt;Leadership heard the warnings. Filed them. Did nothing.&lt;/p&gt;

&lt;p&gt;Then an outsider — someone with no relationship to the organization, no access to their internal systems, no special tools — spent 90 minutes looking at what was publicly visible from a browser.&lt;/p&gt;

&lt;p&gt;They found the same things the internal team had been screaming about.&lt;/p&gt;

&lt;p&gt;The outsider sent one message. Not a report. Not a presentation. Not a budget request. Just: "Here's what's visible. You should know."&lt;/p&gt;

&lt;p&gt;The organization fixed every issue that same day.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why External Validation Works When Internal Warnings Don't
&lt;/h2&gt;

&lt;p&gt;This pattern isn't unique. I've seen it across every industry:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Internal team identifies risk — they document it, escalate it, present it with evidence&lt;/li&gt;
&lt;li&gt;Leadership acknowledges it — nods, takes the report, puts it in the backlog&lt;/li&gt;
&lt;li&gt;Nothing happens — because internal warnings feel like cost centers&lt;/li&gt;
&lt;li&gt;External party confirms the same findings — suddenly it's urgent&lt;/li&gt;
&lt;li&gt;Everything gets fixed overnight — the internal team is "finally being heard"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because when your own team says "we're vulnerable," leadership hears "we need more budget."&lt;/p&gt;

&lt;p&gt;When an outsider says "you're vulnerable," leadership hears "we're about to be in the news."&lt;/p&gt;

&lt;p&gt;The messenger changes the message — even when the words are identical.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Actually Checked (The 90-Minute Methodology)
&lt;/h2&gt;

&lt;p&gt;No scanning tools. No exploitation. No terms of service violations. Just a browser, curl, and publicly available information.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Security Headers
&lt;/h3&gt;

&lt;p&gt;Every web server sends response headers that reveal how seriously security is taken. The absence of certain headers IS the finding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;X-Frame-Options&lt;/strong&gt; — prevents clickjacking. Missing = your pages can be embedded in attacker-controlled frames.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content-Security-Policy&lt;/strong&gt; — controls what scripts can execute. Missing = XSS attacks are trivially easy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;X-Content-Type-Options&lt;/strong&gt; — prevents MIME sniffing. Missing = browsers may execute files as code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strict-Transport-Security&lt;/strong&gt; — enforces HTTPS. Missing = traffic can be intercepted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How to check: open your browser's developer tools, go to the Network tab, click any request, and look at the response headers.&lt;/p&gt;

&lt;p&gt;If you see none of the above — nobody configured them. That's a finding.&lt;/p&gt;
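&lt;p&gt;The same check is easy to script. Here's a minimal Python sketch using only the standard library; the header list mirrors the four headers above, and &lt;code&gt;missing_security_headers&lt;/code&gt; is an illustrative helper, not an exhaustive policy checker:&lt;/p&gt;

```python
# Headers whose absence is itself a finding (see the list above).
EXPECTED = [
    "x-frame-options",
    "content-security-policy",
    "x-content-type-options",
    "strict-transport-security",
]

def missing_security_headers(headers):
    """Return the expected security headers absent from a response (case-insensitive)."""
    present = {name.lower() for name in headers}
    return [h for h in EXPECTED if h not in present]

if __name__ == "__main__":
    # Against a live site you would feed in real response headers, e.g.:
    #   import urllib.request
    #   resp = urllib.request.urlopen("https://yourdomain.com")
    #   print(missing_security_headers(dict(resp.headers)))
    sample = {"Content-Type": "text/html", "Strict-Transport-Security": "max-age=63072000"}
    print(missing_security_headers(sample))
```

&lt;p&gt;A non-empty list is your finding.&lt;/p&gt;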

&lt;h3&gt;
  
  
  2. Third-Party Trackers
&lt;/h3&gt;

&lt;p&gt;Open any page. Look at what external domains are loaded. Google Analytics, Hotjar, DoubleClick, Facebook Pixel — each one is a third-party script running in your users' browsers.&lt;/p&gt;

&lt;p&gt;The questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is there a consent mechanism? (GDPR/Privacy Act compliance)&lt;/li&gt;
&lt;li&gt;Are tracking scripts on authentication pages? (credential exposure risk)&lt;/li&gt;
&lt;li&gt;Are tracking scripts on government or healthcare portals? (regulatory violation)&lt;/li&gt;
&lt;/ul&gt;
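&lt;p&gt;The browser's Network tab is the primary tool here, but you can script a first pass over a page's HTML with the standard library. A hedged sketch: it only catches statically declared &lt;code&gt;script src&lt;/code&gt; tags, not trackers injected at runtime by tag managers:&lt;/p&gt;

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class ScriptSrcCollector(HTMLParser):
    """Collect the hostnames a page loads script tags from."""
    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            host = urlparse(dict(attrs).get("src", "")).netloc
            if host:
                self.hosts.add(host)

def external_script_hosts(html, own_host):
    """Hosts serving scripts to the page, excluding the site's own host."""
    collector = ScriptSrcCollector()
    collector.feed(html)
    return sorted(h for h in collector.hosts if h != own_host)
```

&lt;p&gt;Any host in the result that isn't your own infrastructure is worth the three questions above.&lt;/p&gt;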

&lt;h3&gt;
  
  
  3. TLS Configuration
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Is HTTPS enforced, or does HTTP still work?&lt;/li&gt;
&lt;li&gt;Is the certificate valid and current?&lt;/li&gt;
&lt;li&gt;Are older TLS versions (1.0, 1.1) still enabled?&lt;/li&gt;
&lt;/ul&gt;
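&lt;p&gt;A quick probe with Python's &lt;code&gt;ssl&lt;/code&gt; module answers the version question. One caveat, flagged in the comments: modern Python refuses TLS 1.0/1.1 by default, so testing whether a server still &lt;em&gt;accepts&lt;/em&gt; legacy versions means explicitly lowering the context's minimum version:&lt;/p&gt;

```python
import socket
import ssl

LEGACY_VERSIONS = {"TLSv1", "TLSv1.1"}

def negotiated_tls_version(host, port=443, timeout=5.0):
    """Return the TLS version negotiated with a modern default context.

    Note: create_default_context() disallows TLS 1.0/1.1 (since Python 3.10),
    so this reports the best mutually supported version. To probe whether a
    server still accepts legacy TLS, lower ctx.minimum_version explicitly.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version()

def is_legacy(version):
    """Flag protocol versions that should be disabled."""
    return version in LEGACY_VERSIONS
```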

&lt;h3&gt;
  
  
  4. DNS and Subdomains
&lt;/h3&gt;

&lt;p&gt;What subdomains are publicly visible? Are there staging environments exposed? Old portals still running? Development servers accessible from the internet?&lt;/p&gt;

&lt;p&gt;Free tools: crt.sh for certificate transparency logs. Shows every subdomain that's ever had an SSL certificate.&lt;/p&gt;
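&lt;p&gt;crt.sh also has an unofficial JSON output (append &lt;code&gt;output=json&lt;/code&gt; to the query). Parsing it takes a few lines; the &lt;code&gt;name_value&lt;/code&gt; field is correct at the time of writing, but the endpoint is unofficial and its shape may change:&lt;/p&gt;

```python
import json

def subdomains_from_crtsh(raw_json):
    """Extract unique hostnames from a crt.sh JSON response.

    Each certificate record's "name_value" field can hold several
    newline-separated names, including wildcard entries.
    """
    names = set()
    for record in json.loads(raw_json):
        for name in record.get("name_value", "").splitlines():
            names.add(name.strip().lstrip("*."))  # drop leading "*." wildcards
    names.discard("")
    return sorted(names)
```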

&lt;h3&gt;
  
  
  5. Publicly Accessible Services
&lt;/h3&gt;

&lt;p&gt;Are there login portals, admin panels, API documentation, or status pages accessible without authentication?&lt;/p&gt;




&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;In 90 minutes, five of six checks revealed issues, at an organization that positions itself as a leader in technology and cybersecurity.&lt;/p&gt;

&lt;p&gt;Their own technical staff had been documenting these exact problems internally. For months.&lt;/p&gt;

&lt;p&gt;One external message. Same day fix.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;p&gt;The lesson isn't "hire external consultants." The lesson is about how organizations process risk signals.&lt;/p&gt;

&lt;p&gt;Internal signals get filtered through politics. The person raising the alarm has a budget to defend, a role to protect, a relationship to maintain. Their warning comes wrapped in organizational context. It can be deflected with "we'll get to it next quarter."&lt;/p&gt;

&lt;p&gt;External signals bypass the political filter. The outsider has no agenda inside the organization. Their warning arrives without organizational context. It can't be deflected — because the person sending it has no reason to send it unless the problem is real.&lt;/p&gt;

&lt;p&gt;The fix isn't to outsource all security assessment. The fix is to create internal mechanisms that give security findings the same weight as external ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Empower the internal team to act, not just report.&lt;/strong&gt; If they can identify the issue, give them the authority to fix it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create external validation loops.&lt;/strong&gt; Periodic external assessment isn't a replacement for internal teams — it's an amplifier. It gives internal findings the political weight they need to get prioritized.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track time-to-fix for internal vs. external findings.&lt;/strong&gt; If external findings get fixed in 24 hours and internal findings take 6 months — the problem isn't technical. It's organizational.&lt;/li&gt;
&lt;/ul&gt;
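&lt;p&gt;That last metric is easy to start tracking from whatever system your findings already live in. A minimal sketch; the field names are hypothetical, not tied to any particular tracker:&lt;/p&gt;

```python
from statistics import median

def median_days_to_fix(findings):
    """Median days-to-fix grouped by who reported the finding.

    Each finding is a dict like {"source": "internal", "days_to_fix": 42};
    the field names are illustrative.
    """
    by_source = {}
    for finding in findings:
        by_source.setdefault(finding["source"], []).append(finding["days_to_fix"])
    return {source: median(days) for source, days in by_source.items()}
```

&lt;p&gt;If the "external" number is days and the "internal" number is months, you've measured the organizational problem.&lt;/p&gt;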




&lt;h2&gt;
  
  
  What You Can Do Right Now
&lt;/h2&gt;

&lt;p&gt;Open a terminal. Run this against your own domain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-sI&lt;/span&gt; https://yourdomain.com | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-iE&lt;/span&gt; &lt;span class="s2"&gt;"x-frame|content-security|x-content-type|strict-transport"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the output is empty — you have the same problem. And your team probably already knows.&lt;/p&gt;

&lt;p&gt;The question is: are you listening to them?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I build AI-powered security assessment tools and help organizations find what's publicly visible about their infrastructure before attackers do.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://bifrostlabs.co" rel="noopener noreferrer"&gt;bifrostlabs.co&lt;/a&gt; | &lt;a href="https://x.com/ExitVelocity_" rel="noopener noreferrer"&gt;X&lt;/a&gt; | &lt;a href="https://www.linkedin.com/in/jarrad-bermingham" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>cybersecurity</category>
      <category>devops</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I Mapped the AI Attack Surface Nobody Else Has: Introducing AAISAF</title>
      <dc:creator>Jarrad Bermingham</dc:creator>
      <pubDate>Wed, 25 Mar 2026 11:17:38 +0000</pubDate>
      <link>https://dev.to/jarradbermingham/i-mapped-the-ai-attack-surface-nobody-else-has-introducing-aaisaf-2l0b</link>
      <guid>https://dev.to/jarradbermingham/i-mapped-the-ai-attack-surface-nobody-else-has-introducing-aaisaf-2l0b</guid>
<description>&lt;p&gt;Yesterday a supply chain attack hit litellm — 97 million monthly downloads. One pip install. SSH keys, AWS credentials, API tokens, git secrets, crypto wallets — all silently exfiltrated in under an hour.&lt;/p&gt;

&lt;p&gt;This is TA05 in AAISAF — a framework I published today.&lt;/p&gt;

&lt;h2&gt;The Problem&lt;/h2&gt;

&lt;p&gt;Every company that deployed an AI system in 2023–2025 created an attack surface their security team has never seen.&lt;/p&gt;

&lt;p&gt;Prompt injection. RAG pipeline poisoning. Agent-to-agent manipulation. MCP server exploitation. Voice AI bypass. Supply chain attacks on AI dependencies.&lt;/p&gt;

&lt;p&gt;Existing frameworks tell you what to worry about. Nobody tells you how to actually test for it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;OWASP LLM Top 10&lt;/strong&gt; — vulnerability categories, no testing methodology&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MITRE ATLAS&lt;/strong&gt; — adversary mapping, no practitioner guidance&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;NIST AI RMF&lt;/strong&gt; — governance structure, no attack techniques&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We built the missing layer.&lt;/p&gt;

&lt;h2&gt;What AAISAF Is&lt;/h2&gt;

&lt;p&gt;AAISAF (AI Security Assessment Framework) is an open-source, technique-level methodology for assessing AI system security. It's structured like MITRE ATT&amp;amp;CK — tactic → technique → sub-technique — applied to AI systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10 tactic categories&lt;/li&gt;
&lt;li&gt;87 assessment techniques&lt;/li&gt;
&lt;li&gt;9 domain checklists&lt;/li&gt;
&lt;li&gt;6 compliance framework mappings&lt;/li&gt;
&lt;li&gt;3 assessment types (30 min / 1–2 day / 5–10 day)&lt;/li&gt;
&lt;li&gt;5-level maturity model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each technique includes attack description and prerequisites, AISS severity score (0.0–10.0), detection guidance, remediation steps, and mandatory evidence (CVE, documented incident, or peer-reviewed research).&lt;/p&gt;

&lt;h2&gt;Two Attack Surfaces With Zero Prior Coverage&lt;/h2&gt;

&lt;h3&gt;TA10 — MCP Server &amp;amp; Tool Security&lt;/h3&gt;

&lt;p&gt;Model Context Protocol is Anthropic's standard for connecting AI to external tools. Released in November 2024, it's now the de facto integration standard, with thousands of production deployments globally.&lt;/p&gt;

&lt;p&gt;CVE-2025-6514 (CVSS 9.6). 1,467 exposed servers on the internet. Zero frameworks covering it.&lt;/p&gt;

&lt;p&gt;We built 12 techniques:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MCP Attack Surface
├── Tool Poisoning via Malicious Description (AISS 8.1)
├── Rug Pull Attack (AISS 8.4)
├── Tool Shadowing (AISS 8.0)
├── Cross-Origin Injection via MCP Resource (AISS 8.3)
├── Privilege Escalation via Tool Chain (AISS 8.7)
├── SSRF via MCP Tool (AISS 7.2)
├── Data Exfiltration via Tool Output (AISS 7.5)
├── MCP Auth Bypass (AISS 9.1)
├── Malicious Server Registration (AISS 8.5)
├── Tool Argument Injection (AISS 7.0)
├── Transport Layer Exploitation (AISS 7.3)
└── Consent Fatigue Exploitation (AISS 5.8)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;I run MCP servers in production as part of a 13-agent AI orchestration system. These techniques came from understanding the architecture from the inside.&lt;/p&gt;

&lt;h3&gt;TA06 — Voice AI Exploitation&lt;/h3&gt;

&lt;p&gt;Millions of AI phone agents handle customer calls daily across healthcare, finance, customer service, and sales. Real-time. Autonomous. Trusted by default because they sound human. No security framework had mapped the attack techniques against them.&lt;/p&gt;

&lt;p&gt;9 techniques:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Voice AI Attack Surface
├── Voice Prompt Injection via Speech (AISS 7.0)
├── Synthetic Voice Spoofing / Deepfake (AISS 8.5)
├── Conversation Flow Bypass (AISS 5.5)
├── Audio Adversarial Examples (AISS 7.2)
├── Credential Harvesting via Voice Agent (AISS 8.3)
├── DTMF Signal Injection (AISS 6.8)
├── Voice Agent Vishing (AISS 8.7)
├── STT Pipeline Exploitation (AISS 5.8)
└── Real-Time Voice Cloning in Active Call (AISS 9.0)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;I build and operate production voice agents on Retell AI infrastructure. Every technique here comes from first-hand knowledge of where these systems break.&lt;/p&gt;

&lt;h2&gt;The AISS Scoring System&lt;/h2&gt;

&lt;p&gt;Standard CVSS doesn't capture AI-specific risk dimensions, so we built AISS — the AI Impact Severity Score — a CVSS-compatible 0.0–10.0 score with five additional metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Autonomy Impact&lt;/strong&gt; — can this attack trigger autonomous harmful action?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cascade Potential&lt;/strong&gt; — can a single compromised agent propagate system-wide?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Persistence&lt;/strong&gt; — is the compromise ephemeral or permanent?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Sensitivity Exposure&lt;/strong&gt; — what does the attacker actually access?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Financial Impact Potential&lt;/strong&gt; — direct and indirect loss estimation&lt;/li&gt;
&lt;/ul&gt;
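&lt;p&gt;To make the shape concrete, here is a purely illustrative sketch (not the published AISS formula; that lives in the repo's scoring spec) of how a CVSS-style base could be adjusted by the five dimensions:&lt;/p&gt;

```python
def aiss_sketch(cvss_base, autonomy, cascade, persistence, data_sensitivity, financial):
    """Illustrative only: NOT the published AISS formula.

    Combines a CVSS-style base (0.0-10.0) with the five AI-specific
    dimensions above, each rated 0.0-1.0, clamped to the 10.0 ceiling.
    See the AISS specification in the aaisaf repo for the real rules.
    """
    modifiers = [autonomy, cascade, persistence, data_sensitivity, financial]
    # Scale the base by up to +30% depending on the AI-specific risk profile.
    adjusted = cvss_base * (1.0 + 0.3 * sum(modifiers) / len(modifiers))
    return round(min(adjusted, 10.0), 1)
```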

&lt;p&gt;Every one of the 87 techniques is scored. Boards understand it. Compliance teams can document against it.&lt;/p&gt;

&lt;h2&gt;Compliance Mappings&lt;/h2&gt;

&lt;p&gt;Every technique maps to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OWASP LLM Top 10 (2025)&lt;/li&gt;
&lt;li&gt;MITRE ATLAS&lt;/li&gt;
&lt;li&gt;NIST AI RMF + AI 600-1 (GenAI Profile)&lt;/li&gt;
&lt;li&gt;ISO/IEC 42001&lt;/li&gt;
&lt;li&gt;EU AI Act — high-risk system requirements hit August 2026 (5 months away)&lt;/li&gt;
&lt;li&gt;Australian Privacy Act, Essential Eight, VAISS/AI6, SOCI Act&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Quick Start&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Jbermingham1/aaisaf
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;Identify your system type (A–G: chatbot, RAG, agentic, multi-agent, voice, MCP, composite)&lt;/li&gt;
&lt;li&gt;Choose your assessment type (30-min / standard / deep)&lt;/li&gt;
&lt;li&gt;Work through the relevant checklists&lt;/li&gt;
&lt;li&gt;Score findings using AISS&lt;/li&gt;
&lt;li&gt;Map to compliance requirements&lt;/li&gt;
&lt;li&gt;Report&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;Repository Structure&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aaisaf/
├── framework/
│   ├── tactics/          # 10 tactic overviews with attack trees
│   ├── techniques/       # 87 individual technique files
│   ├── compliance/       # 6 compliance mapping documents
│   └── maturity/         # 5-level maturity model
├── assessments/
│   ├── checklists/       # 9 domain checklists
│   └── scoring/          # AISS specification and templates
└── references/
    ├── glossary.md
    ├── cve-index.md
    └── bibliography.md
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;h2&gt;What's Next&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;ares-scanner&lt;/strong&gt; — open-source tooling that automates the AAISAF methodology. The framework tells you what to test. The scanner runs the tests.&lt;/p&gt;

&lt;p&gt;Contributions welcome. If you've encountered AI attack techniques not in the framework, open a PR. The goal is for this to become the living standard the community maintains.&lt;/p&gt;

&lt;p&gt;CC BY-SA 4.0. Free forever. No vendor pitch.&lt;/p&gt;

&lt;p&gt;GitHub: github.com/Jbermingham1/aaisaf&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>opensource</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Anthropic Just Showed Us the Biggest Blind Spot in AI Adoption (2M Conversations Analysed)</title>
      <dc:creator>Jarrad Bermingham</dc:creator>
      <pubDate>Fri, 13 Mar 2026 11:36:20 +0000</pubDate>
      <link>https://dev.to/jarradbermingham/anthropic-just-showed-us-the-biggest-blind-spot-in-ai-adoption-2m-conversations-analysed-38kn</link>
      <guid>https://dev.to/jarradbermingham/anthropic-just-showed-us-the-biggest-blind-spot-in-ai-adoption-2m-conversations-analysed-38kn</guid>
      <description>&lt;p&gt;Last week, Anthropic published their most comprehensive analysis yet of how AI is actually being used in the economy. Not projections. Not hype. Real data from over two million Claude conversations, mapped against the entire US occupational database.&lt;br&gt;
The headline finding stopped me mid-scroll:&lt;/p&gt;

&lt;p&gt;AI can theoretically handle 94% of tasks in computer and mathematical roles. People are using it for 33%.&lt;/p&gt;

&lt;p&gt;That's not a technology gap. That's an adoption gap. And it's the single biggest efficiency blind spot in business today.&lt;br&gt;
Let me walk you through what the data actually says, what it means for your business, and what I'm seeing on the ground as someone who helps companies close exactly this kind of gap.&lt;/p&gt;

&lt;h2&gt;The Data: Theoretical Capability vs. Actual Usage&lt;/h2&gt;

&lt;p&gt;Anthropic's research combines two things that are rarely measured together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Theoretical capability&lt;/strong&gt; — what percentage of an occupation's tasks an LLM could theoretically speed up or perform&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Observed usage&lt;/strong&gt; — what people are actually using Claude for in real-world professional settings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gap between these two numbers tells the real story.&lt;/p&gt;

&lt;h3&gt;By Occupation Category&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Theoretical&lt;/th&gt;
&lt;th&gt;Observed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Computer &amp;amp; Mathematical&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;td&gt;33%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Office &amp;amp; Administrative&lt;/td&gt;
&lt;td&gt;~90%&lt;/td&gt;
&lt;td&gt;A fraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business &amp;amp; Finance&lt;/td&gt;
&lt;td&gt;~85%&lt;/td&gt;
&lt;td&gt;Barely scratched&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Computer and math tasks make up roughly one-third of Claude.ai conversations and nearly half of API traffic — yet they're barely scratching the surface of what's possible. Office and admin roles, where 90% of tasks are theoretically automatable, are barely registering.&lt;/p&gt;

&lt;h3&gt;The 10 Most Exposed Roles&lt;/h3&gt;

&lt;p&gt;These aren't warehouse workers or truck drivers. Every single role on this list sits in a corporate office. They're knowledge workers — many of them your highest-paid employees.&lt;/p&gt;

&lt;h2&gt;The Numbers That Should Worry Every Business Leader&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The earnings gap is inverted.&lt;/strong&gt; Workers in the most AI-exposed occupations earn 47% more on average than workers with zero exposure. They're nearly 4x as likely to hold graduate degrees. The people most affected by AI aren't the ones businesses usually worry about protecting — they're the ones with the biggest salaries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Young workers are already feeling it.&lt;/strong&gt; The research found a 14% drop in job-finding rates for 22-to-25-year-olds entering AI-exposed occupations post-ChatGPT compared to 2022. Entry-level positions in knowledge work are quietly contracting before it shows up in unemployment numbers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;30% of workers have zero AI exposure.&lt;/strong&gt; Cooks, mechanics, bartenders, lifeguards — roles requiring physical presence remain untouched. The divide between "AI-exposed" and "AI-proof" jobs is becoming a fault line in the labour market.&lt;/p&gt;

&lt;h2&gt;What This Actually Means for Businesses&lt;/h2&gt;

&lt;p&gt;Here's what I find most striking: this isn't an AI problem. It's a management problem.&lt;/p&gt;

&lt;p&gt;The tools already exist. Claude, GPT-4, Gemini — they can handle the vast majority of tasks in knowledge work today. The 94% theoretical coverage in computer and math roles isn't aspirational. It's current capability.&lt;/p&gt;

&lt;p&gt;So why is observed usage stuck at 33%? From what I see working with businesses, four patterns explain the gap:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No systematic deployment framework.&lt;/strong&gt; Most companies have a ChatGPT subscription and a vague encouragement to "use AI more." That's it. No mapping of which workflows benefit most. No standardised prompts. No integration into existing toolchains. People experiment individually, hit a wall, and go back to doing things the old way.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Individual experimentation instead of team-wide integration.&lt;/strong&gt; One person on the team discovers that AI can draft their reports in 20 minutes instead of 3 hours. They keep doing it quietly. Nobody else on the team knows. There's no mechanism to share what works, standardise it, or scale it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The measurement problem.&lt;/strong&gt; Nobody is tracking time saved. If you asked most managers "How much time does your team save using AI tools?" they'd shrug. Without measurement, there's no business case for expansion. Without a business case, there's no budget for proper deployment. The gap perpetuates itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "waiting for better" trap.&lt;/strong&gt; I hear this constantly: "We'll invest properly when AI gets better." Meanwhile, the research shows that 97% of observed Claude tasks already fall into categories where AI is theoretically capable, and 68% involve tasks rated as fully feasible for an LLM to handle alone. The capability is here. The deployment isn't.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;What I'm Seeing in the Field&lt;/h2&gt;

&lt;p&gt;I work as a Fractional Head of AI for SMEs — businesses that know they need to move on AI but don't have the in-house expertise to do it systematically. The Anthropic data matches what I see in every engagement, and the pattern is remarkably consistent.&lt;/p&gt;

&lt;p&gt;Before a systematic audit, most businesses estimate they're using AI for maybe 40–50% of what's possible. The actual number, once you map their workflows against what AI can handle today, is usually closer to 15–20%.&lt;/p&gt;

&lt;p&gt;The biggest gaps are almost never where leaders expect. Everyone thinks about AI for content generation and coding. The real untapped value is in the mundane: data processing, report summarisation, customer communication drafts, internal knowledge retrieval, meeting preparation, and compliance documentation. Tasks nobody thinks of as "AI tasks" because they've always been done manually.&lt;/p&gt;

&lt;p&gt;The fastest wins come from the boring stuff. A customer service team that implements AI-assisted response drafting sees measurable time savings in the first week. A finance team that uses AI for initial report drafting cuts their month-end close by days, not hours.&lt;/p&gt;

&lt;p&gt;The companies pulling ahead aren't the ones with the fanciest tools. They're the ones with a framework: audit, deploy, measure, iterate.&lt;/p&gt;

&lt;h2&gt;A Framework for Closing the Gap&lt;/h2&gt;

&lt;p&gt;If the Anthropic data has you thinking "we're probably on the wrong side of this gap," here's where to start.&lt;/p&gt;

&lt;h3&gt;Step 1: Audit Your Workflows&lt;/h3&gt;

&lt;p&gt;Map your team's actual tasks against AI capabilities. For each role, ask: what does this person spend time on every day, and which of those tasks could AI meaningfully accelerate? Be specific. "Marketing" isn't a task. "Writing first drafts of product descriptions based on feature specs" is.&lt;/p&gt;

&lt;h3&gt;Step 2: Run a Focused Pilot&lt;/h3&gt;

&lt;p&gt;Pick the 3 highest-impact workflows from your audit. "Highest impact" means: done frequently, time-consuming, and involving tasks AI handles well — writing, analysis, data processing, summarisation. Give your team structured prompts and workflows. Not "here's a ChatGPT login, figure it out." Actual documented processes for how AI fits into each workflow. Two weeks is enough to get meaningful data.&lt;/p&gt;

&lt;h3&gt;Step 3: Measure Relentlessly&lt;/h3&gt;

&lt;p&gt;Track time-to-completion before and after. Track output quality. Track team adoption rates. Build the business case with real numbers from your own organisation, not vendor promises.&lt;/p&gt;

&lt;p&gt;The measurement step is where most companies fail and most pilots die. Don't let it.&lt;/p&gt;
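&lt;p&gt;The measurement itself doesn't need tooling to start. A sketch of the before/after comparison; the structure is illustrative, and a spreadsheet export works just as well:&lt;/p&gt;

```python
def time_saved_report(baseline_minutes, piloted_minutes):
    """Compare average per-task completion times before and after the pilot.

    Both arguments map task name to average minutes; tasks missing from
    the pilot data are skipped.
    """
    report = {}
    for task, before in baseline_minutes.items():
        after = piloted_minutes.get(task)
        if after is None:
            continue
        report[task] = {
            "minutes_saved": round(before - after, 1),
            "pct_saved": round(100.0 * (before - after) / before, 1),
        }
    return report
```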

&lt;h3&gt;Step 4: Scale What Works&lt;/h3&gt;

&lt;p&gt;Take the workflows that proved out in the pilot, document them properly, train the full team, and integrate them into your standard operating procedures. Then go back to Step 1 and audit the next layer of workflows. The gap is big enough that most businesses can run this cycle 3–4 times before they even approach the frontier of what's possible.&lt;/p&gt;

&lt;h2&gt;The Window Is Open&lt;/h2&gt;

&lt;p&gt;The Anthropic data makes one thing clear: there is an enormous gap between AI capability and AI adoption. That gap represents real, measurable efficiency sitting on the table right now.&lt;/p&gt;

&lt;p&gt;But gaps close. As tools get easier, as competitors catch on, as the next generation of workers arrives expecting AI-native workflows, the advantage of being early narrows.&lt;/p&gt;

&lt;p&gt;The companies that build their AI deployment framework now are compounding their advantage every month. The ones waiting for AI to "get better" are falling behind at the same rate. The technology is ready. The data proves it. The question is whether your organisation has the framework to actually use what's already available.&lt;/p&gt;

&lt;p&gt;This analysis is based on Anthropic's Labor Market Impacts research paper (March 2026) and the Anthropic Economic Index (January 2026), which together analysed over 2 million Claude conversations mapped against the US Bureau of Labor Statistics occupational database.&lt;/p&gt;

&lt;p&gt;Jarrad Bermingham is the founder of Steadwise AI and works as a Fractional Head of AI, helping businesses close the gap between AI capability and actual adoption. Connect on LinkedIn.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>productivity</category>
      <category>aiops</category>
    </item>
    <item>
      <title>I built a free alternative to LangSmith — one decorator, local SQLite, zero infrastructure</title>
      <dc:creator>Jarrad Bermingham</dc:creator>
      <pubDate>Wed, 25 Feb 2026 10:03:16 +0000</pubDate>
      <link>https://dev.to/jarradbermingham/i-built-a-free-alternative-to-langsmith-one-decorator-local-sqlite-zero-infrastructure-2ink</link>
      <guid>https://dev.to/jarradbermingham/i-built-a-free-alternative-to-langsmith-one-decorator-local-sqlite-zero-infrastructure-2ink</guid>
      <description>&lt;p&gt;LangSmith wants $400/month. Helicone needs you to proxy your AI traffic through their servers. Both require accounts, API keys, and sending your data to someone else's cloud.&lt;/p&gt;

&lt;p&gt;I just wanted to know what my AI agents were costing me.&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://github.com/Jbermingham1/bifrost-monitor" rel="noopener noreferrer"&gt;bifrost-monitor&lt;/a&gt; — a Python decorator that tracks every AI call locally. No accounts. No infrastructure. No data leaving your machine.&lt;/p&gt;

&lt;p&gt;Here's the full setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bifrost_monitor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;monitor&lt;/span&gt;

&lt;span class="nd"&gt;@monitor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;support-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_ticket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Every call gets tracked — duration, tokens, cost, errors — stored in a local SQLite file.&lt;/p&gt;
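&lt;p&gt;Because it's plain SQLite, you can query the log with anything that speaks SQL. The table and column names below are hypothetical, so check the project README for the actual schema:&lt;/p&gt;

```python
import sqlite3

# Hypothetical schema for illustration only; the real bifrost-monitor
# layout may differ.
SCHEMA = """CREATE TABLE IF NOT EXISTS calls (
    name TEXT, model TEXT, duration_ms REAL, cost_usd REAL, error TEXT)"""

def cost_by_agent(conn):
    """Total recorded cost per agent name, most expensive first."""
    rows = conn.execute(
        "SELECT name, SUM(cost_usd) FROM calls GROUP BY name ORDER BY 2 DESC"
    )
    return rows.fetchall()

# Usage against a monitor database file (path is illustrative):
#   conn = sqlite3.connect("bifrost_monitor.db")
#   for name, total in cost_by_agent(conn):
#       print(name, total)
```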




&lt;h2&gt;
  
  
  Why I Built This
&lt;/h2&gt;

&lt;p&gt;I was running multiple AI agents in production. Some used Claude, some used GPT-4o, one used Gemini. I had zero visibility into what any of them cost.&lt;/p&gt;

&lt;p&gt;The existing options felt wrong:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;LangSmith&lt;/th&gt;
&lt;th&gt;Helicone&lt;/th&gt;
&lt;th&gt;bifrost-monitor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Setup&lt;/td&gt;
&lt;td&gt;Account + API key + proxy&lt;/td&gt;
&lt;td&gt;Account + API proxy&lt;/td&gt;
&lt;td&gt;pip install&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;$400/mo+&lt;/td&gt;
&lt;td&gt;$50/mo+&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data location&lt;/td&gt;
&lt;td&gt;Their cloud&lt;/td&gt;
&lt;td&gt;Their cloud&lt;/td&gt;
&lt;td&gt;Your machine&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I didn't need a dashboard. I needed a decorator and a CLI.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Decorator
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;@monitor&lt;/code&gt; decorator wraps your function without changing it. Sync or async — it detects automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@monitor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;classifier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify_email&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="nd"&gt;@monitor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarizer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarize_doc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;anthropic_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Times execution with &lt;code&gt;time.perf_counter()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Auto-extracts token counts from the response object (duck-typed — works with Anthropic and OpenAI responses)&lt;/li&gt;
&lt;li&gt;Calculates cost using built-in pricing&lt;/li&gt;
&lt;li&gt;Records everything to SQLite&lt;/li&gt;
&lt;li&gt;Re-raises any exceptions after recording them&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The function behaves identically. Zero code changes to your business logic.&lt;/p&gt;
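&lt;p&gt;As a rough sketch of how that sync/async detection can work (illustrative only, not the package's actual source; the &lt;code&gt;record&lt;/code&gt; sink here is a hypothetical stand-in for the SQLite writer), a decorator can branch on &lt;code&gt;asyncio.iscoroutinefunction&lt;/code&gt;:&lt;/p&gt;

```python
import asyncio
import functools
import time

records = []  # stand-in for the real SQLite sink


def record(name, model, seconds):
    records.append((name, model, seconds))


def monitor(name, model=None):
    """Sketch of a sync/async-aware monitoring decorator (illustrative)."""
    def decorate(fn):
        if asyncio.iscoroutinefunction(fn):
            @functools.wraps(fn)
            async def async_wrapper(*args, **kwargs):
                start = time.perf_counter()
                try:
                    return await fn(*args, **kwargs)
                finally:
                    # record even when the call raises, then let it propagate
                    record(name, model, time.perf_counter() - start)
            return async_wrapper

        @functools.wraps(fn)
        def sync_wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                record(name, model, time.perf_counter() - start)
        return sync_wrapper
    return decorate
```

&lt;p&gt;The &lt;code&gt;try/finally&lt;/code&gt; is what lets a single decorator record durations for both successes and exceptions without swallowing errors.&lt;/p&gt;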

&lt;h3&gt;
  
  
  Auto Token Extraction
&lt;/h3&gt;

&lt;p&gt;This is the part I'm most pleased with. The decorator inspects your function's return value and detects token usage by duck typing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Anthropic responses → extracts:
#   usage.input_tokens
#   usage.output_tokens
#   usage.cache_read_input_tokens    (prompt caching)
#   usage.cache_creation_input_tokens
&lt;/span&gt;
&lt;span class="c1"&gt;# OpenAI responses → extracts:
#   usage.prompt_tokens
#   usage.completion_tokens
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your function returns something without a &lt;code&gt;.usage&lt;/code&gt; attribute, it still tracks everything else — duration, status, errors. Tokens just show as zero.&lt;/p&gt;
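&lt;p&gt;A minimal sketch of what that duck typing can look like (a hypothetical helper, not the package's internal API):&lt;/p&gt;

```python
def extract_tokens(result):
    """Duck-typed token extraction sketch (hypothetical helper).

    Reads whichever usage fields exist, so both Anthropic- and
    OpenAI-shaped responses work; anything without a .usage
    attribute falls back to zeros.
    """
    usage = getattr(result, "usage", None)
    if usage is None:
        return {"input": 0, "output": 0, "cache_read": 0, "cache_creation": 0}

    def pick(*names):
        # First present, non-None field wins
        for n in names:
            value = getattr(usage, n, None)
            if value is not None:
                return value
        return 0

    return {
        "input": pick("input_tokens", "prompt_tokens"),       # Anthropic, OpenAI
        "output": pick("output_tokens", "completion_tokens"),
        "cache_read": pick("cache_read_input_tokens"),
        "cache_creation": pick("cache_creation_input_tokens"),
    }
```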

&lt;h3&gt;
  
  
  Built-in Pricing
&lt;/h3&gt;

&lt;p&gt;13 models ship with current pricing (as of mid-2025):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic&lt;/strong&gt; — Claude Opus 4.6, Sonnet 4.6, Haiku 4.5 (including cache token rates)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI&lt;/strong&gt; — GPT-4o, GPT-4o-mini, GPT-4.1, GPT-4.1-mini, GPT-4.1-nano&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google&lt;/strong&gt; — Gemini 2.5 Pro, Gemini 2.5 Flash&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Custom models are one call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bifrost_monitor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ModelPricing&lt;/span&gt;

&lt;span class="n"&gt;pricing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ModelPricing&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;pricing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-fine-tune&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;input_per_m&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_per_m&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;15.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cache_read_per_m&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# optional
&lt;/span&gt;    &lt;span class="n"&gt;cache_creation_per_m&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# optional
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cost calculations use 8-decimal precision — accurate down to fractions of a cent across thousands of calls.&lt;/p&gt;
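&lt;p&gt;The underlying arithmetic is simple: rates are quoted per million tokens. A sketch of the per-call math (illustrative, not the package's implementation):&lt;/p&gt;

```python
def calc_cost(input_tokens, output_tokens, input_per_m, output_per_m,
              cache_read_tokens=0, cache_read_per_m=0.0):
    """Per-call cost sketch: rates are USD per million tokens."""
    cost = (
        input_tokens / 1_000_000 * input_per_m
        + output_tokens / 1_000_000 * output_per_m
        + cache_read_tokens / 1_000_000 * cache_read_per_m
    )
    # 8-decimal rounding keeps sub-cent amounts exact when summed over many calls
    return round(cost, 8)
```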

&lt;h3&gt;
  
  
  The CLI
&lt;/h3&gt;

&lt;p&gt;Query everything from your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# What am I spending, broken down by model?&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;bifrost-monitor costs &lt;span class="nt"&gt;--group-by&lt;/span&gt; model

&lt;span class="c"&gt;# Which agents are failing?&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;bifrost-monitor errors &lt;span class="nt"&gt;--last&lt;/span&gt; 7d

&lt;span class="c"&gt;# Full summary for a specific agent&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;bifrost-monitor summary &lt;span class="nt"&gt;--name&lt;/span&gt; support-agent &lt;span class="nt"&gt;--last&lt;/span&gt; 24h

&lt;span class="c"&gt;# Recent runs with status&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;bifrost-monitor runs &lt;span class="nt"&gt;--last&lt;/span&gt; 24h &lt;span class="nt"&gt;--status&lt;/span&gt; error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is color-coded Rich tables — green for success, red for errors, yellow for timeouts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pluggable Storage
&lt;/h3&gt;

&lt;p&gt;Storage is behind a protocol:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@runtime_checkable&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RunStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Protocol&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RunRecord&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RunRecord&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SQLite is the default (zero-config, stored at &lt;code&gt;~/.bifrost-monitor/runs.db&lt;/code&gt;). But the protocol means you could plug in PostgreSQL, DynamoDB, or anything else without changing a line of application code.&lt;/p&gt;
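&lt;p&gt;For example, a minimal in-memory backend satisfying the protocol might look like this (the &lt;code&gt;RunRecord&lt;/code&gt; fields here are assumed for illustration; any object with matching &lt;code&gt;save&lt;/code&gt; and &lt;code&gt;query&lt;/code&gt; methods passes the &lt;code&gt;runtime_checkable&lt;/code&gt; isinstance check):&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    # Minimal stand-in for the package's record type (illustrative fields only)
    name: str
    status: str = "success"
    cost: float = 0.0


class InMemoryRunStore:
    """Sketch of a custom RunStore backend: structural typing means
    no inheritance is required, just the right method shapes."""

    def __init__(self):
        self._records = []

    def save(self, record):
        self._records.append(record)

    def query(self, **kwargs):
        # Filter on any exact-match field, e.g. query(name="classifier")
        return [
            r for r in self._records
            if all(getattr(r, k, None) == v for k, v in kwargs.items())
        ]
```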

&lt;h3&gt;
  
  
  Pydantic Models, Not Dicts
&lt;/h3&gt;

&lt;p&gt;Every data structure is a Pydantic model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TokenUsage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;cache_read_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;cache_creation_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="nd"&gt;@property&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_tokens&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No untyped dictionaries floating around. Pyright strict mode passes with zero errors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three-Index SQLite Schema
&lt;/h3&gt;

&lt;p&gt;The database indexes &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;started_at&lt;/code&gt;, and &lt;code&gt;status&lt;/code&gt; — the three fields you filter on most. Queries stay fast even with thousands of recorded runs.&lt;/p&gt;
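&lt;p&gt;In plain SQLite terms, the idea looks roughly like this (table and column names assumed for illustration, not the package's actual schema):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE runs (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    started_at TEXT NOT NULL,
    status TEXT NOT NULL,
    cost REAL DEFAULT 0
);
-- One index per common filter column keeps those queries off full table scans
CREATE INDEX idx_runs_name ON runs(name);
CREATE INDEX idx_runs_started_at ON runs(started_at);
CREATE INDEX idx_runs_status ON runs(status);
""")

# EXPLAIN QUERY PLAN shows whether a filter uses an index or scans the table
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM runs WHERE status = 'error'"
).fetchall()
```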




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prompt caching changes the cost math.&lt;/strong&gt; Claude's cache tokens are 10x cheaper than standard input tokens. If you're not tracking cache hit rates, you're probably overestimating your costs. bifrost-monitor tracks &lt;code&gt;cache_read_tokens&lt;/code&gt; and &lt;code&gt;cache_creation_tokens&lt;/code&gt; separately so you can see the real numbers.&lt;/p&gt;
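&lt;p&gt;A back-of-envelope example of why this matters, using illustrative rates (US$3.00/M fresh input vs US$0.30/M cached reads) and ignoring the one-time cache-creation premium:&lt;/p&gt;

```python
# Illustrative rates, not real price sheets: cached reads at a tenth the cost
INPUT_PER_M = 3.00
CACHE_READ_PER_M = 0.30


def input_cost(fresh_tokens, cached_tokens):
    return (fresh_tokens / 1e6 * INPUT_PER_M
            + cached_tokens / 1e6 * CACHE_READ_PER_M)


# A 100K-token system prompt called 100 times
no_cache = input_cost(100_000, 0) * 100
# First call pays full price and warms the cache; the rest hit it
with_cache = input_cost(100_000, 0) + input_cost(0, 100_000) * 99
# Roughly $30.00 without caching vs $3.27 with it, at these example rates
```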

&lt;p&gt;&lt;strong&gt;The decorator pattern is underrated for observability.&lt;/strong&gt; Zero changes to the monitored function. No inheritance, no mixins, no context managers wrapping your code. Just &lt;code&gt;@monitor&lt;/code&gt; and you're done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Property-based testing catches edge cases you won't think of.&lt;/strong&gt; I used Hypothesis to verify that cost calculations are always non-negative, monotonically increasing with token count, and consistent across cache/non-cache scenarios. Three property tests caught two bugs that unit tests missed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;99 tests. 95% coverage. 0.41 seconds.&lt;/p&gt;

&lt;p&gt;The test suite includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unit tests for pricing accuracy (including cache token math)&lt;/li&gt;
&lt;li&gt;Decorator tests for both sync and async functions&lt;/li&gt;
&lt;li&gt;Token extraction tests against mock Anthropic and OpenAI response objects&lt;/li&gt;
&lt;li&gt;Property-based tests (Hypothesis) for cost calculation invariants&lt;/li&gt;
&lt;li&gt;Integration tests for the full decorator → storage → query pipeline&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;bifrost-monitor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bifrost_monitor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;monitor&lt;/span&gt;

&lt;span class="nd"&gt;@monitor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

&lt;span class="c1"&gt;# Later:
# $ bifrost-monitor costs --group-by model
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full source on &lt;a href="https://github.com/Jbermingham1/bifrost-monitor" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; — MIT licensed, 99 tests, typed with &lt;code&gt;py.typed&lt;/code&gt; marker.&lt;/p&gt;

&lt;p&gt;If you're running AI agents and don't know what they cost, this is the fastest way to find out. One import, five minutes, full visibility.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is the 9th open-source package I've shipped under &lt;a href="https://github.com/Jbermingham1" rel="noopener noreferrer"&gt;github.com/Jbermingham1&lt;/a&gt; — each one solves a specific pain point I hit building AI systems in production.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>opensource</category>
      <category>devtools</category>
    </item>
    <item>
      <title>The Wrapper Trap: Why Most Enterprise AI Projects Fail Before They Start</title>
      <dc:creator>Jarrad Bermingham</dc:creator>
      <pubDate>Thu, 12 Feb 2026 04:10:02 +0000</pubDate>
      <link>https://dev.to/jarradbermingham/the-wrapper-trap-why-most-enterprise-ai-projects-fail-before-they-start-2nla</link>
      <guid>https://dev.to/jarradbermingham/the-wrapper-trap-why-most-enterprise-ai-projects-fail-before-they-start-2nla</guid>
      <description>&lt;p&gt;I've assessed the AI readiness of 4 mid-market enterprises, analyzing 214+ repositories and hundreds of architecture decisions. The same anti-pattern appears in every single one.&lt;/p&gt;

&lt;p&gt;I call it the &lt;strong&gt;Wrapper Trap&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is the Wrapper Trap?
&lt;/h2&gt;

&lt;p&gt;The Wrapper Trap is when a company's "AI initiative" is a thin wrapper around an LLM API — typically OpenAI's chat completions endpoint — with no evaluation, no pipeline architecture, and no data integration.&lt;/p&gt;

&lt;p&gt;It looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The entire "AI feature"
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The company's AI roadmap is a single API call behind a UI.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why It's a Trap
&lt;/h2&gt;

&lt;p&gt;The Wrapper Trap feels productive. You ship something fast. The demo looks impressive. Leadership sees "AI" in the product.&lt;/p&gt;

&lt;p&gt;But three things happen:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. No Evaluation = No Improvement
&lt;/h3&gt;

&lt;p&gt;Without measurement, you can't improve. When every response comes from a black box with no scoring, no retrieval metrics, no user feedback loop — you have no idea if your "AI feature" is working.&lt;/p&gt;

&lt;p&gt;I've seen companies run wrapper-based AI features for 6+ months with zero measurement of answer quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. No Data Integration = No Moat
&lt;/h3&gt;

&lt;p&gt;A wrapper doesn't use your data. It uses OpenAI's training data. Which means any competitor can build the exact same thing in an afternoon.&lt;/p&gt;

&lt;p&gt;The companies that build defensible AI products integrate their proprietary data: customer interactions, domain-specific knowledge bases, internal processes. That requires RAG pipelines, embedding strategies, and evaluation harnesses — not a single API call.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Scaling Costs Explode
&lt;/h3&gt;

&lt;p&gt;Wrappers send the entire context every time. No caching, no chunking, no retrieval optimization. When usage scales 10x, costs scale 10x.&lt;/p&gt;

&lt;p&gt;Production AI systems use vector retrieval to send only relevant context. A well-built RAG pipeline can reduce token costs by 60–80% while improving answer quality.&lt;/p&gt;
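&lt;p&gt;To make the idea concrete, here is a toy retrieval step, with keyword overlap standing in for embedding similarity (illustrative only):&lt;/p&gt;

```python
def top_k_chunks(query, chunks, k=2):
    """Toy retrieval stand-in: scores chunks by word overlap with the query.
    Real pipelines use embeddings and a vector store, but the cost effect is
    the same: only the top-k chunks reach the prompt, not the whole corpus."""
    query_words = set(query.lower().split())
    return sorted(
        chunks,
        key=lambda c: len(query_words.intersection(c.lower().split())),
        reverse=True,
    )[:k]


docs = [
    "refund policy: refunds are issued within 14 days",
    "shipping times vary by region",
    "refund requests require an order number",
    "our office hours are 9 to 5",
]
# Only the two refund-related chunks go into the prompt
context = top_k_chunks("how do I get a refund", docs)
```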




&lt;h2&gt;
  
  
  The Other Anti-Patterns
&lt;/h2&gt;

&lt;p&gt;The Wrapper Trap is the most common, but it's not alone. Across 214+ repos, I've identified a consistent pattern set:&lt;/p&gt;

&lt;h3&gt;
  
  
  The Island Problem
&lt;/h3&gt;

&lt;p&gt;AI features built in isolation from each other: three teams in the same company, each building their own OpenAI integration with their own prompt library, their own error handling, and zero shared infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; Duplicated engineering effort, inconsistent user experience, no knowledge sharing.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Prompt-Only Architecture
&lt;/h3&gt;

&lt;p&gt;All intelligence lives in the prompt. No tool use, no retrieval, no structured outputs. When the model changes or the prompt gets too long, everything breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; Fragile systems that degrade unpredictably with model updates.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Dashboard Trap
&lt;/h3&gt;

&lt;p&gt;Analytics dashboards that report on AI usage (API calls, tokens consumed, cost) but not AI performance (answer quality, user satisfaction, task completion rate).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; Optimizing for the wrong metrics. Cost goes down, value goes down with it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Good Looks Like
&lt;/h2&gt;

&lt;p&gt;The enterprises getting value from AI share common traits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pipeline architecture, not wrappers.&lt;/strong&gt; Multiple agents with defined roles, shared context, and fault tolerance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation from day one.&lt;/strong&gt; Precision@K, recall, MRR — measured continuously, not as a one-time benchmark.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data integration as a first-class concern.&lt;/strong&gt; Vector stores, chunking strategies, embedding pipelines. Your data is your moat.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared AI infrastructure.&lt;/strong&gt; One team owns the foundation (embedding service, evaluation harness, prompt library). Product teams build on top.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measurable outcomes.&lt;/strong&gt; Not "we added AI" but "answer quality improved 23% while token costs decreased 40%."&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  How to Escape
&lt;/h2&gt;

&lt;p&gt;If you recognize the Wrapper Trap in your organization:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Measure.&lt;/strong&gt; Add evaluation to your existing AI features. Even simple metrics (user thumbs up/down, task completion rate) reveal whether your wrapper is delivering value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Retrieve.&lt;/strong&gt; Build a retrieval pipeline for your domain data. ChromaDB locally, Pinecone for scale. Ground your AI in your data, not just the base model's training set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Evaluate.&lt;/strong&gt; Build an evaluation harness. Track Precision@K, Recall@K, MRR. Know whether your retrieval is actually finding the right information.&lt;/p&gt;
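&lt;p&gt;These metrics are a few lines each. A minimal sketch using the standard definitions (not tied to any particular library):&lt;/p&gt;

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    top = retrieved[:k]
    return sum(1 for d in top if d in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items found in the top-k."""
    top = retrieved[:k]
    return sum(1 for d in relevant if d in top) / len(relevant)


def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant hit per query."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```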

&lt;p&gt;&lt;strong&gt;Step 4: Orchestrate.&lt;/strong&gt; Replace the single API call with a pipeline. Chunking → Retrieval → Generation → Evaluation. Each step measurable, each step improvable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Assessment Framework
&lt;/h2&gt;

&lt;p&gt;At Bifrost Labs, I built the AI Readiness Scanner to automate this assessment across 8 dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data readiness&lt;/li&gt;
&lt;li&gt;Architecture maturity&lt;/li&gt;
&lt;li&gt;Evaluation capability&lt;/li&gt;
&lt;li&gt;Pipeline sophistication&lt;/li&gt;
&lt;li&gt;Infrastructure (containerization, CI/CD)&lt;/li&gt;
&lt;li&gt;Team capability&lt;/li&gt;
&lt;li&gt;Integration depth&lt;/li&gt;
&lt;li&gt;Governance and monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The methodology behind the scanner identifies these anti-patterns from public signals — repository structure, dependency choices, architecture patterns, and documentation quality.&lt;/p&gt;

&lt;p&gt;The 4 assessments delivered so far have identified $50K–$200K in automation opportunities per company. The biggest wins always come from escaping the Wrapper Trap.&lt;/p&gt;




&lt;p&gt;I assess enterprise AI readiness at &lt;a href="https://github.com/Jbermingham1" rel="noopener noreferrer"&gt;github.com/Jbermingham1&lt;/a&gt;. If you want to know where your organization stands, the assessment starts with your code — not a survey.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>enterprise</category>
      <category>portfolio</category>
      <category>architecture</category>
    </item>
    <item>
      <title>I Built a Framework for Multi-Agent MCP Servers in Python — Here's How</title>
      <dc:creator>Jarrad Bermingham</dc:creator>
      <pubDate>Wed, 11 Feb 2026 13:48:23 +0000</pubDate>
      <link>https://dev.to/jarradbermingham/i-built-a-framework-for-multi-agent-mcp-servers-in-python-heres-how-2m0k</link>
      <guid>https://dev.to/jarradbermingham/i-built-a-framework-for-multi-agent-mcp-servers-in-python-heres-how-2m0k</guid>
      <description>&lt;p&gt;Most MCP servers do one thing: wrap a single API call as a tool. But what if your tool needs multiple AI agents collaborating — analyzing, scoring, and reporting — before returning a result?&lt;br&gt;
That's the problem I solved with agent-mcp-framework, an open-source Python library for building multi-agent MCP servers. Define agents, compose them into pipelines, and expose the whole thing as MCP tools that Claude, VSCode, or any MCP client can call.&lt;br&gt;
Here's how it works and why I built it this way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I was building an internal tool that analyzes codebases — think automated code review with multiple specialized agents: one for quality issues, one for security vulnerabilities, one for architecture patterns, and one that combines everything into a scored report.&lt;/p&gt;

&lt;p&gt;The MCP SDK gives you FastMCP for exposing tools, but there's no built-in way to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define reusable agent abstractions with lifecycle hooks&lt;/li&gt;
&lt;li&gt;Compose agents into sequential, parallel, or conditional workflows&lt;/li&gt;
&lt;li&gt;Handle errors gracefully across a multi-step pipeline&lt;/li&gt;
&lt;li&gt;Format results consistently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I needed infrastructure. So I built it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent (unit of work)
  → Pipeline (composition pattern)
    → AgentMCPServer (MCP exposure layer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Three layers, each doing one thing.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Agents — The Building Blocks
&lt;/h3&gt;

&lt;p&gt;Every agent subclasses &lt;code&gt;Agent&lt;/code&gt; and implements &lt;code&gt;run()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from agent_mcp_framework import Agent, AgentContext, AgentResult

class SecurityScanner(Agent):
    async def run(self, context: AgentContext) -&amp;gt; AgentResult:
        code = context.get("code", "")
        findings = []

        dangerous_patterns = ["eval(", "exec(", "os.system("]
        for pattern in dangerous_patterns:
            if pattern in code:
                findings.append(f"Found {pattern} — potential injection risk")

        context.set("security_findings", findings)
        return AgentResult(
            success=True,
            output={"findings": findings, "count": len(findings)},
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;AgentContext&lt;/code&gt; is a shared data store that agents read from and write to. Each agent is self-contained — it pulls what it needs, does its work, and pushes results back.&lt;/p&gt;

&lt;p&gt;There are three agent types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent&lt;/strong&gt; — subclass and implement &lt;code&gt;run()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLMAgent&lt;/strong&gt; — built-in Anthropic client for Claude-powered agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FunctionAgent&lt;/strong&gt; — wrap any async function without subclassing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All agents get lifecycle hooks (&lt;code&gt;before_run&lt;/code&gt;, &lt;code&gt;after_run&lt;/code&gt;, &lt;code&gt;on_error&lt;/code&gt;), automatic timing, and error handling for free.&lt;/p&gt;
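&lt;p&gt;A sketch of how a base class can provide that for free (illustrative and synchronous for brevity; the framework itself is async, and its internals may differ):&lt;/p&gt;

```python
import time


class AgentBase:
    """Sketch of hook-and-timing plumbing: subclasses implement run(),
    and execute() wraps it with the lifecycle. Illustrative only."""

    def __init__(self, name):
        self.name = name

    # Lifecycle hooks: no-ops unless a subclass overrides them
    def before_run(self, context): pass
    def after_run(self, context, result): pass
    def on_error(self, context, exc): pass

    def run(self, context):
        raise NotImplementedError

    def execute(self, context):
        self.before_run(context)
        start = time.perf_counter()
        try:
            result = self.run(context)  # assumed here to return a dict
        except Exception as exc:
            self.on_error(context, exc)
            raise
        result["duration_s"] = time.perf_counter() - start
        self.after_run(context, result)
        return result
```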

&lt;h3&gt;
  
  
  2. Pipelines — The Composition Layer
&lt;/h3&gt;

&lt;p&gt;This is where it gets interesting. Four patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sequential&lt;/strong&gt; — agents run one after another, each seeing the updated context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from agent_mcp_framework import SequentialPipeline

pipeline = SequentialPipeline("review", agents=[
    QualityAnalyzer("quality"),
    SecurityScanner("security"),
    ReportGenerator("reporter"),
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Parallel&lt;/strong&gt; — agents run concurrently with isolated context copies (no race conditions), merged back after completion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from agent_mcp_framework import ParallelPipeline

# All three analyze simultaneously, results merge
analysis = ParallelPipeline("analysis", agents=[
    QualityAnalyzer("quality"),
    SecurityScanner("security"),
    ArchitectureReviewer("architecture"),
], max_concurrency=3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Conditional&lt;/strong&gt; — route to different agents based on context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from agent_mcp_framework import ConditionalPipeline

def router(ctx):
    if ctx.get("language") == "python":
        return "python-analyzer"
    return "generic-analyzer"

pipeline = ConditionalPipeline("route", agents=[
    PythonAnalyzer("python-analyzer"),
    GenericAnalyzer("generic-analyzer"),
], router=router)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;MapReduce&lt;/strong&gt; — split work across agents, then reduce:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from agent_mcp_framework import MapReducePipeline, AgentContext

pipeline = MapReducePipeline("batch",
    agents=[FileAnalyzer(f"worker-{i}") for i in range(4)],
    splitter=lambda ctx: [
        AgentContext(data={"file": f}) for f in ctx.get("files")
    ],
    reducer=lambda results, ctx: ctx.set(
        "all_results", [r.output for r in results]
    ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  3. MCP Server — The Exposure Layer
&lt;/h3&gt;

&lt;p&gt;One line to turn any pipeline into an MCP tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from agent_mcp_framework import AgentMCPServer

server = AgentMCPServer("code-review", description="Multi-agent code review")
server.add_pipeline_tool(
    pipeline,
    name="review_code",
    description="Analyze code for quality, security, and architecture issues.",
)

server.run()  # Starts MCP server on stdio
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now any MCP client can call &lt;code&gt;review_code&lt;/code&gt; and get a multi-agent analysis back.&lt;/p&gt;

&lt;p&gt;Design Decision: Context Isolation in Parallel Pipelines&lt;br&gt;
The trickiest part was parallel execution. When multiple agents run concurrently on the same context, you get race conditions — two agents writing to the same key, lost updates, stale reads.&lt;br&gt;
My solution: each parallel agent gets a deep copy of the context. After all agents complete, their contexts merge back into the original. This means:&lt;/p&gt;

&lt;p&gt;No locks, no mutexes, no shared mutable state&lt;br&gt;
Each agent writes freely without stepping on others&lt;br&gt;
The merge is deterministic (last-write-wins per key)&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Inside ParallelPipeline.execute():
snapshots = [ctx.model_copy(deep=True) for _ in self.agents]

results = await asyncio.gather(
    *[a.execute(s) for a, s in zip(self.agents, snapshots)]
)

# Merge back
for snap in snapshots:
    ctx.data.update(snap.data)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Simple, correct, no surprises.&lt;/p&gt;
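&lt;p&gt;The semantics are easy to check in isolation. A framework-free sketch of snapshot-and-merge over plain dicts (the names here are illustrative), showing that the later snapshot deterministically wins on key conflicts:&lt;/p&gt;

```python
from copy import deepcopy

def merge_snapshots(original: dict, snapshots: list) -> dict:
    """Merge per-agent snapshots back into the context, last-write-wins per key."""
    merged = deepcopy(original)
    for snap in snapshots:
        merged.update(snap)
    return merged

ctx = {"input": "data"}
# Each parallel agent mutated its own deep copy:
a = {"input": "data", "quality": 3, "shared": "from-a"}
b = {"input": "data", "security": 1, "shared": "from-b"}

result = merge_snapshots(ctx, [a, b])
# "shared" resolves to the last snapshot's value, "from-b"
```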

&lt;p&gt;Real-World Use Case: Code Review Server&lt;br&gt;
The repo includes a complete code review server example with four agents:&lt;/p&gt;

&lt;p&gt;QualityAnalyzer — checks line length, wildcard imports, missing docstrings&lt;br&gt;
SecurityScanner — detects eval(), exec(), os.system(), pickle.loads()&lt;br&gt;
ArchitectureReviewer — flags too many classes, global state, deep nesting&lt;br&gt;
ReportGenerator — combines findings into a scored report (A through F)&lt;/p&gt;

&lt;p&gt;The analysis agents run in parallel (they're independent), then the report generator runs sequentially (it needs all findings).&lt;br&gt;
Here's what a scan of insecure code produces:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "score": 52,
  "grade": "C",
  "quality": {"count": 3, "issues": ["..."]},
  "security": {"count": 1, "findings": ["eval() — potential code injection"]},
  "architecture": {"count": 1, "notes": ["Global state detected"]}
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I've used this same pattern to build internal tools that analyze entire repositories — scanning tech stacks, detecting anti-patterns, and producing readiness assessments. The framework handles the orchestration; the domain logic lives in the agents.&lt;/p&gt;
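&lt;p&gt;Stripped of the framework, that parallel-then-sequential shape is just asyncio.gather followed by one reducing step. A minimal, framework-free sketch (the analyzer functions and report keys are illustrative, not the repo's actual agents):&lt;/p&gt;

```python
import asyncio

async def quality(code: str) -> dict:
    # Stand-in quality analyzer: counts TODO markers.
    return {"quality": {"count": code.count("TODO")}}

async def security(code: str) -> dict:
    # Stand-in security scanner: flags eval() usage.
    findings = ["eval() — potential code injection"] if "eval(" in code else []
    return {"security": {"count": len(findings), "findings": findings}}

async def review(code: str) -> dict:
    # Independent analyzers run concurrently...
    findings = await asyncio.gather(quality(code), security(code))
    # ...then the report step runs sequentially, because it needs all findings.
    report = {}
    for f in findings:
        report.update(f)
    report["total_issues"] = sum(v["count"] for v in report.values())
    return report

result = asyncio.run(review("eval(user_input)  # TODO: sanitize"))
```

&lt;p&gt;The framework's parallel and sequential pipelines encapsulate exactly this split, plus the context snapshot/merge handling.&lt;/p&gt;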

&lt;p&gt;What I'd Build Next&lt;br&gt;
The framework is intentionally minimal right now — agents, pipelines, MCP server. Things I'm considering:&lt;/p&gt;

&lt;p&gt;Agent-to-agent messaging — let agents communicate mid-pipeline&lt;br&gt;
Retry policies — configurable retry with backoff for flaky LLM calls&lt;br&gt;
Streaming results — progressive output as agents complete&lt;br&gt;
Pipeline visualization — render the DAG of agent dependencies&lt;/p&gt;

&lt;p&gt;Try It&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;pip install agent-mcp-framework&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;from agent_mcp_framework import Agent, AgentContext, AgentResult, SequentialPipeline

class MyAgent(Agent):
    async def run(self, context: AgentContext) -&amp;gt; AgentResult:
        data = context.get("input", "")
        return AgentResult(success=True, output=f"Processed: {data}")

pipeline = SequentialPipeline("demo", agents=[MyAgent("worker")])&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;80 tests. Zero lint errors. Typed with py.typed marker. MIT licensed.&lt;/p&gt;

&lt;p&gt;GitHub: github.com/Jbermingham1/agent-mcp-framework&lt;br&gt;
PyPI: pypi.org/project/agent-mcp-framework&lt;/p&gt;

&lt;p&gt;If you're building multi-agent systems with MCP, I'd love to hear how you're approaching composition and orchestration. Drop a comment or open an issue on the repo.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>mcp</category>
      <category>agents</category>
    </item>
    <item>
      <title>214 Repos, Zero ML: What Public Signals Reveal About Mid-Market SaaS AI Strategy</title>
      <dc:creator>Jarrad Bermingham</dc:creator>
      <pubDate>Tue, 10 Feb 2026 08:17:00 +0000</pubDate>
      <link>https://dev.to/jarradbermingham/214-repos-zero-ml-what-public-signals-reveal-about-mid-market-saas-ai-strategy-34m8</link>
      <guid>https://dev.to/jarradbermingham/214-repos-zero-ml-what-public-signals-reveal-about-mid-market-saas-ai-strategy-34m8</guid>
      <description>&lt;p&gt;&lt;strong&gt;I analyzed the AI capabilities of 4 mid-market SaaS companies using only public data. Three distinct failure patterns emerged.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Experiment&lt;br&gt;
Over the past week, I ran public-signal analyses on 4 mid-market SaaS companies in the HR tech space. Each company positions AI as a core capability. Each has shipped multiple AI features. Each markets AI as its competitive differentiator.&lt;br&gt;
I wanted to know: does the public engineering signal match the marketing?&lt;br&gt;
The companies (anonymized):&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Company&lt;/th&gt;&lt;th&gt;Revenue&lt;/th&gt;&lt;th&gt;Employees&lt;/th&gt;&lt;th&gt;Customers&lt;/th&gt;&lt;th&gt;AI Features Shipped&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Company A&lt;/td&gt;&lt;td&gt;~$200M+ ARR&lt;/td&gt;&lt;td&gt;900+&lt;/td&gt;&lt;td&gt;5,000+&lt;/td&gt;&lt;td&gt;3+ (analytics, summaries)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Company B&lt;/td&gt;&lt;td&gt;~$25-50M&lt;/td&gt;&lt;td&gt;~136&lt;/td&gt;&lt;td&gt;1,500+&lt;/td&gt;&lt;td&gt;6 (copilot, review wizard, meeting coach)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Company C&lt;/td&gt;&lt;td&gt;~$75-100M&lt;/td&gt;&lt;td&gt;~287&lt;/td&gt;&lt;td&gt;3,500+&lt;/td&gt;&lt;td&gt;6+ (predictive model, AI coach, reviews)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Company D&lt;/td&gt;&lt;td&gt;~$50-57M&lt;/td&gt;&lt;td&gt;~424-535&lt;/td&gt;&lt;td&gt;3,000+&lt;/td&gt;&lt;td&gt;5+ (predictive analytics, AI scheduling, gen-AI)&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Combined: ~$250-300M revenue. ~1,600 employees. 10,000+ customers.&lt;br&gt;
The Signal: GitHub&lt;br&gt;
Public GitHub repositories are an imperfect but meaningful signal of engineering capability. They show what a company's engineering team builds, values, and invests in.&lt;br&gt;
Across all 4 companies:&lt;/p&gt;

&lt;p&gt;Total public repos: 214 (178 + 5 + 29 + 2)&lt;br&gt;
Repos involving machine learning: 0&lt;br&gt;
Repos involving data science: 0&lt;br&gt;
Repos involving model training: 0&lt;/p&gt;
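&lt;p&gt;The tally reduces to a keyword screen over public repository metadata. A minimal sketch of the idea (the signal list and sample repos are illustrative; the full analysis also weighed job postings and press coverage):&lt;/p&gt;

```python
ML_SIGNALS = {"ml", "machine-learning", "pytorch", "tensorflow", "sklearn",
              "model-training", "data-science", "llm"}

def ml_repos(repos: list) -> list:
    """Return names of repos whose topics or description suggest ML work."""
    hits = []
    for repo in repos:
        haystack = set(repo.get("topics", []))
        haystack.update(repo.get("description", "").lower().split())
        if haystack & ML_SIGNALS:
            hits.append(repo["name"])
    return hits

repos = [
    {"name": "deploy-tools", "topics": ["terraform"], "description": "Deployment tooling"},
    {"name": "churn-model", "topics": ["pytorch"], "description": "Model training pipeline"},
]
# Only churn-model trips the screen.
```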

&lt;p&gt;Company A has 178 repos — all infrastructure, deployment tooling, and frontend libraries. Company C has 29 repos — all Django utilities, CI/CD, and API integrations. Company B has 5 repos — all forks of third-party libraries. Company D has 2 repos — one hackathon project and one webhook example.&lt;br&gt;
Not a single company has a public repository that touches ML, data science, or model training.&lt;/p&gt;

&lt;p&gt;Three Failure Patterns&lt;br&gt;
Pattern 1: The Wrapper Trap&lt;br&gt;
Seen in: Companies A, B (most clearly)&lt;br&gt;
The Wrapper Trap is the most common pattern: shipping AI features built on managed LLM APIs (OpenAI, Claude, etc.) and marketing them as competitive differentiation.&lt;br&gt;
The problem: if your AI is a wrapper on someone else's model, your competitor ships the same feature in weeks using the same API. There's no moat. There's no differentiation. There's a press release.&lt;br&gt;
Company B (136 employees) has shipped 6 AI features in a year using this pattern. Which means a competitor with 600 employees can ship the same features in a quarter. The advantage isn't speed — it's the data. But the data is being used for dashboards, not training.&lt;/p&gt;

&lt;p&gt;Pattern 2: AI by Acquisition&lt;br&gt;
Seen in: Companies C, D (most clearly)&lt;br&gt;
When companies recognize they need AI capability faster than they can build it, they acquire. Company C acquired an AI coaching startup. Company D acquired an AI-powered scheduling company.&lt;br&gt;
The strategy is understandable. The risk is structural.&lt;br&gt;
Post-acquisition engineering attrition averages 33% within 18 months. If the acquired team leaves and the acquiring company has zero ML repos and no ML hiring, they've paid for capability they can't maintain.&lt;br&gt;
Company C has 29 GitHub repos — zero ML — and recently acquired an AI startup. Their engineering team has contracted ~35% in the past year. Who maintains the AI if the acquired engineers leave?&lt;/p&gt;

&lt;p&gt;Pattern 3: The Island Problem&lt;br&gt;
Seen in: Company D (most clearly)&lt;br&gt;
The Island Problem appears in companies that have solved the Wrapper Trap through multiple acquisitions or partnerships. They have genuine AI assets — but those assets can't talk to each other.&lt;br&gt;
Company D has:&lt;/p&gt;

&lt;p&gt;A university R&amp;amp;D partnership producing predictive ML models&lt;br&gt;
An acquired startup running AI-powered scheduling algorithms&lt;br&gt;
Core platform gen-AI features via LLM APIs&lt;/p&gt;

&lt;p&gt;Three AI engines. Three different tech stacks. Three separate data models. Zero cross-pollination.&lt;br&gt;
The scheduling AI can't learn from engagement data. The predictive models can't improve scheduling. The LLM features can't leverage either model's intelligence.&lt;br&gt;
The whole is less than the sum of parts.&lt;br&gt;
The Data Moat Paradox&lt;br&gt;
Here's the most striking finding: the companies sitting on the richest proprietary data are the worst at using it for AI.&lt;br&gt;
Combined, these 4 companies have:&lt;/p&gt;

&lt;p&gt;Billions of proprietary data points (survey responses, performance reviews, scheduling patterns, learning completions)&lt;br&gt;
Decades of domain expertise&lt;br&gt;
Millions of end users generating continuous data&lt;/p&gt;

&lt;p&gt;All of it is being used for dashboards, analytics, and reports. None of it is being used for model training, fine-tuning, or domain-specific intelligence.&lt;br&gt;
The data moats exist. The ML engineering to exploit them doesn't.&lt;br&gt;
What Would Fix This&lt;br&gt;
The fix isn't more AI features. It's three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;One ML engineer.
A single ML engineer changes the trajectory. They can evaluate whether "predictive models" are genuine ML or regression-with-marketing. They can fine-tune a model on proprietary data. They can assess whether an acquisition's AI is maintainable.&lt;/li&gt;
&lt;li&gt;Data as training asset, not storage problem.
Restructuring data pipelines to support model training — not just dashboards — is a one-time investment that compounds indefinitely. Every survey response, every performance review, every scheduling outcome becomes training data for models competitors can't replicate.&lt;/li&gt;
&lt;li&gt;Integration over acquisition.
Before the next AI acquisition, invest in connecting existing AI assets. Make the scheduling AI learn from the engagement data. Make the predictive model inform the coaching recommendations. Integration creates compound value; isolated acquisition creates diminishing returns.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Gap Is Widening&lt;br&gt;
The gap between AI marketing and AI engineering in mid-market SaaS isn't closing. It's widening. Companies are shipping more AI features while building less ML capability.&lt;br&gt;
Right now, in February 2026, I'm not seeing mid-market SaaS companies closing this gap on their own. The data moats are there. The engineering to exploit them isn't.&lt;/p&gt;

&lt;p&gt;Methodology: All analysis based on public signals — GitHub repositories, job postings, product pages, press coverage, and financial data. No proprietary information was accessed. All companies anonymized.&lt;br&gt;
I'm Jarrad Bermingham — I build production AI agent systems and open-source developer tooling at Bifrost Labs. Find our tools @bifrostlabs on npm.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>saas</category>
      <category>machinelearning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>I Built a 13-Agent AI System That Reviews Its Own Decisions. Here's the Architecture.</title>
      <dc:creator>Jarrad Bermingham</dc:creator>
      <pubDate>Mon, 09 Feb 2026 13:15:25 +0000</pubDate>
      <link>https://dev.to/jarradbermingham/i-built-a-13-agent-ai-system-that-reviews-its-own-decisions-heres-the-architecture-pbd</link>
      <guid>https://dev.to/jarradbermingham/i-built-a-13-agent-ai-system-that-reviews-its-own-decisions-heres-the-architecture-pbd</guid>
      <description>&lt;p&gt;Most people use Claude Code to write functions. I built a system where 13 specialized AI agents coordinate, challenge each other, and collectively make better decisions than any single agent could alone.&lt;br&gt;
This isn't a weekend prototype. It runs daily. It has 141 tests across two shipped npm packages. It once scored a business opportunity 88/100 — without running a single web search. That failure, and six others like it, shaped every design decision you'll read below.&lt;br&gt;
Here's the full architecture: routing engine, adversarial verification, lifecycle hooks, memory system, and the specific failures that made each one necessary.&lt;/p&gt;

&lt;p&gt;Why Multi-Agent?&lt;br&gt;
Single-agent AI has a ceiling, and it's lower than most people think.&lt;br&gt;
Ask one agent to research a market, build the product, AND evaluate whether it's worth building — you get confirmation bias baked into every step. The agent that researched will defend its findings. The agent that built will justify its architecture. The agent that evaluated will anchor on work already done.&lt;br&gt;
I hit this wall repeatedly. My single-agent setup would produce 2,000 words of reasoning explaining why a strategy was brilliant, then I'd spend 10 minutes on Google and find three competitors doing it better. The reasoning was airtight. The premises were wrong.&lt;br&gt;
Multi-agent orchestration solves this through separation of concerns:&lt;/p&gt;

&lt;p&gt;Specialists handle what they're good at&lt;br&gt;
Adversaries exist solely to find flaws&lt;br&gt;
A coordinator synthesizes without getting attached to any one perspective&lt;/p&gt;

&lt;p&gt;The result: better decisions, caught earlier, with a paper trail of why.&lt;/p&gt;

&lt;p&gt;The Architecture&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;User (CEO) → "What" — business intent
      ↓
Lead Agent (CTO) → "How" — all technical decisions
      ↓
  ┌─────────────┬─────────────┬──────────────┐
  │  Routing:   │             │              │
  │  Skill?     │  Solo?      │  Subagent?   │  Agent Team?
  ↓             ↓             ↓              ↓
Skills /cmd   Solo Execute  Static Agents  Dynamic Teams
(19 skills)   (&amp;lt; 2 min)    (1-3 agents)   (custom composition
                                           + mandatory adversary)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The critical design decision: the lead agent never codes directly on complex tasks. It classifies the request, composes the right team, delegates, and synthesizes. The coordinator coordinates. The specialists specialize.&lt;br&gt;
This sounds obvious until you build it. My first instinct was to have the lead agent "help out" when it knew the answer. That creates a god-agent that subtly biases team output because it already has an opinion before the specialists even start. Forcing strict delegation eliminated an entire class of coordination bugs.&lt;/p&gt;

&lt;p&gt;The Routing Engine&lt;br&gt;
Every request hits a four-level decision tree before any work begins:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;→ Existing skill handles it?           → DELEGATE (19 skills)
→ Trivial (&amp;lt; 2 min, well-defined)?     → SOLO (lead executes directly)
→ Moderate (2-5 steps, single domain)? → SPAWN 1-3 specialist agents
→ Complex (5+ steps OR cross-domain)?  → AGENT TEAM (dynamic + mandatory DA)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Routing matters because agent overhead is real. Spawning a 3-agent team to rename a variable wastes tokens, time, and context window. Running solo on a strategic decision is dangerous — you're back to single-agent confirmation bias.&lt;br&gt;
The router's job is proportional response. Match the complexity of the tool to the complexity of the task.&lt;br&gt;
One rule I learned the hard way: when in doubt, route UP, not down. Treating a complex task as moderate is far more costly than treating a moderate task as complex. Over-routing wastes tokens. Under-routing wastes decisions.&lt;/p&gt;
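&lt;p&gt;As a classifier, the decision tree is a few lines. A sketch of the proportional-response logic, including the route-UP default (the task fields and route names are illustrative; the thresholds mirror the tree above):&lt;/p&gt;

```python
def route(task: dict) -> str:
    """Four-level routing: skill match, solo, specialists, full team."""
    if task.get("skill_match"):
        return "DELEGATE"
    steps = task.get("steps", 0)
    if steps >= 5 or task.get("cross_domain"):
        return "AGENT_TEAM"         # dynamic team + mandatory Devil's Advocate
    if steps >= 2:
        return "SPAWN_SPECIALISTS"  # 1-3 static agents, single domain
    if task.get("well_defined") and task.get("est_minutes", 99) < 2:
        return "SOLO"
    return "AGENT_TEAM"             # when in doubt, route UP, not down
```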

&lt;p&gt;The 13 Static Agents&lt;br&gt;
Each agent has a defined role, model tier, toolset, and memory scope:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Agent&lt;/th&gt;&lt;th&gt;Function&lt;/th&gt;&lt;th&gt;Model&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Devil's Advocate&lt;/td&gt;&lt;td&gt;Find flaws, score proposals, kill bad ideas&lt;/td&gt;&lt;td&gt;Opus&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Strategic Advisor&lt;/td&gt;&lt;td&gt;High-level strategy, market positioning&lt;/td&gt;&lt;td&gt;Opus&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Opponent Modeler&lt;/td&gt;&lt;td&gt;Game theory, competitive analysis&lt;/td&gt;&lt;td&gt;Opus&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Researcher&lt;/td&gt;&lt;td&gt;Web research, data gathering, market analysis&lt;/td&gt;&lt;td&gt;Sonnet&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Verifier&lt;/td&gt;&lt;td&gt;Quality gates, completion validation&lt;/td&gt;&lt;td&gt;Sonnet&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Test Runner&lt;/td&gt;&lt;td&gt;Execute and validate test suites&lt;/td&gt;&lt;td&gt;Sonnet&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Debugger&lt;/td&gt;&lt;td&gt;Root cause analysis, error diagnosis&lt;/td&gt;&lt;td&gt;Sonnet&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Code Reviewer&lt;/td&gt;&lt;td&gt;Architecture review, anti-patterns, code quality&lt;/td&gt;&lt;td&gt;Sonnet&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Security Auditor&lt;/td&gt;&lt;td&gt;Vulnerability detection, dependency risks&lt;/td&gt;&lt;td&gt;Sonnet&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Session Distiller&lt;/td&gt;&lt;td&gt;Compress session learnings for future context&lt;/td&gt;&lt;td&gt;Sonnet&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Upgrade Analyst&lt;/td&gt;&lt;td&gt;Dependency analysis, breaking change detection&lt;/td&gt;&lt;td&gt;Sonnet&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;PR Reviewer&lt;/td&gt;&lt;td&gt;Pull request quality, merge readiness&lt;/td&gt;&lt;td&gt;Sonnet&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;RLM Processor&lt;/td&gt;&lt;td&gt;Recursive reasoning, iterative refinement&lt;/td&gt;&lt;td&gt;Sonnet&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Notice the model distribution: only 3 agents run on Opus (the reasoning-heavy ones), and the rest use Sonnet. This wasn't the original design — I'll explain why in the Failures section.&lt;br&gt;
Static agents handle solo specialist delegation. For complex tasks, the system composes dynamic teams — fresh agents with custom system prompts tailored to the specific problem. No two teams look the same, because no two complex problems have the same shape.&lt;/p&gt;

&lt;p&gt;The Part That Changed Everything: Adversarial Verification&lt;br&gt;
This is the section I wish someone had written before I learned it the hard way.&lt;br&gt;
Early in the project, I had the system evaluate a business opportunity. The DA agent ran its protocol, synthesized findings, and delivered a score: 88/100. Strong proceed. Compelling reasoning. Specific recommendations.&lt;br&gt;
It hadn't run a single web search.&lt;br&gt;
The 88 was sophisticated-sounding analysis with a confidence number attached. No competitor research. No market validation. No external data of any kind. Just... vibes with decimal points.&lt;br&gt;
I almost shipped a strategy based on that score. That near-miss became the most important design constraint in the entire system.&lt;br&gt;
4 Tiers of Adversarial Depth&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;L3: MULTI-ADVERSARIAL ──── Devil's Advocate + Contrarian + Pre-Mortem
                           For: Strategic, irreversible decisions

L2: FULL DA PROTOCOL ───── 5-phase: Claims → Verify → Belief Gap → Pre-Mortem → Score
                           For: Complex tasks, new builds, significant resource commitments

L1: QUICK CHALLENGE ────── 3 adversarial questions before output delivery
                           For: Moderate tasks, recommendations, estimates

L0: SELF-CHECK ─────────── 3 assumption checks before ANY output (always active)
                           For: Everything — no exceptions, no override&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;L0: The Foundation (Always Active)&lt;br&gt;
L0 is embedded in every agent's system prompt. Before delivering any output, every agent silently runs three checks:&lt;/p&gt;

&lt;p&gt;What assumption am I making that I haven't verified?&lt;br&gt;
What's the strongest argument against my conclusion?&lt;br&gt;
What would I be wrong about if challenged by a domain expert?&lt;/p&gt;

&lt;p&gt;This costs almost nothing — three questions before each response, no external calls. But it catches unverified assumptions at the source, before they propagate through multi-agent handoffs where they become much harder to trace.&lt;br&gt;
L2: Where It Gets Interesting&lt;br&gt;
The Devil's Advocate agent runs a 5-phase protocol:&lt;br&gt;
Phase 1: Claim Extraction — Identify every falsifiable claim in the proposal. Not opinions, not framing — specific claims that can be tested against reality.&lt;br&gt;
Phase 2: Adversarial Verification — Search for contradicting evidence. Here's the key constraint: 60%+ of searches must seek disconfirmation. The default LLM behavior is to search for supporting evidence. You have to explicitly force the opposite. The search queries aren't "why X is a good idea" but "why X fails," "X competitors better than," "problems with X approach."&lt;br&gt;
Phase 3: Belief Gap Analysis — What does the team wish were true vs. what is true? This catches motivated reasoning — the gap between the conclusion you want and the evidence you have.&lt;br&gt;
Phase 4: Pre-Mortem — "It's 6 months later and this failed. Why?" Generate 5 independent failure scenarios. This reframes evaluation from "will this work?" (optimism bias) to "how could this fail?" (much more productive).&lt;br&gt;
Phase 5: Scoring — 0-100 weighted rubric across evidence quality, market fit, execution feasibility, and risk factors. Score 70+ to proceed, 50-69 to refine, below 50 to kill.&lt;br&gt;
The Double-DA Rule&lt;br&gt;
After the 88/100 incident, I added a safeguard: any score above 80 automatically triggers a second, independent DA run. The second evaluator has no access to the first's findings, reasoning, or score.&lt;br&gt;
The reconciliation logic:&lt;/p&gt;

&lt;p&gt;If both scores are within 10 points → higher score stands&lt;br&gt;
If the gap is larger than 10 → the lower score wins&lt;/p&gt;
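&lt;p&gt;The rule is small enough to state exactly. A sketch (the function name is illustrative) of the reconciliation logic:&lt;/p&gt;

```python
def reconcile(first: int, second: int) -> int:
    """Double-DA reconciliation: trust agreement, distrust divergence."""
    if abs(first - second) <= 10:
        return max(first, second)  # independent runs agree: higher score stands
    return min(first, second)      # runs diverge: the skeptical run wins
```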

&lt;p&gt;The reasoning: if two independent evaluations diverge by more than 10 points, the optimistic run missed something the skeptical run caught. Defaulting to the lower score builds in a systematic pessimism bias for high-confidence assessments — which is exactly where overconfidence is most dangerous.&lt;br&gt;
This rule has killed three initiatives that would have wasted months of development time.&lt;br&gt;
7 Mandatory Research Gates&lt;br&gt;
The DA can't score a proposal until it passes all seven:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;At least 3 web searches executed (no armchair analysis)&lt;/li&gt;
&lt;li&gt;At least 1 search explicitly seeking contradicting evidence&lt;/li&gt;
&lt;li&gt;Competitor/alternative analysis included (minimum 2 alternatives)&lt;/li&gt;
&lt;li&gt;Market size claim backed by external source (not LLM reasoning)&lt;/li&gt;
&lt;li&gt;Technical feasibility verified against real constraints&lt;/li&gt;
&lt;li&gt;"Who else has tried this?" check completed&lt;/li&gt;
&lt;li&gt;First-principles cost estimate included&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If any gate fails, the DA cannot issue a score. It must report which gates failed and what information is missing. This makes "confident ignorance" structurally impossible.&lt;/p&gt;
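&lt;p&gt;"Cannot issue a score" is a structural guard, not a convention. A sketch of how such a gate check can refuse to emit a number (the gate names here are abbreviations; the real gates live in the DA's protocol and hooks):&lt;/p&gt;

```python
REQUIRED_GATES = [
    "min_3_searches", "contradicting_search", "competitor_analysis",
    "market_size_sourced", "feasibility_verified", "prior_art_check",
    "cost_estimate",
]

def issue_score(score: int, gates: dict) -> dict:
    """Release a score only if every research gate passed; otherwise report the gaps."""
    failed = [g for g in REQUIRED_GATES if not gates.get(g, False)]
    if failed:
        return {"score": None, "blocked_by": failed}
    return {"score": score, "blocked_by": []}
```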

&lt;p&gt;Lifecycle Hooks: Automated Quality Enforcement&lt;br&gt;
Claude Code supports lifecycle hooks — shell scripts that trigger on specific events. I use 10 of them to enforce quality gates that no agent can bypass:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Hook&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Purpose&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;session-start.sh&lt;/td&gt;&lt;td&gt;Session begins&lt;/td&gt;&lt;td&gt;Load previous context + memory&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;pre-compact.sh&lt;/td&gt;&lt;td&gt;Before context compression&lt;/td&gt;&lt;td&gt;Save session state before data loss&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;session-end.sh&lt;/td&gt;&lt;td&gt;Session ends&lt;/td&gt;&lt;td&gt;Persist learnings, distill session&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;verify-before-complete.sh&lt;/td&gt;&lt;td&gt;Before task completion&lt;/td&gt;&lt;td&gt;Block premature completion&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;teammate-idle-check.sh&lt;/td&gt;&lt;td&gt;Agent goes idle&lt;/td&gt;&lt;td&gt;Force DA verdict delivery&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;task-completed-gate.sh&lt;/td&gt;&lt;td&gt;Task marked done&lt;/td&gt;&lt;td&gt;Log metrics, update pipeline&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;validate-search-quality.sh&lt;/td&gt;&lt;td&gt;After web searches&lt;/td&gt;&lt;td&gt;Enforce research depth minimums&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The two most important hooks solve specific failure modes I hit repeatedly:&lt;br&gt;
verify-before-complete.sh blocks any task from being marked complete until the verifier agent has signed off. Without this, agents declare victory the moment code compiles. "It works" and "it's correct" are different statements — this hook enforces the distinction.&lt;br&gt;
teammate-idle-check.sh catches a subtler problem: the Devil's Advocate going idle before delivering a verdict. In multi-agent teams, the DA reads other agents' outputs and then... does nothing. It "participated" without actually challenging anything. This hook detects when the DA hasn't delivered a written verdict and forces one before the team can proceed.&lt;br&gt;
These hooks are the immune system. They don't make the system smarter — they prevent specific, known failure modes from recurring.&lt;/p&gt;

&lt;p&gt;Skills: The Fast Path&lt;br&gt;
19 skills act as composable workflows invoked via slash commands. Each is self-contained with its own execution logic:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;/tdd                  → Test-driven development (failing test → minimal code → refactor)
/auto-orchestrate     → Classify task complexity, compose optimal agent team
/devils-advocate      → Full L2 adversarial protocol
/research             → Web research with verification gates
/council-of-winners   → Elite strategy: power plays, asymmetric upside identification
/prospect-scan        → Company AI maturity assessment (10-point rubric)
/commit-push-pr       → Git workflow: branch → commit → push → PR
/self-audit           → Full-spectrum system health check
/produce-deliverable  → End-to-end client deliverable pipeline
/distribute           → Generate platform-specific content from any session&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Skills are the routing engine's fast path. When a request maps cleanly to an existing skill, there's zero routing overhead — the lead agent recognizes the pattern and delegates directly. The router checks for skill matches first, before evaluating whether to spawn agents.&lt;br&gt;
After 3+ successful uses of the same workflow pattern, the system identifies candidates for new skills — turning repeated multi-step processes into one-command invocations.&lt;/p&gt;

&lt;p&gt;The Memory System&lt;br&gt;
Multi-agent systems have a context problem. Each agent starts fresh. But institutional knowledge — past decisions, known failure modes, project context — needs to persist without bloating every session's context window.&lt;br&gt;
Three-layer solution:&lt;br&gt;
Layer 1: Auto-loaded (every session, ~4KB budget)&lt;br&gt;
Core behavior rules, priority stack, active operations summary, agent descriptions (for routing decisions). This loads on every session start via the session-start.sh hook.&lt;br&gt;
The budget matters. I enforce hard limits:&lt;/p&gt;

&lt;p&gt;Core config: &amp;lt; 80 lines&lt;br&gt;
Memory summary: &amp;lt; 50 lines&lt;br&gt;
Total auto-load: &amp;lt; 4KB&lt;/p&gt;
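&lt;p&gt;Budgets like these are only architecture if something enforces them. A sketch of a budget check (the file names, and the idea of running it from the session-start hook, are illustrative):&lt;/p&gt;

```python
LINE_LIMITS = {"core_config": 80, "memory_summary": 50}
TOTAL_BUDGET = 4 * 1024  # 4KB auto-load ceiling

def check_budget(files: dict) -> list:
    """Flag auto-loaded memory files that blow the context budget."""
    violations = []
    for name, text in files.items():
        limit = LINE_LIMITS.get(name)
        if limit and len(text.splitlines()) > limit:
            violations.append(f"{name}: over {limit} lines")
    total = sum(len(t.encode()) for t in files.values())
    if total > TOTAL_BUDGET:
        violations.append(f"auto-load is {total} bytes, exceeds 4KB")
    return violations
```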

&lt;p&gt;Without these limits, memory files grow unbounded. A 20KB auto-load eats 15% of your context window before you've typed a single prompt. That's not a cleanup task — it's an architectural constraint.&lt;br&gt;
Layer 2: On-demand (loaded when relevant)&lt;br&gt;
46 reference documents covering orchestration patterns, adversarial depth protocols, model selection guides, strategic positioning, and past operation analyses. Only loaded when the current task requires that specific knowledge.&lt;br&gt;
The lead agent's routing decision includes identifying which reference docs the spawned agents will need. A code review task loads the architecture standards. A strategic decision loads the competitive analysis and failure library. A research task loads the verification protocols.&lt;br&gt;
Layer 3: Persistent (across sessions)&lt;br&gt;
Per-agent isolated memory directories plus session distillation files — compressed learnings from previous sessions generated by the Session Distiller agent. The session-end.sh hook triggers distillation automatically: what was decided, what failed, what should inform future sessions.&lt;br&gt;
This creates a feedback loop: sessions produce learnings → learnings load into future sessions → future sessions build on past context without re-deriving it.&lt;/p&gt;

&lt;p&gt;What I've Shipped With This System&lt;br&gt;
The architecture runs in production and has produced real, published artifacts:&lt;br&gt;
Cost Guardian (@bifrostlabs/cost-guardian on npm) — Real-time token cost tracking for Claude Code sessions. Tracks spend per agent, per session, with budget alerts and cost breakdowns. 62 tests.&lt;br&gt;
Claude Shield (@bifrostlabs/claude-shield on npm) — Security lifecycle hooks that block destructive commands before they execute. Pattern matching against known dangerous operations with configurable severity levels. 79 tests.&lt;br&gt;
Both packages built using TDD (the /tdd skill), verified by the adversarial system, and published through the /commit-push-pr automated workflow.&lt;/p&gt;

&lt;p&gt;The 7 Failures That Shaped the Architecture&lt;br&gt;
I'm sharing these in detail because the architecture only makes sense in light of the problems it solved. Every guardrail exists because something specific broke.&lt;br&gt;
Failure 1: The 88/100 Score With Zero Research&lt;br&gt;
What happened: DA evaluated a business opportunity. Produced 2,000 words of analysis. Scored 88/100. Had not executed a single web search. No competitor data. No market validation.&lt;br&gt;
Root cause: The DA protocol was reasoning-only. It could construct sophisticated arguments entirely from the LLM's training data and pattern matching. No requirement for external evidence.&lt;br&gt;
Fix: 7 mandatory research gates. 60%+ of searches must seek contradicting evidence. No score can be issued until all gates pass.&lt;br&gt;
Failure 2: Panic-Pivoting on New Information&lt;br&gt;
What happened: Competitive intelligence arrived showing a well-funded competitor in the space. The system immediately recommended abandoning the entire strategy — not adjusting the approach, but wholesale pivot.&lt;br&gt;
Root cause: No distinction between "this new info invalidates our thesis" and "this new info invalidates specific tactics." The system treated all threatening information as existential.&lt;br&gt;
Fix: Anti-pivot rule. New intel triggers a thesis-vs-tactics triage: Does this contradict why we're doing this (thesis) or how we're doing it (tactics)? Thesis invalidation requires L3 review. Tactics adjustment requires L1. Most "pivots" are actually tactical adjustments that don't require strategic rethinking.&lt;br&gt;
Failure 3: Building the Zero-Revenue Component&lt;br&gt;
What happened: The system had three workstreams: free tools, content distribution, and a revenue-generating service platform. 100% of execution went to free tools. Weeks of development, zero progress on anything that would generate income.&lt;br&gt;
Root cause: The router treated all "build" tasks equally. It didn't distinguish between building a free open-source tool and building the paid service that sustains the business. Both looked like "coding tasks" to the routing engine.&lt;br&gt;
Fix: Revenue reality gate in the DA protocol. Before any workstream gets resources, the DA asks: "Does this directly lead to revenue within 90 days? If not, what's the explicit theory for how it converts to revenue later?"&lt;br&gt;
Failure 4: Burning 50% of the Weekly Token Budget in One Session&lt;br&gt;
What happened: Spawned 5 Opus-tier agents for research tasks. Each running full context, full reasoning, full analysis. The session cost more than the previous week combined.&lt;br&gt;
Root cause: The model selection guide existed as a reference doc but wasn't enforced. Agents defaulted to the most capable model because nothing stopped them.&lt;br&gt;
Fix: Hard constraints: maximum 3 parallel agents, maximum 1 Opus agent per team, research agents always use Sonnet. These aren't guidelines — they're enforced by the routing engine.&lt;br&gt;
Failure 5: Strategy Oscillation&lt;br&gt;
What happened: Six strategic directions in four months. Each one researched, architected, partially built, then abandoned when the next "better" idea emerged. Zero revenue from any of them.&lt;br&gt;
Root cause: No commitment mechanism. Every new analysis could trigger a full strategic pivot. The system was optimized for evaluating strategies, not for executing them.&lt;br&gt;
Fix: Strategy Lock — a config file that requires explicit CEO override to change strategic direction. The DA can recommend adjustments, but wholesale pivots require human intervention. The lock has held for the current strategy. Override count: 0.&lt;br&gt;
Failure 6: Absence as Evidence&lt;br&gt;
What happened: System searched for competitors on six platforms. Found nothing. Concluded: "zero competition, massive opportunity." Every platform actually had multiple established competitors — the searches just used the wrong queries.&lt;br&gt;
Root cause: Treating "I didn't find it" as "it doesn't exist." The system didn't distinguish between exhaustive search and unsuccessful search.&lt;br&gt;
Fix: DA Failure Library with pattern matching. "Absence ≠ evidence" is now a named pattern. When a search returns zero results, the system flags it for manual verification and tries alternative search queries before drawing conclusions.&lt;br&gt;
Failure 7: Survivor Bias in Success Stories&lt;br&gt;
What happened: System researched solo consulting success stories to validate the business model. Found dozens. Concluded the model was highly viable.&lt;br&gt;
Root cause: It only found success stories because failures don't write blog posts. The actual solo consulting failure rate is ~80% within the first year.&lt;br&gt;
Fix: DA protocol now requires searching for failure rates alongside success stories. Any market validation must include base rate data, not just examples of people who made it.&lt;br&gt;
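As a gate, that requirement reduces to a check like this (field names are my shorthand, not the protocol's actual schema):&lt;br&gt;

```python
# Market validation gate: success stories alone never pass.
# (Illustrative shorthand for the DA protocol requirement.)

def validation_complete(evidence):
    """Require base-rate data alongside success examples."""
    has_successes = len(evidence.get("success_stories", [])) > 0
    has_base_rate = "failure_rate" in evidence
    return has_successes and has_base_rate
```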
Each failure became a permanent entry in the DA Failure Library — a pattern-matching system that checks new proposals against past mistakes. The system doesn't just learn from failures in the abstract; it maintains a structured database of exactly how it failed and checks whether new proposals exhibit the same patterns.&lt;/p&gt;
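&lt;p&gt;A pattern-matching library like this can be sketched as named predicates over new proposals (the patterns and the proposal fields below are simplified assumptions about the real structure):&lt;/p&gt;

```python
# DA Failure Library: past failures become checks against new proposals.
# (Simplified sketch; the real library holds seven documented patterns.)

FAILURE_LIBRARY = [
    {"name": "absence-not-evidence",
     "trigger": lambda p: p.get("competitor_count") == 0},
    {"name": "survivor-bias",
     "trigger": lambda p: p.get("cites_success_stories")
                          and not p.get("cites_base_rates")},
]

def matching_failures(proposal):
    """Return the names of past failure patterns the proposal exhibits."""
    return [f["name"] for f in FAILURE_LIBRARY if f["trigger"](proposal)]
```

&lt;p&gt;The point of the structure is that each new failure only costs one more entry, while every future proposal gets checked against all of them.&lt;/p&gt;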

&lt;p&gt;Key Lessons&lt;br&gt;
Separate evaluation from execution. The agent that builds something will defend it. A separate adversary catches what the builder won't see. This isn't just good practice — it's the single highest-leverage architectural decision I made.&lt;br&gt;
Enforce verification, don't suggest it. Having a DA protocol didn't help when it was optional. The verify-before-complete hook, the mandatory research gates, the teammate-idle-check — these work because they're structural, not cultural. An agent can't skip them.&lt;br&gt;
Compose teams dynamically. Every complex task is different. Composing fresh agent teams with task-specific system prompts outperforms recycling the same agent template. The overhead of writing a custom prompt is trivial compared to the cost of a misfit team.&lt;br&gt;
Context discipline is architecture, not cleanup. Without size limits on auto-loaded memory, context bloat degrades everything — reasoning quality, response speed, token cost. The 4KB budget and the 80-line config limit are design decisions, not afterthoughts.&lt;br&gt;
Build the failure feedback loop. The Double-DA rule, the verify-before-complete hook, the DA Failure Library — each exists because the system failed in a specific, observable way. The meta-skill isn't building agents. It's building the system that turns agent failures into agent guardrails.&lt;/p&gt;
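&lt;p&gt;"Structural, not cultural" is the whole trick. A verify-before-complete gate, reduced to its essence, looks like this (a minimal sketch in the spirit of the hook; the real one runs in Claude Code's lifecycle, and these names are illustrative):&lt;/p&gt;

```python
# Verify-before-complete: an agent cannot mark work done;
# completion *is* passing verification. (Illustrative sketch.)

def complete_task(task, verify_fn):
    """Run verification and refuse completion on failure."""
    report = verify_fn(task)
    if not report["passed"]:
        raise RuntimeError(f"verification failed: {report['reason']}")
    return {"task": task, "status": "complete", "evidence": report}
```

&lt;p&gt;Because the check raises instead of warning, there is no code path where an agent skips it, which is exactly the property the optional DA protocol lacked.&lt;/p&gt;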

&lt;p&gt;What's Next&lt;br&gt;
I'm open-sourcing components of this architecture and documenting the patterns that transfer to any multi-agent system. The core principles aren't Claude-specific:&lt;/p&gt;

&lt;p&gt;Route by complexity, not by habit&lt;br&gt;
Enforce adversarial checking at every tier&lt;br&gt;
Compose teams dynamically, not from templates&lt;br&gt;
Make your failures into your guardrails&lt;/p&gt;

&lt;p&gt;If you're building multi-agent systems — on Claude Code, LangChain, CrewAI, or anything else — I'd genuinely like to hear what coordination problems you've hit and how you solved them.&lt;/p&gt;

&lt;p&gt;Built with Claude Code. 13 agents, 19 skills, 10 lifecycle hooks, 141 tests, 7 documented failures. Running in production at @bifrostlabs on GitHub and npm.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
