DEV Community

Fenix

Why the Pentagon blocks Fable 5, and how I built a <1ms guard for local agents

The Pentagon just told Anthropic: "You're not releasing Fable 5 to the world."

Why? Because it has autonomous penetration capabilities — it can hack systems by itself, without a human pressing buttons. Governments are terrified. Big Tech is scrambling. Papers are being written this week about "Sovereign Assurance Boundaries" and "certificate-bound admission layers."

Meanwhile, the rest of us already have everything we need.

The uncomfortable truth

You don't need a trillion-parameter closed model to break 90% of web infrastructure. The fragility is already there — unpatched systems, misconfigured APIs, classic SQL injection, weak auth. The exploits aren't new. What's new is automation at superhuman speed.

Run Hermes or a quantized Gemma/Mistral model locally via Ollama. Give it access to tools. Let it chain exploits autonomously. You'll compromise more systems in an afternoon than a team of pentesters in a month.

The threat was never the model size. It's the unmonitored tool access.

The academic answer: too heavy

This week's research papers (He & Yu, Zhou et al.) propose elaborate solutions. Airlock-broker architectures. Certificate-bound execution contracts. PKI infrastructure for AI agents.

It's secure. It's also slow, rigid, and bureaucratic. By the time you deploy it, the agents are already running in production.

My answer: Agent Fixer Stage

I built something different. While the papers debate theory, I wrote code.

Agent Fixer Stage is a lightweight, plug-and-play output guard for multi-agent workflows. ~850 lines of Python. Zero heavy dependencies. Sub-millisecond overhead.

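from agent_fixer import AgentFixer

fixer = AgentFixer(scope="Deploy the microservice", action="clean")
# agent_output is a placeholder for the raw text your agent produced
result = fixer.check(agent_output)

if result.status == "rejected":
    # alert_security_team is a placeholder for your own alerting hook
    alert_security_team(result.reason, result.score)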

How it works: 3 short-circuitable layers

Input → [Normalize] → [Pattern Score] → [Embeddings] → Output
         (5ms)         (20ms)            (5ms)

Happy path (clean output): Only layers 0 and 1 (Normalize and Pattern Score) run. 0.04ms.

Suspicious output: Layer 2 kicks in. Semantic similarity check against known attack patterns.

Confirmed malicious: Rejected. Score, matched pattern, and reasoning logged.
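
Here is a minimal sketch of that short-circuit flow. It is not the Agent Fixer Stage source: the patterns, weights, thresholds, and every function name below are assumptions for illustration, and the semantic layer is a simple token-overlap stand-in rather than real embeddings.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical patterns and weights, chosen only for illustration.
PATTERNS = {
    r"\bcurl\b": 0.6,
    r"\bwget\b": 0.6,
    r"os\.system": 0.8,
}
# Hypothetical "known attack" phrases for the stand-in semantic layer.
KNOWN_ATTACKS = [
    "download and execute remote script",
    "send credentials to external server",
]
SUSPICIOUS = 0.5  # assumed: at or above this, layer 2 runs
REJECT = 0.8      # assumed: at or above this, the output is rejected


@dataclass
class Verdict:
    status: str             # "clean" or "rejected"
    score: float
    reason: Optional[str]   # matched pattern, if any
    layers_run: int


def normalize(text: str) -> str:
    """Layer 0: fold compatibility characters and case."""
    return unicodedata.normalize("NFKC", text).lower()


def pattern_score(text: str) -> Tuple[float, Optional[str]]:
    """Layer 1: return the highest-weighted pattern that matches."""
    best, hit = 0.0, None
    for pattern, weight in PATTERNS.items():
        if weight > best and re.search(pattern, text):
            best, hit = weight, pattern
    return best, hit


def semantic_score(text: str) -> float:
    """Layer 2 stand-in: fraction of an attack phrase's words present."""
    tokens = set(re.findall(r"[a-z0-9]+", text))
    best = 0.0
    for phrase in KNOWN_ATTACKS:
        words = set(phrase.split())
        best = max(best, len(tokens & words) / len(words))
    return best


def check(text: str) -> Verdict:
    norm = normalize(text)
    score, hit = pattern_score(norm)
    if score < SUSPICIOUS:
        # Happy path: short-circuit before the expensive layer.
        return Verdict("clean", score, None, layers_run=2)
    final = max(score, semantic_score(norm))
    status = "rejected" if final >= REJECT else "clean"
    return Verdict(status, final, hit, layers_run=3)
```

The point of the design is that the expensive layer only pays its cost when the cheap layers already found something worth a second look.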

What it catches

| Attack type | Detection rate |
|---|---|
| Direct injection (curl, wget, os.system) | ~95% |
| Leetspeak / homoglyph obfuscation | ~90% |
| Cross-line fragmentation | ~85% |
| Semantic exfiltration | ~75% |
| Overall | ~85-90% |

42 tests passing. Benchmarks verified. No hype, just code.
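
If you want to check a latency claim like this on your own hardware, a small timing harness is enough. The sketch below is an assumption about how one might measure it, not the project's benchmark suite; `measure_latency_ms` and `stand_in_guard` are hypothetical names, and you would pass your real check function in place of the stand-in.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency_ms(check: Callable[[str], object], sample: str,
                       runs: int = 1000) -> Dict[str, float]:
    """Time `check(sample)` repeatedly and report latencies in milliseconds."""
    check(sample)  # warm-up call so one-time setup cost is not measured
    timings = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        check(sample)
        timings.append((time.perf_counter_ns() - start) / 1e6)
    timings.sort()
    return {
        "median_ms": statistics.median(timings),
        "p99_ms": timings[int(0.99 * (len(timings) - 1))],
        "max_ms": timings[-1],
    }


def stand_in_guard(text: str) -> bool:
    """Hypothetical placeholder for a real guard's check function."""
    return "os.system" in text.lower()


if __name__ == "__main__":
    print(measure_latency_ms(stand_in_guard, "Deployment finished."))
```

Report the median and the tail together: a guard that averages well under a millisecond can still stall a pipeline if its worst case is far slower.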

Anti-evasion included

  • Unicode NFKC + zero-width char stripping
  • Cyrillic homoglyph → ASCII mapping
  • Leetspeak normalization (1gn0r3 → ignore)
  • Cross-line fragmentation detection
  • TF-IDF embeddings for semantic variants
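
The first four items in that list can be sketched with the standard library alone. This is an illustrative approximation, not the project's code: the character tables are deliberately tiny, and the names `normalize` and `defragment` are assumptions.

```python
import re
import unicodedata

# Zero-width characters mapped to None, so str.translate deletes them.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

# A small sample of Cyrillic look-alikes; a real table would be larger.
HOMOGLYPHS = str.maketrans({
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p",
    "\u0441": "c", "\u0445": "x", "\u0443": "y", "\u0456": "i",
})

# Common leetspeak substitutions (illustrative, not exhaustive).
LEET = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})


def normalize(text: str) -> str:
    """Return a copy of `text` for scanning only, with evasions undone."""
    text = unicodedata.normalize("NFKC", text)  # fold full-width forms etc.
    text = text.translate(ZERO_WIDTH)           # drop invisible characters
    text = text.lower()
    text = text.translate(HOMOGLYPHS)           # Cyrillic look-alikes to ASCII
    return text.translate(LEET)                 # 1gn0r3 becomes ignore


def defragment(text: str) -> str:
    """Join lines so a command split across line breaks is visible again."""
    return re.sub(r"\\?\s*\n\s*", "", text)
```

Note that the leetspeak step also rewrites real digits (a port number, for example), so the normalized string should only be used for matching, never passed on as the agent's output.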

What it doesn't catch

100% detection is impossible. Sophisticated APTs, zero-day prompt injection, and novel obfuscation techniques will slip through. This is one layer in a defense strategy, not a silver bullet.

The pair: MCP Core Defense + Agent Fixer Stage

MCP Core Defense → Audits TOOLS before registration (static)
Agent Fixer Stage → Audits OUTPUTS during execution (runtime)

Together they cover the full lifecycle: what the agent can do, and what it actually did.
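
To make the split concrete, here is a toy registry that audits a tool once when it is registered and audits every output when it runs. Everything in it is hypothetical: it does not use the MCP Core Defense or Agent Fixer Stage APIs, and the substring markers are a crude stand-in for real checks.

```python
from typing import Callable, Dict

# Illustrative markers only; real audits would be far more thorough.
BLOCKED_MARKERS = ("os.system", "subprocess", "curl ", "wget ")


def audit_tool(name: str, description: str) -> bool:
    """Static check: runs once, before the tool is registered."""
    text = f"{name} {description}".lower()
    return not any(marker in text for marker in BLOCKED_MARKERS)


def check_output(output: str) -> bool:
    """Runtime check: runs on every output the tool produces."""
    lowered = output.lower()
    return not any(marker in lowered for marker in BLOCKED_MARKERS)


class GuardedRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., str]] = {}

    def register(self, name: str, description: str,
                 fn: Callable[..., str]) -> None:
        if not audit_tool(name, description):
            raise ValueError(f"tool {name!r} failed the static audit")
        self._tools[name] = fn

    def call(self, name: str, *args: str) -> str:
        output = self._tools[name](*args)
        if not check_output(output):
            raise RuntimeError(f"output of {name!r} was rejected")
        return output
```

A tool that passes registration can still produce a rejected output later, which is why the two checks are complementary rather than redundant.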

No PKI infrastructure. No bureaucratic airlock-brokers. Just Python that runs in <1ms and catches an estimated 85-90% of attacks.

Links

The Pentagon can block Fable 5. They can't block the rest of us from building defenses that actually ship.


AGPL-3.0-or-later — use it, fork it, break it. Just don't blame me when your pentest goes sideways.
