Why the Pentagon blocks Fable 5, and how I built a <1ms guard for local agents
The Pentagon just told Anthropic: "You're not releasing Fable 5 to the world."
Why? Because it has autonomous penetration capabilities — it can hack systems by itself, without a human pressing buttons. Governments are terrified. Big Tech is scrambling. Papers are being written this week about "Sovereign Assurance Boundaries" and "certificate-bound admission layers."
Meanwhile, the rest of us already have everything we need.
The uncomfortable truth
You don't need a trillion-parameter closed model to break 90% of web infrastructure. The fragility is already there — unpatched systems, misconfigured APIs, classic SQL injection, weak auth. The exploits aren't new. What's new is automation at superhuman speed.
Run Hermes or a quantized Gemma/Mistral model locally via Ollama. Give it access to tools. Let it chain exploits autonomously. You'll compromise more systems in an afternoon than a team of pentesters in a month.
The threat was never the model size. It's the unmonitored tool access.
The academic answer: too heavy
This week's research papers (He & Yu, Zhou et al.) propose elaborate solutions. Airlock-broker architectures. Certificate-bound execution contracts. PKI infrastructure for AI agents.
It's secure. It's also slow, rigid, and bureaucratic. By the time you deploy it, the agents are already running in production.
My answer: Agent Fixer Stage
I built something different. While the papers debate theory, I wrote code.
Agent Fixer Stage is a lightweight, plug-and-play output guard for multi-agent workflows. ~850 lines of Python. Zero heavy dependencies. Sub-millisecond overhead.
from agent_fixer import AgentFixer
fixer = AgentFixer(scope="Deploy the microservice", action="clean")
result = fixer.check(agent_output)
if result.status == "rejected":
alert_security_team(result.reason, result.score)
How it works: 3 cortocalable layers
Input → [Normalize] → [Pattern Score] → [Embeddings] → Output
(5ms) (20ms) (5ms)
Happy path (clean output): Only layers 0+1 run. 0.04ms.
Suspicious output: Layer 2 kicks in. Semantic similarity check against known attack patterns.
Confirmed malicious: Rejected. Score, matched pattern, and reasoning logged.
What it catches
| Attack type | Detection |
|---|---|
| Direct injection (curl, wget, os.system) | ~95% |
| Leetspeak / homoglyph obfuscation | ~90% |
| Cross-line fragmentation | ~85% |
| Semantic exfiltration | ~75% |
| Global | ~85-90% |
42 tests passing. Benchmarks verified. No hype, just code.
Anti-evasion included
- Unicode NFKC + zero-width char stripping
- Cyrillic homoglyph → ASCII mapping
- Leetspeak normalization (
1gn0r3→ignore) - Cross-line fragmentation detection
- TF-IDF embeddings for semantic variants
What it doesn't catch
100% detection is impossible. Sophisticated APTs, zero-day prompt injection, and novel obfuscation techniques will slip through. This is one layer in a defense strategy, not a silver bullet.
The pair: MCP Core Defense + Agent Fixer Stage
MCP Core Defense → Audits TOOLS before registration (static)
Agent Fixer Stage → Audits OUTPUTS during execution (runtime)
Together they cover the full lifecycle: what the agent can do, and what it actually did.
No PKI infrastructure. No bureaucratic airlock-brokers. Just Python that runs in <1ms and catches 9 out of 10 attacks.
Links
- Agent Fixer Stage: https://github.com/amurlaniakea/agent-fixer-stage
- MCP Core Defense: https://github.com/amurlaniakea/mcp-core-defense
- Paper: https://arxiv.org/abs/2606.12709 (McAllister et al., 2026)
The Pentagon can block Fable 5. They can't block the rest of us from building defenses that actually ship.
AGPL-3.0-or-later — use it, fork it, break it. Just don't blame me when your pentest goes sideways.
Top comments (0)