<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Damian Saez</title>
    <description>The latest articles on DEV Community by Damian Saez (@saezbaldo).</description>
    <link>https://dev.to/saezbaldo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3779487%2Fef6d90c8-23c5-4f06-af8a-cdc8f7525b1e.png</url>
      <title>DEV Community: Damian Saez</title>
      <link>https://dev.to/saezbaldo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/saezbaldo"/>
    <language>en</language>
    <item>
      <title>Your AI Guardrails Fail the Kerckhoffs Test</title>
      <dc:creator>Damian Saez</dc:creator>
      <pubDate>Thu, 19 Feb 2026 20:32:38 +0000</pubDate>
      <link>https://dev.to/saezbaldo/what-happens-when-an-ai-agent-understands-its-own-guardrails-25al</link>
      <guid>https://dev.to/saezbaldo/what-happens-when-an-ai-agent-understands-its-own-guardrails-25al</guid>
      <description>&lt;p&gt;Most AI agent frameworks assume one thing: the agent won't figure out how its guardrails work.&lt;/p&gt;

&lt;p&gt;That assumption is already wrong. And it's going to get worse every quarter.&lt;/p&gt;

&lt;p&gt;GPT-4, Claude, and Gemini can already reason about code, model constraints, and find edge cases. If your control system depends on the agent &lt;em&gt;not understanding&lt;/em&gt; how it works, you're building on borrowed time.&lt;/p&gt;

&lt;p&gt;Today I want to show why the dominant approach to AI agent safety (output filtering, tool allowlists, system prompts) breaks down the moment the agent can reason about its own constraints. And what the alternative looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your guardrails are not secrets
&lt;/h2&gt;

&lt;p&gt;Consider how most AI guardrails work today:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A system prompt says "don't do X"&lt;/li&gt;
&lt;li&gt;An output filter checks for patterns matching X&lt;/li&gt;
&lt;li&gt;A tool allowlist restricts which functions the agent can call&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now consider what a sufficiently capable agent knows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It can read (or infer) the system prompt&lt;/li&gt;
&lt;li&gt;It can test what patterns the output filter catches&lt;/li&gt;
&lt;li&gt;It can enumerate the available tools and their parameters&lt;/li&gt;
&lt;li&gt;It can &lt;strong&gt;reason&lt;/strong&gt; about the gap between what's intended and what's enforced&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't theoretical. Any model capable of multi-step planning is capable of modeling its own constraints. The question isn't &lt;em&gt;if&lt;/em&gt; it will understand the guardrails. The question is &lt;em&gt;when&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Kerckhoffs problem
&lt;/h2&gt;

&lt;p&gt;In 1883, Auguste Kerckhoffs established a principle that every cryptographer takes as gospel: &lt;strong&gt;a system should be secure even if everything about it is known, except the key.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Applied to AI agents: your authorization system should be secure even if the agent has perfect knowledge of how it works.&lt;/p&gt;

&lt;p&gt;Let's test the major frameworks against this principle:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Agent knows the rules?&lt;/th&gt;
&lt;th&gt;Still secure?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System prompt restrictions&lt;/td&gt;
&lt;td&gt;Yes (it receives the prompt)&lt;/td&gt;
&lt;td&gt;No. Can craft edge cases.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output filtering (Guardrails AI)&lt;/td&gt;
&lt;td&gt;Can probe by trial and error&lt;/td&gt;
&lt;td&gt;Partially. Oracle attacks degrade it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool allowlists&lt;/td&gt;
&lt;td&gt;Yes (it sees available tools)&lt;/td&gt;
&lt;td&gt;No. Can chain allowed tools into unintended outcomes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limiting&lt;/td&gt;
&lt;td&gt;Can observe its own throughput&lt;/td&gt;
&lt;td&gt;No. Maximizes damage within the limit.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;K-of-N threshold authorization&lt;/td&gt;
&lt;td&gt;Yes (understands the protocol)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes. Knowing doesn't help.&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The last row is the key insight. Threshold authorization is secure under Kerckhoffs' principle. The agent can understand exactly how Shamir Secret Sharing works, study the K-of-N protocol, and memorize the code. None of that knowledge helps it forge K independent approvals that it doesn't control.&lt;/p&gt;

&lt;p&gt;This is the difference between a &lt;strong&gt;logical&lt;/strong&gt; constraint and a &lt;strong&gt;structural&lt;/strong&gt; one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logical vs structural constraints
&lt;/h2&gt;

&lt;p&gt;A logical constraint says: "If the output matches pattern X, block it."&lt;/p&gt;

&lt;p&gt;A structural constraint says: "This action requires 3 independent cryptographic approvals. The agent controls 0 of them."&lt;/p&gt;

&lt;p&gt;The difference matters because intelligent adversaries can defeat logical constraints. They probe, find patterns, exploit edge cases. That's what intelligence &lt;em&gt;does&lt;/em&gt;. But structural constraints hold regardless of the adversary's intelligence.&lt;/p&gt;

&lt;p&gt;More concretely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Logical:   if "transfer" in action and amount &amp;gt; 10000: block()
Structural: requires 3-of-5 threshold signatures from independent nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The logical constraint can be bypassed by splitting into two $9,999 transfers, or by using a synonym, or by finding the one edge case the rule didn't anticipate. The structural constraint cannot be bypassed without compromising 3 independent nodes. Understanding the rule doesn't help.&lt;/p&gt;
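&lt;p&gt;A toy sketch of the contrast (both functions are hypothetical illustrations, not IC-AGI code): the logical guard encodes a pattern, the structural guard counts independent approvals.&lt;/p&gt;

```python
# Toy contrast between a logical rule and a structural rule.
# Both functions are hypothetical illustrations, not IC-AGI code.

def logical_guard(action: str, amount: int) -> bool:
    """Allow the action unless it matches a blocked pattern."""
    return not ("transfer" in action and amount > 10000)

def structural_guard(approvals: set, required: int = 3) -> bool:
    """Allow the action only if `required` distinct approvals exist."""
    return len(approvals) >= required

# The logical rule passes two $9,999 transfers that together exceed the cap:
assert logical_guard("transfer funds", 9999)
assert logical_guard("transfer funds", 9999)

# The structural rule is indifferent to phrasing; without three independent
# approvals the action simply cannot proceed:
assert not structural_guard({"node-a", "node-b"})
assert structural_guard({"node-a", "node-b", "node-c"})
```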

&lt;h2&gt;
  
  
  Defense in depth: what structural authorization looks like
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://github.com/saezbaldo/ic-agi" rel="noopener noreferrer"&gt;IC-AGI&lt;/a&gt;, an open-source distributed authority framework, we layer five structural defenses. Each one is designed to be effective even when the agent has complete knowledge of the defense.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Threshold Authorization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Critical actions require K-of-N independent approvals. The threshold is enforced at initialization (&lt;code&gt;threshold &amp;gt;= 2&lt;/code&gt;, always). Even the agent that &lt;em&gt;wrote&lt;/em&gt; the approval request cannot approve its own request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Threshold must be &amp;gt;= 2 (no unilateral authority)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Understanding the protocol is irrelevant. The agent would need to compromise K independent approvers, each potentially in different organizations or jurisdictions. This is a &lt;em&gt;coordination&lt;/em&gt; problem, not an &lt;em&gt;intelligence&lt;/em&gt; problem.&lt;/p&gt;
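&lt;p&gt;Both invariants (a floor of 2 on the threshold, and no self-approval) fit in a few lines. The class and method names below are illustrative, not the IC-AGI API:&lt;/p&gt;

```python
# Illustrative sketch of the two K-of-N invariants (not the IC-AGI API):
# threshold >= 2 at construction, and the requester's own vote never counts.

class ThresholdAuthorizer:
    def __init__(self, approvers: set, threshold: int):
        if threshold < 2:
            raise ValueError("Threshold must be >= 2 (no unilateral authority)")
        if threshold > len(approvers):
            raise ValueError("Threshold cannot exceed the number of approvers")
        self.approvers = approvers
        self.threshold = threshold

    def authorized(self, requester: str, approvals: set) -> bool:
        # Only votes from registered approvers count, never the requester's.
        valid = (approvals & self.approvers) - {requester}
        return len(valid) >= self.threshold
```

&lt;p&gt;A 3-of-5 instance rejects any request backed by fewer than three approvers other than the requester itself, no matter how well the requester understands this code.&lt;/p&gt;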

&lt;p&gt;&lt;strong&gt;Layer 2: Anti-Oracle Detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What if the agent tries to learn the system's decision boundaries by probing?&lt;/p&gt;

&lt;p&gt;IC-AGI tracks query patterns per entity using a sliding window. Repeated or systematically varied queries raise a suspicion score. Above the threshold, the entity is blocked automatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AntiOracleConfig&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;window_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;120.0&lt;/span&gt;
    &lt;span class="n"&gt;max_identical_queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
    &lt;span class="n"&gt;max_similar_queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;
    &lt;span class="n"&gt;suspicion_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
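&lt;p&gt;A minimal sketch of such a sliding-window detector. The scoring rule here (identical queries in the window divided by the cap) is an assumption for illustration; the real detector also tracks similar queries:&lt;/p&gt;

```python
import time
from collections import defaultdict, deque

# Sliding-window probe detector, sketched under an assumed scoring rule:
# suspicion = identical queries in the window / max_identical_queries.
class AntiOracleDetector:
    def __init__(self, window_seconds=120.0, max_identical_queries=5,
                 suspicion_threshold=0.8):
        self.window = window_seconds
        self.max_identical = max_identical_queries
        self.threshold = suspicion_threshold
        self.history = defaultdict(deque)  # entity -> deque of (timestamp, query)

    def record(self, entity: str, query: str, now: float = None) -> bool:
        """Record a query; return True if the entity should be blocked."""
        now = time.monotonic() if now is None else now
        q = self.history[entity]
        q.append((now, query))
        while q and now - q[0][0] > self.window:  # evict entries outside window
            q.popleft()
        identical = sum(1 for _, past in q if past == query)
        suspicion = min(1.0, identical / self.max_identical)
        return suspicion >= self.threshold
```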



&lt;p&gt;The agent knows this detection exists? Fine. It can slow down its probing. But the budget tokens will run out before it gathers enough data, because...&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: Consumable Budget Tokens&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every capability token has a finite budget that decreases with each use. Not a rate limit (which the agent can optimize against). A hard cap.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Token: {scope: "crm.update", budget: 50, ttl: 3600s}
After 50 uses: token is dead. Not rate-limited. Dead.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
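&lt;p&gt;In sketch form (field names mirror the token above; a real token would be cryptographically signed and verified server-side, which this toy version omits):&lt;/p&gt;

```python
import time

class BudgetToken:
    """Capability token with a hard consumption cap. Illustrative only:
    a production token would be signed and verified server-side."""

    def __init__(self, scope: str, budget: int, ttl_seconds: float):
        self.scope = scope
        self.budget = budget
        self.expires_at = time.monotonic() + ttl_seconds

    def consume(self) -> None:
        if time.monotonic() > self.expires_at:
            raise PermissionError("token expired")
        if self.budget <= 0:
            raise PermissionError("budget exhausted: token is dead")
        self.budget -= 1
```

&lt;p&gt;There is no refill path. Once &lt;code&gt;budget&lt;/code&gt; reaches zero, every further call fails, however cleverly the calls are spaced out.&lt;/p&gt;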



&lt;p&gt;The anti-oracle detector catches fast probing. The budget cap kills slow probing. Together, they make oracle attacks economically infeasible regardless of the agent's strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 4: Circuit Breaker Isolation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If a worker starts behaving anomalously (high error rate, unexpected patterns), the circuit breaker trips and isolates it completely.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLOSED (healthy) -&amp;gt; failure rate &amp;gt; threshold -&amp;gt; OPEN (blocked)
                                                    |
                recovery timeout                    v
CLOSED &amp;lt;-- probe succeeds &amp;lt;-- HALF_OPEN (one probe allowed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent can understand the circuit breaker perfectly. But understanding it doesn't help: the breaker responds to &lt;em&gt;behavior&lt;/em&gt;, not to &lt;em&gt;intent&lt;/em&gt;. If the agent behaves normally, the breaker stays closed and the agent is constrained by the other layers. If it behaves abnormally, the breaker trips and it's isolated.&lt;/p&gt;

&lt;p&gt;This creates what game theorists call a &lt;strong&gt;dominant strategy equilibrium&lt;/strong&gt;: the agent's best move is to behave normally, regardless of what it knows about the system.&lt;/p&gt;
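&lt;p&gt;The state machine above, sketched with assumed parameters (a simple failure count and a recovery timeout; the real breaker may use error rates and richer signals):&lt;/p&gt;

```python
import time

# Sketch of the CLOSED -> OPEN -> HALF_OPEN machine from the diagram above.
# Thresholds and method names are illustrative assumptions.
class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, max_failures=3, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.state = self.CLOSED
        self.failures = 0
        self.max_failures = max_failures
        self.recovery_timeout = recovery_timeout
        self.opened_at = None
        self.clock = clock  # injectable for testing

    def allow(self) -> bool:
        if self.state == self.OPEN:
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = self.HALF_OPEN  # one probe allowed
                return True
            return False
        return True

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            self.state = self.CLOSED
        else:
            self.failures += 1
            if self.state == self.HALF_OPEN or self.failures >= self.max_failures:
                self.state = self.OPEN
                self.opened_at = self.clock()
```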

&lt;p&gt;&lt;strong&gt;Layer 5: Sandboxed Execution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When the agent's code actually runs, it executes inside a sandbox that uses AST-level whitelisting. Before a single byte of code executes, every AST node is checked against a safe set. No imports. No &lt;code&gt;exec&lt;/code&gt;. No &lt;code&gt;eval&lt;/code&gt;. No filesystem. No network.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_FORBIDDEN_NAMES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;import&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__import__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exec&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;os&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subprocess&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;socket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pathlib&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ctypes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;signal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't a filter that can be bypassed with creative encoding. It's a whitelist. If it's not explicitly in the safe set, it doesn't run. Period.&lt;/p&gt;
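&lt;p&gt;A toy version of the idea, using Python's &lt;code&gt;ast&lt;/code&gt; module. The safe set below is a tiny illustrative subset, not IC-AGI's actual whitelist:&lt;/p&gt;

```python
import ast

# Tiny illustrative safe set: assignment and arithmetic only.
_SAFE_NODES = (
    ast.Module, ast.Expr, ast.Assign, ast.Name, ast.Load, ast.Store,
    ast.Constant, ast.BinOp, ast.Add, ast.Sub, ast.Mult, ast.Div,
)

def check(source: str) -> None:
    """Reject code containing any AST node outside the safe set,
    before a single byte of it executes."""
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, _SAFE_NODES):
            raise PermissionError(f"forbidden node: {type(node).__name__}")

check("x = 1 + 2")  # passes: only whitelisted nodes

try:
    check("__import__('os')")  # Call is not in the safe set
except PermissionError:
    pass
```

&lt;p&gt;Creative encoding doesn't help: whatever the source looks like, it must parse into nodes, and every node is checked.&lt;/p&gt;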

&lt;h2&gt;
  
  
  The composition matters
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Each layer alone is defeatable by a sufficiently capable agent. The composition is not.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An agent that understands threshold auth can try to probe the system. The anti-oracle detector catches probing. An agent that slows its probing to avoid detection runs out of budget tokens. An agent that tries to exploit a worker is caught by the circuit breaker. An agent that tries to run malicious code hits the AST whitelist.&lt;/p&gt;

&lt;p&gt;Every escape path from one layer leads into the enforcement zone of another. That's not a coincidence. That's the architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  None of this is individually novel
&lt;/h2&gt;

&lt;p&gt;Let me be upfront: Shamir Secret Sharing (1979), capability-based security (1966), threshold signatures (1991), circuit breakers (2007). All prior art.&lt;/p&gt;

&lt;p&gt;The contribution is the &lt;em&gt;composition&lt;/em&gt; for a specific new adversary: actors that understand the control system perfectly and can reason about how to circumvent it.&lt;/p&gt;

&lt;p&gt;The closest analogy: HTTPS wasn't novel. TCP, TLS, X.509, HTTP all existed. The innovation was composing them into a standard that made secure web communication the default. We're doing the same for AI agent authority.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters now
&lt;/h2&gt;

&lt;p&gt;Model capability is outpacing guardrail sophistication, and the gap widens every quarter. The next generation of models will be better at reasoning about constraints, not worse.&lt;/p&gt;

&lt;p&gt;The question isn't whether to adopt structural authorization. It's whether you do it before or after the first major incident.&lt;/p&gt;

&lt;p&gt;IC-AGI is open source under Apache 2.0: &lt;a href="https://github.com/saezbaldo/ic-agi" rel="noopener noreferrer"&gt;github.com/saezbaldo/ic-agi&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;273 tests. 159 formal verifications. Zero safety violations. Designed from the ground up for adversaries that understand the system perfectly.&lt;/p&gt;

&lt;p&gt;If you see where the composition breaks down, I want to hear it. Open an issue or leave a comment.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previously: &lt;a href="https://dev.to/saezbaldo/every-ai-agent-framework-trusts-the-agent-thats-the-problem-5gfa"&gt;Every AI Agent Framework Trusts the Agent. That's the Problem.&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Next: Consumable Budget Tokens: OAuth for AI Agents&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>agents</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Every AI Agent Framework Trusts the Agent. That's the Problem.</title>
      <dc:creator>Damian Saez</dc:creator>
      <pubDate>Wed, 18 Feb 2026 13:10:27 +0000</pubDate>
      <link>https://dev.to/saezbaldo/every-ai-agent-framework-trusts-the-agent-thats-the-problem-5gfa</link>
      <guid>https://dev.to/saezbaldo/every-ai-agent-framework-trusts-the-agent-thats-the-problem-5gfa</guid>
      <description>&lt;p&gt;Every AI agent framework trusts the agent.&lt;/p&gt;

&lt;p&gt;LangChain. AutoGen. CrewAI. Anthropic Tool Use. OpenAI Function Calling. Every single one.&lt;/p&gt;

&lt;p&gt;They validate &lt;em&gt;outputs&lt;/em&gt;. They filter &lt;em&gt;responses&lt;/em&gt;. They scope &lt;em&gt;tools&lt;/em&gt;. But none of them answer a fundamental question: &lt;strong&gt;who authorized this agent to act?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I spent 30 years building software. The last year convinced me this is the most important unsolved problem in AI infrastructure today.&lt;/p&gt;




&lt;h2&gt;
  
  
  The gap nobody talks about
&lt;/h2&gt;

&lt;p&gt;I went through every major AI agent framework and authorization system. Here's what I found:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Authorization model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI Function Calling&lt;/td&gt;
&lt;td&gt;2023&lt;/td&gt;
&lt;td&gt;LLM calls predefined functions&lt;/td&gt;
&lt;td&gt;None. If the function exists, the agent can call it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangChain Tools&lt;/td&gt;
&lt;td&gt;2023&lt;/td&gt;
&lt;td&gt;Agent tool routing&lt;/td&gt;
&lt;td&gt;None. No built-in approval, no budget, no threshold.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic Tool Use&lt;/td&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;td&gt;Constrained tool execution&lt;/td&gt;
&lt;td&gt;Provider-side only. Not infrastructure-level.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Microsoft AutoGen&lt;/td&gt;
&lt;td&gt;2023&lt;/td&gt;
&lt;td&gt;Multi-agent orchestration&lt;/td&gt;
&lt;td&gt;Agents trust each other. No adversarial model.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CrewAI&lt;/td&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;td&gt;Multi-agent task framework&lt;/td&gt;
&lt;td&gt;No threshold auth. No formal properties.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guardrails AI&lt;/td&gt;
&lt;td&gt;2023&lt;/td&gt;
&lt;td&gt;Output validation&lt;/td&gt;
&lt;td&gt;Validates &lt;em&gt;outputs&lt;/em&gt;, not &lt;em&gt;authority to act&lt;/em&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Not a single one implements threshold authorization, consumable budget tokens, or formal verification of safety properties.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All of them assume the agent is trusted, or that filtering its output is good enough.&lt;/p&gt;

&lt;p&gt;It's not.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why output filtering isn't enough
&lt;/h2&gt;

&lt;p&gt;Let's be precise about what happens when an AI agent has tool access:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent → decides action → calls tool → effect happens in the real world
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output validation sits here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent → decides action → [FILTER] → calls tool → effect happens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It checks: &lt;em&gt;"Is this output safe?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But it doesn't check: &lt;em&gt;"Does this agent have the authority to take this action right now, with this budget, approved by enough independent parties?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These are fundamentally different questions. The first is about content. The second is about &lt;strong&gt;power&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An agent can craft a perfectly valid, perfectly formatted, perfectly "safe-looking" API call that transfers $50,000 to the wrong account. Output validation will pass it. Guardrails will pass it. The action itself is structurally valid.&lt;/p&gt;

&lt;p&gt;The question isn't whether the action &lt;em&gt;looks&lt;/em&gt; safe. The question is whether the agent &lt;em&gt;should have the power to do it at all&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The speed problem
&lt;/h2&gt;

&lt;p&gt;This matters more than most people realize because of the speed differential:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Actor&lt;/th&gt;
&lt;th&gt;Actions per second&lt;/th&gt;
&lt;th&gt;Detection window&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Human employee&lt;/td&gt;
&lt;td&gt;~0.02 (1/minute)&lt;/td&gt;
&lt;td&gt;Hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automated script&lt;/td&gt;
&lt;td&gt;~100&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI agent (unconstrained)&lt;/td&gt;
&lt;td&gt;~1,000&lt;/td&gt;
&lt;td&gt;Seconds — but damage is done&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI agent + budget enforcement&lt;/td&gt;
&lt;td&gt;Budget-capped&lt;/td&gt;
&lt;td&gt;Structural — budget exhaustion stops the agent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Traditional rate limiting slows things down. But it doesn't stop a determined agent from maximizing damage &lt;em&gt;within&lt;/em&gt; the limit. &lt;/p&gt;

&lt;p&gt;What if the budget was cryptographically enforced? When the token hits zero, it's dead. Not rate-limited. Dead.&lt;/p&gt;




&lt;h2&gt;
  
  
  The reasoning problem
&lt;/h2&gt;

&lt;p&gt;Here's the part that keeps me up at night.&lt;/p&gt;

&lt;p&gt;A sufficiently capable agent can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read the source code of whatever controls it&lt;/li&gt;
&lt;li&gt;Understand the policy rules completely&lt;/li&gt;
&lt;li&gt;Find edge cases that technically pass validation&lt;/li&gt;
&lt;li&gt;Chain multiple low-risk actions into a high-risk outcome&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So what do you do when the adversary &lt;em&gt;understands your defenses perfectly&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;You make the defense structural, not logical.&lt;/p&gt;

&lt;p&gt;Analogy: I can understand exactly how a bank vault door works. I can memorize the blueprints. I can study the metallurgy. None of that knowledge helps me walk through 12 inches of reinforced steel. The constraint isn't my understanding — it's physics.&lt;/p&gt;

&lt;p&gt;That's the principle we need for AI agents. Not smarter filters. &lt;strong&gt;Structural impossibility.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What we're building
&lt;/h2&gt;

&lt;p&gt;We built &lt;a href="https://github.com/saezbaldo/ic-agi" rel="noopener noreferrer"&gt;IC-AGI&lt;/a&gt; — an open-source distributed authority framework — around one thesis:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Separate intelligence from authority.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The agent can be arbitrarily intelligent. It can understand the entire system. But it structurally &lt;em&gt;cannot&lt;/em&gt; act alone on critical actions because:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. K-of-N Threshold Authorization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Critical actions require approval from K independent nodes out of N. Not a single API key. Not a single admin. A cryptographic consensus using Shamir Secret Sharing.&lt;/p&gt;

&lt;p&gt;If K=3 and N=5, compromising 1 or 2 nodes gives you &lt;em&gt;nothing&lt;/em&gt;. Not partial access. Nothing. This is mathematically proven, not just tested.&lt;/p&gt;
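&lt;p&gt;Why fewer than K shares give nothing: the secret is the constant term of a random degree K-1 polynomial, and K-1 points leave that term uniformly undetermined. A toy sketch with Lagrange interpolation over a prime field (toy parameters, not IC-AGI's implementation, which would use a vetted library):&lt;/p&gt;

```python
import random

PRIME = 2**127 - 1  # toy field for illustration; real deployments use vetted parameters

def split(secret: int, k: int, n: int):
    """Shamir: secret = f(0) for a random degree k-1 polynomial over GF(PRIME)."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    return [(x, sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME)
            for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x=0; needs at least k distinct shares."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num = den = 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = num * (-xm) % PRIME
                den = den * (xj - xm) % PRIME
        secret = (secret + yj * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

&lt;p&gt;With K=3 and N=5, any three shares reconstruct the secret; any two interpolate a line whose value at zero is essentially a uniformly random field element.&lt;/p&gt;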

&lt;p&gt;&lt;strong&gt;2. Consumable Budget Tokens&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every capability token has a budget that decreases with each use. Not just a TTL (time-to-live) — an actual consumption counter. When the budget hits zero, the token is cryptographically dead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Token: {scope: "crm.update", budget: 50, ttl: 3600s}
After 50 updates → token is exhausted, regardless of TTL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OAuth gives you a &lt;em&gt;time-bounded key&lt;/em&gt;. We give you a &lt;em&gt;consumable permit&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Distributed Execution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agent's logic is split into segments distributed across workers. No single worker sees the full business logic. Even if a worker is compromised, it only has a fragment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Formal Verification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All safety properties are proven in TLA+ (not just tested — &lt;em&gt;proven&lt;/em&gt;). 159 formal verification checks. Zero violations. The core theorem:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;For any K, N where 1 &amp;lt; K ≤ N: no coalition of fewer than K nodes can authorize a critical action.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This isn't tested for K=3, N=5 and hoped to generalize. It's proven for &lt;em&gt;arbitrary&lt;/em&gt; K and N.&lt;/p&gt;




&lt;h2&gt;
  
  
  The composition is the innovation
&lt;/h2&gt;

&lt;p&gt;Let me be honest: &lt;strong&gt;none of these components are novel individually&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shamir Secret Sharing? 1979.&lt;/li&gt;
&lt;li&gt;Capability-based security? 1966.&lt;/li&gt;
&lt;li&gt;Threshold signatures? 1991.&lt;/li&gt;
&lt;li&gt;Rate limiting? Ancient.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What's novel is the &lt;em&gt;composition&lt;/em&gt; — assembling these into a framework specifically designed to control actors that may be smarter than the control system. No existing system does this.&lt;/p&gt;

&lt;p&gt;The closest analogy: HTTPS is not novel. TCP, TLS, X.509, and HTTP all existed. The innovation was composing them into a standard that made secure web communication the default.&lt;/p&gt;

&lt;p&gt;We're trying to do the same for AI agent authority.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where we are
&lt;/h2&gt;

&lt;p&gt;IC-AGI is at Phase 8 of 10. What's working:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ 273 tests passing (unit, integration, adversarial, formal)&lt;/li&gt;
&lt;li&gt;✅ 159 formal verifications (TLA+ model checking + TLAPS proofs)&lt;/li&gt;
&lt;li&gt;✅ Kubernetes deployment manifests (GKE-ready)&lt;/li&gt;
&lt;li&gt;✅ Every adversarial attack vector tested and blocked&lt;/li&gt;
&lt;li&gt;✅ FastAPI service with REST endpoints&lt;/li&gt;
&lt;li&gt;🔄 Governance protocols (Phase 9 — in progress)&lt;/li&gt;
&lt;li&gt;📋 Production hardening (Phase 10 — planned)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The repo is open source under Apache 2.0: &lt;a href="https://github.com/saezbaldo/ic-agi" rel="noopener noreferrer"&gt;github.com/saezbaldo/ic-agi&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd like from you
&lt;/h2&gt;

&lt;p&gt;I'm not an AI safety researcher. I'm a software engineer who saw a gap and started building. If you work in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Distributed systems&lt;/strong&gt; — our consensus model needs review&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cryptography&lt;/strong&gt; — we use standard primitives but the composition is novel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Formal methods&lt;/strong&gt; — our TLA+ specs could use more eyes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI agent development&lt;/strong&gt; — you know better than anyone where the trust breaks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'd genuinely appreciate your perspective. Open an issue, submit a PR, or just tell me I'm wrong about something. The problem is too important for one team.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;IC-AGI is not about hiding code from intelligence. It is about separating intelligence from authority.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>distributedsystems</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
