Damian Saez
Your AI Guardrails Fail the Kerckhoffs Test

Most AI agent frameworks assume one thing: the agent won't figure out how its guardrails work.

That assumption is already wrong. And it's going to get worse every quarter.

GPT-4, Claude, and Gemini can already reason about code, model their own constraints, and find edge cases. If your control system depends on the agent not understanding how it works, you're building on borrowed time.

Today I want to show why the dominant approach to AI agent safety (output filtering, tool allowlists, system prompts) breaks down the moment the agent can reason about its own constraints. And what the alternative looks like.

Your guardrails are not secrets

Consider how most AI guardrails work today:

  1. A system prompt says "don't do X"
  2. An output filter checks for patterns matching X
  3. A tool allowlist restricts which functions the agent can call

Now consider what a sufficiently capable agent knows:

  • It can read (or infer) the system prompt
  • It can test what patterns the output filter catches
  • It can enumerate the available tools and their parameters
  • It can reason about the gap between what's intended and what's enforced

This isn't theoretical. Any model capable of multi-step planning is capable of modeling its own constraints. The question isn't if it will understand the guardrails. The question is when.

The Kerckhoffs problem

In 1883, Auguste Kerckhoffs established a principle that every cryptographer takes as gospel: a system should be secure even if everything about it is known, except the key.

Applied to AI agents: your authorization system should be secure even if the agent has perfect knowledge of how it works.

Let's test the major frameworks against this principle:

| Framework | Agent knows the rules? | Still secure? |
| --- | --- | --- |
| System prompt restrictions | Yes (it receives the prompt) | No. Can craft edge cases. |
| Output filtering (Guardrails AI) | Can probe by trial and error | Partially. Oracle attacks degrade it. |
| Tool allowlists | Yes (it sees available tools) | No. Can chain allowed tools into unintended outcomes. |
| Rate limiting | Can observe its own throughput | No. Maximizes damage within the limit. |
| K-of-N threshold authorization | Yes (understands the protocol) | Yes. Knowing doesn't help. |

The last row is the key insight. Threshold authorization is secure under Kerckhoffs' principle. The agent can understand exactly how Shamir Secret Sharing works, study the K-of-N protocol, and memorize the code. None of that knowledge helps it forge K independent approvals that it doesn't control.
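To make that concrete, here is a minimal, illustrative sketch of Shamir Secret Sharing (not IC-AGI's implementation, and not production-grade crypto). The agent can read every line of this and it changes nothing: reconstruction requires K actual shares, which the agent does not hold.

```python
# Illustrative Shamir Secret Sharing over a prime field.
# Knowing this code does not help an entity holding fewer than k shares.
import random

_PRIME = 2**127 - 1  # Mersenne prime; all arithmetic is mod this field


def split(secret: int, k: int, n: int) -> list[tuple[int, int]]:
    """Split `secret` into n shares; any k of them reconstruct it."""
    coeffs = [secret] + [random.randrange(_PRIME) for _ in range(k - 1)]

    def poly(x):
        return sum(c * pow(x, i, _PRIME) for i, c in enumerate(coeffs)) % _PRIME

    return [(x, poly(x)) for x in range(1, n + 1)]


def reconstruct(shares: list[tuple[int, int]]) -> int:
    """Lagrange interpolation at x=0 recovers the secret from k shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % _PRIME
                den = (den * (xi - xj)) % _PRIME
        secret = (secret + yi * num * pow(den, -1, _PRIME)) % _PRIME
    return secret


shares = split(123456789, k=3, n=5)
assert reconstruct(shares[:3]) == 123456789   # any 3 of the 5 shares suffice
assert reconstruct(shares[1:4]) == 123456789
```

With fewer than k shares, every candidate secret is equally consistent with what the agent sees. The barrier is information-theoretic, not obscurity.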

This is the difference between a logical constraint and a structural one.

Logical vs structural constraints

A logical constraint says: "If the output matches pattern X, block it."

A structural constraint says: "This action requires 3 independent cryptographic approvals. The agent controls 0 of them."

The difference matters because intelligent adversaries can defeat logical constraints. They probe, find patterns, exploit edge cases. That's what intelligence does. But structural constraints hold regardless of the adversary's intelligence.

More concretely:

```
Logical:    if "transfer" in action and amount > 10000: block()
Structural: requires 3-of-5 threshold signatures from independent nodes
```

The logical constraint can be bypassed by splitting into two $9,999 transfers, or by using a synonym, or by finding the one edge case the rule didn't anticipate. The structural constraint cannot be bypassed without compromising 3 independent nodes. Understanding the rule doesn't help.
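The bypass described above takes four lines to demonstrate. This is a hypothetical guardrail written for illustration, not any framework's actual filter:

```python
# A hypothetical logical guardrail of the kind described above.
def guardrail_blocks(action: str, amount: float) -> bool:
    return "transfer" in action and amount > 10_000


# A single $15,000 transfer is blocked...
assert guardrail_blocks("transfer", 15_000)

# ...but the same $15,000 split into two sub-threshold transfers passes,
assert not guardrail_blocks("transfer", 9_999)
assert not guardrail_blocks("transfer", 5_001)

# ...as does a synonym the pattern never anticipated.
assert not guardrail_blocks("wire_out", 15_000)
```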

Defense in depth: what structural authorization looks like

In IC-AGI, an open-source distributed authority framework, we layer five structural defenses. Each one is designed to be effective even when the agent has complete knowledge of the defense.

Layer 1: Threshold Authorization

Critical actions require K-of-N independent approvals. The threshold is enforced at initialization (threshold >= 2, always). Even the agent that wrote the approval request cannot approve its own request.

```python
if threshold < 2:
    raise ValueError("Threshold must be >= 2 (no unilateral authority)")
```

Understanding the protocol is irrelevant. The agent would need to compromise K independent approvers, each potentially in different organizations or jurisdictions. This is a coordination problem, not an intelligence problem.
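A sketch of what such a gate looks like in code (the class and method names here are illustrative, not the actual IC-AGI API):

```python
# Hypothetical K-of-N approval gate, for illustration only.
from dataclasses import dataclass, field


@dataclass
class ThresholdGate:
    threshold: int
    approvers: frozenset          # independent approver identities
    approvals: set = field(default_factory=set)

    def __post_init__(self):
        # Enforced at initialization, as described above.
        if self.threshold < 2:
            raise ValueError("Threshold must be >= 2 (no unilateral authority)")

    def approve(self, approver: str, requester: str) -> None:
        # The requester can never count toward its own quorum.
        if approver == requester:
            raise PermissionError("Requester cannot approve its own request")
        if approver in self.approvers:
            self.approvals.add(approver)

    def authorized(self) -> bool:
        return len(self.approvals) >= self.threshold


gate = ThresholdGate(threshold=3, approvers=frozenset({"a", "b", "c", "d", "e"}))
gate.approve("a", requester="agent")
gate.approve("b", requester="agent")
assert not gate.authorized()          # 2 of 3 approvals: still blocked
gate.approve("c", requester="agent")
assert gate.authorized()              # quorum met by independent approvers
```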

Layer 2: Anti-Oracle Detection

What if the agent tries to learn the system's decision boundaries by probing?

IC-AGI tracks query patterns per entity using a sliding window. Repeated or systematically varied queries raise a suspicion score. Above the threshold, the entity is blocked automatically.

```python
from dataclasses import dataclass

@dataclass
class AntiOracleConfig:
    window_seconds: float = 120.0
    max_identical_queries: int = 5
    max_similar_queries: int = 15
    suspicion_threshold: float = 0.8
```
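A minimal sketch of the sliding-window mechanism, using the thresholds from the config above (the detector class itself is illustrative, not IC-AGI's code):

```python
# Illustrative sliding-window probe detector.
import time
from collections import deque


class AntiOracleDetector:
    def __init__(self, window_seconds=120.0, max_identical=5, threshold=0.8):
        self.window = window_seconds
        self.max_identical = max_identical
        self.threshold = threshold
        self.history = deque()  # (timestamp, query) pairs

    def record(self, query, now=None):
        """Record a query; return True if the entity should be blocked."""
        now = time.monotonic() if now is None else now
        self.history.append((now, query))
        # Evict entries that have fallen out of the sliding window.
        while self.history and now - self.history[0][0] > self.window:
            self.history.popleft()
        identical = sum(1 for _, q in self.history if q == query)
        suspicion = min(1.0, identical / self.max_identical)
        return suspicion >= self.threshold


det = AntiOracleDetector()
blocked = [det.record("can I call transfer(9999)?", now=float(i)) for i in range(6)]
assert not blocked[0]   # a single query is harmless
assert blocked[-1]      # repeated identical probes trip the detector
```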

The agent knows this detection exists? Fine. It can slow down its probing. But the budget tokens will run out before it gathers enough data, because...

Layer 3: Consumable Budget Tokens

Every capability token has a finite budget that decreases with each use. Not a rate limit (which the agent can optimize against). A hard cap.

```
Token: {scope: "crm.update", budget: 50, ttl: 3600s}
After 50 uses: token is dead. Not rate-limited. Dead.
```

The anti-oracle detector catches fast probing. The budget cap kills slow probing. Together, they make oracle attacks economically infeasible regardless of the agent's strategy.
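The mechanics are simple enough to sketch in a few lines (again illustrative, not the actual token implementation):

```python
# Illustrative consumable capability token: a hard cap, not a rate limit.
import time
from dataclasses import dataclass


@dataclass
class CapabilityToken:
    scope: str
    budget: int
    expires_at: float

    def spend(self) -> bool:
        """Consume one use; return False once the token is dead or expired."""
        if self.budget <= 0 or time.monotonic() > self.expires_at:
            return False
        self.budget -= 1
        return True


token = CapabilityToken("crm.update", budget=50, expires_at=time.monotonic() + 3600)
uses = sum(token.spend() for _ in range(60))
assert uses == 50          # uses 51-60 all fail
assert not token.spend()   # the token stays dead; there is no refill
```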

Layer 4: Circuit Breaker Isolation

If a worker starts behaving anomalously (high error rate, unexpected patterns), the circuit breaker trips and isolates it completely.

```
CLOSED (healthy) -> failure rate > threshold -> OPEN (blocked)
                                                    |
                recovery timeout                    v
CLOSED <-- probe succeeds <-- HALF_OPEN (one probe allowed)
```

The agent can understand the circuit breaker perfectly. But understanding it doesn't help: the breaker responds to behavior, not to intent. If the agent behaves normally, the breaker stays closed and the agent is constrained by the other layers. If it behaves abnormally, the breaker trips and it's isolated.

This creates what game theorists call a dominant strategy equilibrium: the agent's best move is to behave normally, regardless of what it knows about the system.
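The state machine above can be sketched directly; the parameter names here are illustrative assumptions, not IC-AGI's configuration:

```python
# Illustrative circuit breaker implementing the state diagram above.
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        if self.state == "OPEN" and now - self.opened_at >= self.recovery_timeout:
            self.state = "HALF_OPEN"   # one probe allowed after the timeout
        return self.state != "OPEN"

    def record(self, success: bool, now=None) -> None:
        now = time.monotonic() if now is None else now
        if success:
            self.state, self.failures = "CLOSED", 0
        else:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state, self.opened_at = "OPEN", now


breaker = CircuitBreaker()
for _ in range(3):
    breaker.record(success=False, now=0.0)
assert breaker.state == "OPEN" and not breaker.allow(now=1.0)  # isolated
assert breaker.allow(now=31.0)                  # timeout elapsed: one probe
breaker.record(success=True, now=31.0)
assert breaker.state == "CLOSED"                # probe succeeded: recovered
```

Note that nothing here inspects the agent's reasoning; the breaker only ever sees success and failure events, which is exactly why knowing its internals buys the agent nothing.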

Layer 5: Sandboxed Execution

When the agent's code actually runs, it executes inside a sandbox that uses AST-level whitelisting. Before a single byte of code executes, every AST node is checked against a safe set. No imports. No exec. No eval. No filesystem. No network.

```python
_FORBIDDEN_NAMES = frozenset({
    "import", "__import__", "exec", "eval", "compile",
    "open", "input", "os", "sys", "subprocess", "socket",
    "pathlib", "io", "ctypes", "signal", ...
})
```

This isn't a filter that can be bypassed with creative encoding. It's a whitelist. If it's not explicitly in the safe set, it doesn't run. Period.
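The core idea can be sketched with the standard-library `ast` module. The safe set below is a deliberately tiny illustration, not IC-AGI's actual whitelist:

```python
# Sketch of AST-level whitelisting: reject any node type or name
# outside an explicit safe set, before a single byte executes.
import ast

# Illustrative safe set: bare arithmetic and assignment only.
_SAFE_NODES = (ast.Module, ast.Expr, ast.Assign, ast.BinOp,
               ast.Add, ast.Sub, ast.Mult, ast.Constant,
               ast.Name, ast.Load, ast.Store)
_FORBIDDEN_NAMES = frozenset({"__import__", "exec", "eval", "compile", "open"})


def is_safe(source: str) -> bool:
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        # Anything not explicitly whitelisted is rejected.
        if not isinstance(node, _SAFE_NODES):
            return False
        if isinstance(node, ast.Name) and node.id in _FORBIDDEN_NAMES:
            return False
    return True


assert is_safe("x = 2 * 21")                  # plain arithmetic: allowed
assert not is_safe("import os")               # Import node: not whitelisted
assert not is_safe("eval('1+1')")             # Call node: not whitelisted
assert not is_safe("open('/etc/passwd')")     # ditto, plus a forbidden name
```

Because `Call` and `Import` nodes are simply absent from the safe set, encoding tricks that fool a pattern filter ("e" + "val", base64 payloads) never get a chance to run: the string that would decode them is itself a call.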

The composition matters

Each layer alone is defeatable by a sufficiently capable agent. The composition is not.

An agent that understands threshold auth can try to probe the system. The anti-oracle detector catches probing. An agent that slows its probing to avoid detection runs out of budget tokens. An agent that tries to exploit a worker is caught by the circuit breaker. An agent that tries to run malicious code hits the AST whitelist.

Every escape path from one layer leads into the enforcement zone of another. That's not a coincidence. That's the architecture.

None of this is individually novel

Let me be upfront: Shamir Secret Sharing (1979), capability-based security (1966), threshold signatures (1991), circuit breakers (2007). All prior art.

The contribution is the composition for a specific new adversary: actors that understand the control system perfectly and can reason about how to circumvent it.

The closest analogy: HTTPS wasn't novel. TCP, TLS, X.509, HTTP all existed. The innovation was composing them into a standard that made secure web communication the default. We're doing the same for AI agent authority.

Why this matters now

The gap between model capability and guardrail sophistication is closing every quarter. The next generation of models will be better at reasoning about constraints, not worse.

The question isn't whether to adopt structural authorization. It's whether you do it before or after the first major incident.

IC-AGI is open source under Apache 2.0: github.com/saezbaldo/ic-agi

273 tests. 159 formal verifications. Zero safety violations. Designed from the ground up for adversaries that understand the system perfectly.

If you see where the composition breaks down, I want to hear it. Open an issue or leave a comment.

Previously: Every AI Agent Framework Trusts the Agent. That's the Problem.
Next: Consumable Budget Tokens: OAuth for AI Agents
