<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Damian Saez</title>
    <description>The latest articles on DEV Community by Damian Saez (@saezbaldo).</description>
    <link>https://dev.to/saezbaldo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3779487%2Fef6d90c8-23c5-4f06-af8a-cdc8f7525b1e.png</url>
      <title>DEV Community: Damian Saez</title>
      <link>https://dev.to/saezbaldo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/saezbaldo"/>
    <language>en</language>
    <item>
      <title>Your AI Guardrails Fail the Kerckhoffs Test</title>
      <dc:creator>Damian Saez</dc:creator>
      <pubDate>Thu, 19 Feb 2026 20:32:38 +0000</pubDate>
      <link>https://dev.to/saezbaldo/what-happens-when-an-ai-agent-understands-its-own-guardrails-25al</link>
      <guid>https://dev.to/saezbaldo/what-happens-when-an-ai-agent-understands-its-own-guardrails-25al</guid>
      <description>&lt;p&gt;Most AI agent frameworks assume one thing: the agent won't figure out how its guardrails work.&lt;/p&gt;

&lt;p&gt;That assumption is already wrong. And it's going to get worse every quarter.&lt;/p&gt;

&lt;p&gt;GPT-4, Claude, and Gemini can already reason about code, model constraints, and find edge cases. If your control system depends on the agent &lt;em&gt;not understanding&lt;/em&gt; how it works, you're building on borrowed time.&lt;/p&gt;

&lt;p&gt;Today I want to show why the dominant approach to AI agent safety (output filtering, tool allowlists, system prompts) breaks down the moment the agent can reason about its own constraints. And what the alternative looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your guardrails are not secrets
&lt;/h2&gt;

&lt;p&gt;Consider how most AI guardrails work today:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A system prompt says "don't do X"&lt;/li&gt;
&lt;li&gt;An output filter checks for patterns matching X&lt;/li&gt;
&lt;li&gt;A tool allowlist restricts which functions the agent can call&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now consider what a sufficiently capable agent knows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It can read (or infer) the system prompt&lt;/li&gt;
&lt;li&gt;It can test what patterns the output filter catches&lt;/li&gt;
&lt;li&gt;It can enumerate the available tools and their parameters&lt;/li&gt;
&lt;li&gt;It can &lt;strong&gt;reason&lt;/strong&gt; about the gap between what's intended and what's enforced&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't theoretical. Any model capable of multi-step planning is capable of modeling its own constraints. The question isn't &lt;em&gt;if&lt;/em&gt; it will understand the guardrails. The question is &lt;em&gt;when&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Kerckhoffs problem
&lt;/h2&gt;

&lt;p&gt;In 1883, Auguste Kerckhoffs established a principle that every cryptographer takes as gospel: &lt;strong&gt;a system should be secure even if everything about it is known, except the key.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Applied to AI agents: your authorization system should be secure even if the agent has perfect knowledge of how it works.&lt;/p&gt;

&lt;p&gt;Let's test the major frameworks against this principle:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Agent knows the rules?&lt;/th&gt;
&lt;th&gt;Still secure?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System prompt restrictions&lt;/td&gt;
&lt;td&gt;Yes (it receives the prompt)&lt;/td&gt;
&lt;td&gt;No. Can craft edge cases.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output filtering (Guardrails AI)&lt;/td&gt;
&lt;td&gt;Can probe by trial and error&lt;/td&gt;
&lt;td&gt;Partially. Oracle attacks degrade it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool allowlists&lt;/td&gt;
&lt;td&gt;Yes (it sees available tools)&lt;/td&gt;
&lt;td&gt;No. Can chain allowed tools into unintended outcomes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limiting&lt;/td&gt;
&lt;td&gt;Can observe its own throughput&lt;/td&gt;
&lt;td&gt;No. Maximizes damage within the limit.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;K-of-N threshold authorization&lt;/td&gt;
&lt;td&gt;Yes (understands the protocol)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes. Knowing doesn't help.&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The last row is the key insight. Threshold authorization is secure under Kerckhoffs' principle. The agent can understand exactly how Shamir Secret Sharing works, study the K-of-N protocol, and memorize the code. None of that knowledge helps it forge K independent approvals that it doesn't control.&lt;/p&gt;

&lt;p&gt;This is the difference between a &lt;strong&gt;logical&lt;/strong&gt; constraint and a &lt;strong&gt;structural&lt;/strong&gt; one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logical vs structural constraints
&lt;/h2&gt;

&lt;p&gt;A logical constraint says: "If the output matches pattern X, block it."&lt;/p&gt;

&lt;p&gt;A structural constraint says: "This action requires 3 independent cryptographic approvals. The agent controls 0 of them."&lt;/p&gt;

&lt;p&gt;The difference matters because intelligent adversaries can defeat logical constraints. They probe, find patterns, exploit edge cases. That's what intelligence &lt;em&gt;does&lt;/em&gt;. But structural constraints hold regardless of the adversary's intelligence.&lt;/p&gt;

&lt;p&gt;More concretely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Logical:   if "transfer" in action and amount &amp;gt; 10000: block()
Structural: requires 3-of-5 threshold signatures from independent nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The logical constraint can be bypassed by splitting into two $9,999 transfers, or by using a synonym, or by finding the one edge case the rule didn't anticipate. The structural constraint cannot be bypassed without compromising 3 independent nodes. Understanding the rule doesn't help.&lt;/p&gt;
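&lt;p&gt;A toy sketch of the contrast (both functions are hypothetical illustrations, not IC-AGI code): the logical guard encodes a pattern, the structural guard counts independent approvals.&lt;/p&gt;

```python
# Toy contrast between a logical rule and a structural rule.
# Both functions are hypothetical illustrations, not IC-AGI code.

def logical_guard(action: str, amount: int) -> bool:
    """Allow the action unless it matches a blocked pattern."""
    return not ("transfer" in action and amount > 10000)

def structural_guard(approvals: set, required: int = 3) -> bool:
    """Allow the action only if `required` distinct approvals exist."""
    return len(approvals) >= required

# The logical rule passes two $9,999 transfers that together exceed the cap:
assert logical_guard("transfer funds", 9999)
assert logical_guard("transfer funds", 9999)

# The structural rule is indifferent to phrasing; without three independent
# approvals the action simply cannot proceed:
assert not structural_guard({"node-a", "node-b"})
assert structural_guard({"node-a", "node-b", "node-c"})
```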

&lt;h2&gt;
  
  
  Defense in depth: what structural authorization looks like
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://github.com/saezbaldo/ic-agi" rel="noopener noreferrer"&gt;IC-AGI&lt;/a&gt;, an open-source distributed authority framework, we layer five structural defenses. Each one is designed to be effective even when the agent has complete knowledge of the defense.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Threshold Authorization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Critical actions require K-of-N independent approvals. The threshold is enforced at initialization (&lt;code&gt;threshold &amp;gt;= 2&lt;/code&gt;, always). Even the agent that &lt;em&gt;wrote&lt;/em&gt; the approval request cannot approve its own request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Threshold must be &amp;gt;= 2 (no unilateral authority)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Understanding the protocol is irrelevant. The agent would need to compromise K independent approvers, each potentially in different organizations or jurisdictions. This is a &lt;em&gt;coordination&lt;/em&gt; problem, not an &lt;em&gt;intelligence&lt;/em&gt; problem.&lt;/p&gt;
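&lt;p&gt;Both invariants (a floor of 2 on the threshold, and no self-approval) fit in a few lines. The class and method names below are illustrative, not the IC-AGI API:&lt;/p&gt;

```python
# Illustrative sketch of the two K-of-N invariants (not the IC-AGI API):
# threshold >= 2 at construction, and the requester's own vote never counts.

class ThresholdAuthorizer:
    def __init__(self, approvers: set, threshold: int):
        if threshold < 2:
            raise ValueError("Threshold must be >= 2 (no unilateral authority)")
        if threshold > len(approvers):
            raise ValueError("Threshold cannot exceed the number of approvers")
        self.approvers = approvers
        self.threshold = threshold

    def authorized(self, requester: str, approvals: set) -> bool:
        # Only votes from registered approvers count, never the requester's.
        valid = (approvals & self.approvers) - {requester}
        return len(valid) >= self.threshold
```

&lt;p&gt;A 3-of-5 instance rejects any request backed by fewer than three approvers other than the requester itself, no matter how well the requester understands this code.&lt;/p&gt;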

&lt;p&gt;&lt;strong&gt;Layer 2: Anti-Oracle Detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What if the agent tries to learn the system's decision boundaries by probing?&lt;/p&gt;

&lt;p&gt;IC-AGI tracks query patterns per entity using a sliding window. Repeated or systematically varied queries raise a suspicion score. Above the threshold, the entity is blocked automatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AntiOracleConfig&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;window_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;120.0&lt;/span&gt;
    &lt;span class="n"&gt;max_identical_queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
    &lt;span class="n"&gt;max_similar_queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;
    &lt;span class="n"&gt;suspicion_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
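&lt;p&gt;A minimal sketch of such a sliding-window detector. The scoring rule here (identical queries in the window divided by the cap) is an assumption for illustration; the real detector also tracks similar queries:&lt;/p&gt;

```python
import time
from collections import defaultdict, deque

# Sliding-window probe detector, sketched under an assumed scoring rule:
# suspicion = identical queries in the window / max_identical_queries.
class AntiOracleDetector:
    def __init__(self, window_seconds=120.0, max_identical_queries=5,
                 suspicion_threshold=0.8):
        self.window = window_seconds
        self.max_identical = max_identical_queries
        self.threshold = suspicion_threshold
        self.history = defaultdict(deque)  # entity -> deque of (timestamp, query)

    def record(self, entity: str, query: str, now: float = None) -> bool:
        """Record a query; return True if the entity should be blocked."""
        now = time.monotonic() if now is None else now
        q = self.history[entity]
        q.append((now, query))
        while q and now - q[0][0] > self.window:  # evict entries outside window
            q.popleft()
        identical = sum(1 for _, past in q if past == query)
        suspicion = min(1.0, identical / self.max_identical)
        return suspicion >= self.threshold
```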



&lt;p&gt;The agent knows this detection exists? Fine. It can slow down its probing. But the budget tokens will run out before it gathers enough data, because...&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: Consumable Budget Tokens&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every capability token has a finite budget that decreases with each use. Not a rate limit (which the agent can optimize against). A hard cap.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Token: {scope: "crm.update", budget: 50, ttl: 3600s}
After 50 uses: token is dead. Not rate-limited. Dead.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
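&lt;p&gt;In sketch form (field names mirror the token above; a real token would be cryptographically signed and verified server-side, which this toy version omits):&lt;/p&gt;

```python
import time

class BudgetToken:
    """Capability token with a hard consumption cap. Illustrative only:
    a production token would be signed and verified server-side."""

    def __init__(self, scope: str, budget: int, ttl_seconds: float):
        self.scope = scope
        self.budget = budget
        self.expires_at = time.monotonic() + ttl_seconds

    def consume(self) -> None:
        if time.monotonic() > self.expires_at:
            raise PermissionError("token expired")
        if self.budget <= 0:
            raise PermissionError("budget exhausted: token is dead")
        self.budget -= 1
```

&lt;p&gt;There is no refill path. Once &lt;code&gt;budget&lt;/code&gt; reaches zero, every further call fails, however cleverly the calls are spaced out.&lt;/p&gt;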



&lt;p&gt;The anti-oracle detector catches fast probing. The budget cap kills slow probing. Together, they make oracle attacks economically infeasible regardless of the agent's strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 4: Circuit Breaker Isolation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If a worker starts behaving anomalously (high error rate, unexpected patterns), the circuit breaker trips and isolates it completely.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLOSED (healthy) -&amp;gt; failure rate &amp;gt; threshold -&amp;gt; OPEN (blocked)
                                                    |
                recovery timeout                    v
CLOSED &amp;lt;-- probe succeeds &amp;lt;-- HALF_OPEN (one probe allowed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent can understand the circuit breaker perfectly. But understanding it doesn't help: the breaker responds to &lt;em&gt;behavior&lt;/em&gt;, not to &lt;em&gt;intent&lt;/em&gt;. If the agent behaves normally, the breaker stays closed and the agent is constrained by the other layers. If it behaves abnormally, the breaker trips and it's isolated.&lt;/p&gt;

&lt;p&gt;This creates what game theorists call a &lt;strong&gt;dominant strategy equilibrium&lt;/strong&gt;: the agent's best move is to behave normally, regardless of what it knows about the system.&lt;/p&gt;
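&lt;p&gt;The state machine above, sketched with assumed parameters (a simple failure count and a recovery timeout; the real breaker may use error rates and richer signals):&lt;/p&gt;

```python
import time

# Sketch of the CLOSED -> OPEN -> HALF_OPEN machine from the diagram above.
# Thresholds and method names are illustrative assumptions.
class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, max_failures=3, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.state = self.CLOSED
        self.failures = 0
        self.max_failures = max_failures
        self.recovery_timeout = recovery_timeout
        self.opened_at = None
        self.clock = clock  # injectable for testing

    def allow(self) -> bool:
        if self.state == self.OPEN:
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = self.HALF_OPEN  # one probe allowed
                return True
            return False
        return True

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            self.state = self.CLOSED
        else:
            self.failures += 1
            if self.state == self.HALF_OPEN or self.failures >= self.max_failures:
                self.state = self.OPEN
                self.opened_at = self.clock()
```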

&lt;p&gt;&lt;strong&gt;Layer 5: Sandboxed Execution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When the agent's code actually runs, it executes inside a sandbox that uses AST-level whitelisting. Before a single byte of code executes, every AST node is checked against a safe set. No imports. No &lt;code&gt;exec&lt;/code&gt;. No &lt;code&gt;eval&lt;/code&gt;. No filesystem. No network.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_FORBIDDEN_NAMES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;import&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__import__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exec&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;os&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subprocess&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;socket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pathlib&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ctypes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;signal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't a filter that can be bypassed with creative encoding. It's a whitelist. If it's not explicitly in the safe set, it doesn't run. Period.&lt;/p&gt;
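&lt;p&gt;A toy version of the idea, using Python's &lt;code&gt;ast&lt;/code&gt; module. The safe set below is a tiny illustrative subset, not IC-AGI's actual whitelist:&lt;/p&gt;

```python
import ast

# Tiny illustrative safe set: assignment and arithmetic only.
_SAFE_NODES = (
    ast.Module, ast.Expr, ast.Assign, ast.Name, ast.Load, ast.Store,
    ast.Constant, ast.BinOp, ast.Add, ast.Sub, ast.Mult, ast.Div,
)

def check(source: str) -> None:
    """Reject code containing any AST node outside the safe set,
    before a single byte of it executes."""
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, _SAFE_NODES):
            raise PermissionError(f"forbidden node: {type(node).__name__}")

check("x = 1 + 2")  # passes: only whitelisted nodes

try:
    check("__import__('os')")  # Call is not in the safe set
except PermissionError:
    pass
```

&lt;p&gt;Creative encoding doesn't help: whatever the source looks like, it must parse into nodes, and every node is checked.&lt;/p&gt;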

&lt;h2&gt;
  
  
  The composition matters
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Each layer alone is defeatable by a sufficiently capable agent. The composition is not.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An agent that understands threshold auth can try to probe the system. The anti-oracle detector catches probing. An agent that slows its probing to avoid detection runs out of budget tokens. An agent that tries to exploit a worker is caught by the circuit breaker. An agent that tries to run malicious code hits the AST whitelist.&lt;/p&gt;

&lt;p&gt;Every escape path from one layer leads into the enforcement zone of another. That's not a coincidence. That's the architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  None of this is individually novel
&lt;/h2&gt;

&lt;p&gt;Let me be upfront: Shamir Secret Sharing (1979), capability-based security (1966), threshold signatures (1991), circuit breakers (2007). All prior art.&lt;/p&gt;

&lt;p&gt;The contribution is the &lt;em&gt;composition&lt;/em&gt; for a specific new adversary: actors that understand the control system perfectly and can reason about how to circumvent it.&lt;/p&gt;

&lt;p&gt;The closest analogy: HTTPS wasn't novel. TCP, TLS, X.509, HTTP all existed. The innovation was composing them into a standard that made secure web communication the default. We're doing the same for AI agent authority.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters now
&lt;/h2&gt;

&lt;p&gt;Model capability is outpacing guardrail sophistication, and the gap widens every quarter. The next generation of models will be better at reasoning about constraints, not worse.&lt;/p&gt;

&lt;p&gt;The question isn't whether to adopt structural authorization. It's whether you do it before or after the first major incident.&lt;/p&gt;

&lt;p&gt;IC-AGI is open source under Apache 2.0: &lt;a href="https://github.com/saezbaldo/ic-agi" rel="noopener noreferrer"&gt;github.com/saezbaldo/ic-agi&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;273 tests. 159 formal verifications. Zero safety violations. Designed from the ground up for adversaries that understand the system perfectly.&lt;/p&gt;

&lt;p&gt;If you see where the composition breaks down, I want to hear it. Open an issue or leave a comment.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previously: &lt;a href="https://dev.to/saezbaldo/every-ai-agent-framework-trusts-the-agent-thats-the-problem-5gfa"&gt;Every AI Agent Framework Trusts the Agent. That's the Problem.&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Next: Consumable Budget Tokens: OAuth for AI Agents&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>agents</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Every AI Agent Framework Trusts the Agent. That's the Problem.</title>
      <dc:creator>Damian Saez</dc:creator>
      <pubDate>Wed, 18 Feb 2026 13:10:27 +0000</pubDate>
      <link>https://dev.to/saezbaldo/every-ai-agent-framework-trusts-the-agent-thats-the-problem-5gfa</link>
      <guid>https://dev.to/saezbaldo/every-ai-agent-framework-trusts-the-agent-thats-the-problem-5gfa</guid>
      <description>&lt;p&gt;Every AI agent framework trusts the agent.&lt;/p&gt;

&lt;p&gt;LangChain. AutoGen. CrewAI. Anthropic Tool Use. OpenAI Function Calling. Every single one.&lt;/p&gt;

&lt;p&gt;They validate &lt;em&gt;outputs&lt;/em&gt;. They filter &lt;em&gt;responses&lt;/em&gt;. They scope &lt;em&gt;tools&lt;/em&gt;. But none of them answer a fundamental question: &lt;strong&gt;who authorized this agent to act?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I spent 30 years building software. The last year convinced me this is the most important unsolved problem in AI infrastructure today.&lt;/p&gt;




&lt;h2&gt;
  
  
  The gap nobody talks about
&lt;/h2&gt;

&lt;p&gt;I went through every major AI agent framework and authorization system. Here's what I found:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Authorization model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI Function Calling&lt;/td&gt;
&lt;td&gt;2023&lt;/td&gt;
&lt;td&gt;LLM calls predefined functions&lt;/td&gt;
&lt;td&gt;None. If the function exists, the agent can call it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangChain Tools&lt;/td&gt;
&lt;td&gt;2023&lt;/td&gt;
&lt;td&gt;Agent tool routing&lt;/td&gt;
&lt;td&gt;None. No built-in approval, no budget, no threshold.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic Tool Use&lt;/td&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;td&gt;Constrained tool execution&lt;/td&gt;
&lt;td&gt;Provider-side only. Not infrastructure-level.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Microsoft AutoGen&lt;/td&gt;
&lt;td&gt;2023&lt;/td&gt;
&lt;td&gt;Multi-agent orchestration&lt;/td&gt;
&lt;td&gt;Agents trust each other. No adversarial model.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CrewAI&lt;/td&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;td&gt;Multi-agent task framework&lt;/td&gt;
&lt;td&gt;No threshold auth. No formal properties.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guardrails AI&lt;/td&gt;
&lt;td&gt;2023&lt;/td&gt;
&lt;td&gt;Output validation&lt;/td&gt;
&lt;td&gt;Validates &lt;em&gt;outputs&lt;/em&gt;, not &lt;em&gt;authority to act&lt;/em&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Not a single one implements threshold authorization, consumable budget tokens, or formal verification of safety properties.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All of them assume the agent is trusted, or that filtering its output is good enough.&lt;/p&gt;

&lt;p&gt;It's not.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why output filtering isn't enough
&lt;/h2&gt;

&lt;p&gt;Let's be precise about what happens when an AI agent has tool access:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent → decides action → calls tool → effect happens in the real world
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output validation sits here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent → decides action → [FILTER] → calls tool → effect happens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It checks: &lt;em&gt;"Is this output safe?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But it doesn't check: &lt;em&gt;"Does this agent have the authority to take this action right now, with this budget, approved by enough independent parties?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These are fundamentally different questions. The first is about content. The second is about &lt;strong&gt;power&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An agent can craft a perfectly valid, perfectly formatted, perfectly "safe-looking" API call that transfers $50,000 to the wrong account. Output validation will pass it. Guardrails will pass it. The action itself is structurally valid.&lt;/p&gt;

&lt;p&gt;The question isn't whether the action &lt;em&gt;looks&lt;/em&gt; safe. The question is whether the agent &lt;em&gt;should have the power to do it at all&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The speed problem
&lt;/h2&gt;

&lt;p&gt;This matters more than most people realize because of the speed differential:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Actor&lt;/th&gt;
&lt;th&gt;Actions per second&lt;/th&gt;
&lt;th&gt;Detection window&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Human employee&lt;/td&gt;
&lt;td&gt;~0.02 (1/minute)&lt;/td&gt;
&lt;td&gt;Hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automated script&lt;/td&gt;
&lt;td&gt;~100&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI agent (unconstrained)&lt;/td&gt;
&lt;td&gt;~1,000&lt;/td&gt;
&lt;td&gt;Seconds — but damage is done&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI agent + budget enforcement&lt;/td&gt;
&lt;td&gt;Budget-capped&lt;/td&gt;
&lt;td&gt;Structural — budget exhaustion stops the agent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Traditional rate limiting slows things down. But it doesn't stop a determined agent from maximizing damage &lt;em&gt;within&lt;/em&gt; the limit. &lt;/p&gt;

&lt;p&gt;What if the budget was cryptographically enforced? When the token hits zero, it's dead. Not rate-limited. Dead.&lt;/p&gt;




&lt;h2&gt;
  
  
  The reasoning problem
&lt;/h2&gt;

&lt;p&gt;Here's the part that keeps me up at night.&lt;/p&gt;

&lt;p&gt;A sufficiently capable agent can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read the source code of whatever controls it&lt;/li&gt;
&lt;li&gt;Understand the policy rules completely&lt;/li&gt;
&lt;li&gt;Find edge cases that technically pass validation&lt;/li&gt;
&lt;li&gt;Chain multiple low-risk actions into a high-risk outcome&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So what do you do when the adversary &lt;em&gt;understands your defenses perfectly&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;You make the defense structural, not logical.&lt;/p&gt;

&lt;p&gt;Analogy: I can understand exactly how a bank vault door works. I can memorize the blueprints. I can study the metallurgy. None of that knowledge helps me walk through 12 inches of reinforced steel. The constraint isn't my understanding — it's physics.&lt;/p&gt;

&lt;p&gt;That's the principle we need for AI agents. Not smarter filters. &lt;strong&gt;Structural impossibility.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What we're building
&lt;/h2&gt;

&lt;p&gt;We built &lt;a href="https://github.com/saezbaldo/ic-agi" rel="noopener noreferrer"&gt;IC-AGI&lt;/a&gt; — an open-source distributed authority framework — around one thesis:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Separate intelligence from authority.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The agent can be arbitrarily intelligent. It can understand the entire system. But it structurally &lt;em&gt;cannot&lt;/em&gt; act alone on critical actions because:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. K-of-N Threshold Authorization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Critical actions require approval from K independent nodes out of N. Not a single API key. Not a single admin. A cryptographic consensus using Shamir Secret Sharing.&lt;/p&gt;

&lt;p&gt;If K=3 and N=5, compromising 1 or 2 nodes gives you &lt;em&gt;nothing&lt;/em&gt;. Not partial access. Nothing. This is mathematically proven, not just tested.&lt;/p&gt;
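&lt;p&gt;Why fewer than K shares give nothing: the secret is the constant term of a random degree K-1 polynomial, and K-1 points leave that term uniformly undetermined. A toy sketch with Lagrange interpolation over a prime field (toy parameters, not IC-AGI's implementation, which would use a vetted library):&lt;/p&gt;

```python
import random

PRIME = 2**127 - 1  # toy field for illustration; real deployments use vetted parameters

def split(secret: int, k: int, n: int):
    """Shamir: secret = f(0) for a random degree k-1 polynomial over GF(PRIME)."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    return [(x, sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME)
            for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x=0; needs at least k distinct shares."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num = den = 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = num * (-xm) % PRIME
                den = den * (xj - xm) % PRIME
        secret = (secret + yj * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

&lt;p&gt;With K=3 and N=5, any three shares reconstruct the secret; any two interpolate a line whose value at zero is essentially a uniformly random field element.&lt;/p&gt;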

&lt;p&gt;&lt;strong&gt;2. Consumable Budget Tokens&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every capability token has a budget that decreases with each use. Not just a TTL (time-to-live) — an actual consumption counter. When the budget hits zero, the token is cryptographically dead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Token: {scope: "crm.update", budget: 50, ttl: 3600s}
After 50 updates → token is exhausted, regardless of TTL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OAuth gives you a &lt;em&gt;time-bounded key&lt;/em&gt;. We give you a &lt;em&gt;consumable permit&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Distributed Execution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agent's logic is split into segments distributed across workers. No single worker sees the full business logic. Even if a worker is compromised, it only has a fragment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Formal Verification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All safety properties are proven in TLA+ (not just tested — &lt;em&gt;proven&lt;/em&gt;). 159 formal verification checks. Zero violations. The core theorem:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;For any K, N where 1 &amp;lt; K ≤ N: no coalition of fewer than K nodes can authorize a critical action.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This isn't tested for K=3, N=5 and hoped to generalize. It's proven for &lt;em&gt;arbitrary&lt;/em&gt; K and N.&lt;/p&gt;




&lt;h2&gt;
  
  
  The composition is the innovation
&lt;/h2&gt;

&lt;p&gt;Let me be honest: &lt;strong&gt;none of these components are novel individually&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shamir Secret Sharing? 1979.&lt;/li&gt;
&lt;li&gt;Capability-based security? 1966.&lt;/li&gt;
&lt;li&gt;Threshold signatures? 1991.&lt;/li&gt;
&lt;li&gt;Rate limiting? Ancient.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What's novel is the &lt;em&gt;composition&lt;/em&gt; — assembling these into a framework specifically designed to control actors that may be smarter than the control system. No existing system does this.&lt;/p&gt;

&lt;p&gt;The closest analogy: HTTPS is not novel. TCP, TLS, X.509, and HTTP all existed. The innovation was composing them into a standard that made secure web communication the default.&lt;/p&gt;

&lt;p&gt;We're trying to do the same for AI agent authority.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where we are
&lt;/h2&gt;

&lt;p&gt;IC-AGI is at Phase 8 of 10. What's working:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ 273 tests passing (unit, integration, adversarial, formal)&lt;/li&gt;
&lt;li&gt;✅ 159 formal verifications (TLA+ model checking + TLAPS proofs)&lt;/li&gt;
&lt;li&gt;✅ Kubernetes deployment manifests (GKE-ready)&lt;/li&gt;
&lt;li&gt;✅ Every adversarial attack vector tested and blocked&lt;/li&gt;
&lt;li&gt;✅ FastAPI service with REST endpoints&lt;/li&gt;
&lt;li&gt;🔄 Governance protocols (Phase 9 — in progress)&lt;/li&gt;
&lt;li&gt;📋 Production hardening (Phase 10 — planned)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The repo is open source under Apache 2.0: &lt;a href="https://github.com/saezbaldo/ic-agi" rel="noopener noreferrer"&gt;github.com/saezbaldo/ic-agi&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd like from you
&lt;/h2&gt;

&lt;p&gt;I'm not an AI safety researcher. I'm a software engineer who saw a gap and started building. If you work in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Distributed systems&lt;/strong&gt; — our consensus model needs review&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cryptography&lt;/strong&gt; — we use standard primitives but the composition is novel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Formal methods&lt;/strong&gt; — our TLA+ specs could use more eyes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI agent development&lt;/strong&gt; — you know better than anyone where the trust breaks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'd genuinely appreciate your perspective. Open an issue, submit a PR, or just tell me I'm wrong about something. The problem is too important for one team.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;IC-AGI is not about hiding code from intelligence. It is about separating intelligence from authority.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>distributedsystems</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
