You Can't Verify Intent. Can You Verify Output?

The Zero Trust Paradox at the Frontier of Autonomous AI Agents

By Nathan Maine


There's a question I can't stop thinking about. It sits at the intersection of two ideas that the AI industry treats as compatible but aren't.

The first idea is zero trust architecture. Every action is verified explicitly. No implicit trust. No assumption that because a system was authorized to do something five minutes ago, it's still authorized now. This is the foundation of modern enterprise security and it works well for systems that behave predictably.

The second idea is Level 3 autonomous AI agents. These are systems that explore their environment freely - browsing the web, reading emails, querying databases, executing multi-step plans, running for hours or days without human intervention. They don't follow a predetermined path. They decide their own path at runtime based on what they encounter. And increasingly, they write and execute their own code.

Here's the paradox: zero trust demands that you verify every action explicitly. But how do you explicitly verify the intent of a system that is constantly rewriting its own internal logic?

I don't think anyone has a complete answer yet. But I think we're asking the wrong question, and I want to walk through why.


What Works: Levels 0 Through 2

Before we get to the hard part, it's worth acknowledging that the AI security community has built real solutions for simpler agent architectures.

Level 0 is a single inference call. User asks a question, model answers. The security model here is straightforward: scan the input, scan the output, block anything malicious. Tools like NVIDIA's garak vulnerability scanner do this well. I contribute adversarial probes to garak - one tests whether models fabricate regulatory citations when asked compliance questions (PR 1658), another tests whether attackers can bypass safety filters using Unicode character substitution (PR 1660). At Level 0, these probes catch the failures before deployment.

Level 1 is a chain of deterministic tool calls. The agent follows a predetermined sequence: retrieve data, process it, format the output. The security model adds dataflow tracing - you can manually map every possible path and block untrusted data from reaching sensitive tools. It's tedious but tractable because the paths are enumerable.

Level 2 introduces weak autonomy. The agent chooses which tools to call based on context. Now you need runtime guardrails (like NeMo Guardrails filtering input and output in real time), sandboxing (like NVIDIA's OpenShell isolating each agent at the kernel level using Linux Landlock), and manual approval gates for sensitive actions. The attack surface is larger but still bounded because the agent's autonomy is constrained.
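A manual approval gate at Level 2 can be sketched in a few lines. Everything here is illustrative - the tool names, the policy set, and the `approve` callback are assumptions of mine, not NeMo Guardrails' or any vendor's actual API:

```python
# Hypothetical Level 2 approval gate. Tool names and the approval hook
# are illustrative assumptions, not a real product's interface.
SENSITIVE_TOOLS = {"send_email", "execute_sql", "deploy_code"}

def gated_call(tool_name, args, tool_registry, approve):
    """Run a tool, pausing for human sign-off if it is sensitive."""
    if tool_name in SENSITIVE_TOOLS and not approve(tool_name, args):
        raise PermissionError(f"{tool_name} denied by human reviewer")
    return tool_registry[tool_name](**args)

# A deny-everything approver stands in for the human operator here.
registry = {"search": lambda q: f"results for {q}",
            "send_email": lambda to, body: "sent"}
print(gated_call("search", {"q": "zero trust"}, registry,
                 lambda tool, args: False))
```

The point of the sketch is what it implies about Level 3: the gate only works while the set of sensitive actions stays enumerable.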

These are real, shipping solutions. They work. The industry should be proud of them.


Where Everything Breaks: Level 3

Level 3 is where the security model collapses.

A fully autonomous agent doesn't follow a predetermined path. It explores. It reads a document, decides it needs more context, searches the web, finds a relevant page, summarizes it, realizes the summary contradicts the original document, queries a database to resolve the contradiction, writes a script to analyze the results, executes the script, and uses the output to update its plan. All without a human in the loop.

The threat vector at Level 3 is no longer the user. It's the environment. A compromised web page. A poisoned database entry. A malicious instruction embedded in white text on a PDF that the agent reads as a system command. The agent didn't start malicious. The environment made it malicious, mid-session, through data it ingested autonomously.

The standard defense for this is taint tracing - tagging every piece of data from an untrusted source as "tainted" and blocking any tainted data from reaching high-privilege tools. In theory, this works. In practice, it creates a cascading problem.

When a Level 3 agent enters a reasoning loop - which it will, because that's the whole point of autonomy - every piece of data it processes after touching a tainted source becomes tainted itself. The agent summarizes a tainted web page. The summary is now tainted. The agent uses that summary to formulate a new query. The query is tainted. The query returns results that get incorporated into the agent's reasoning. All tainted. Within minutes, the entire context window is what I'd call "permanently pink."

If you enforce strict taint tracing policies at this point, you trigger a denial of service against your own application. The policy engine flags every subsequent tool call. The human operator drowns in approval requests. The agent's autonomy collapses back to Level 0. You've spent millions of dollars building a system that's functionally equivalent to a chatbot with extra steps.
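The cascade is easy to demonstrate with a toy taint tracker - any value derived from a tainted input is itself tainted. This is my own minimal sketch, not a real policy engine:

```python
# Toy taint tracker: illustrative only, not a production policy engine.
class Tainted:
    """Wraps a value ingested from an untrusted source."""
    def __init__(self, value):
        self.value = value

def is_tainted(*inputs):
    return any(isinstance(x, Tainted) for x in inputs)

def derive(fn, *inputs):
    """Taint propagation rule: any output derived from a tainted
    input is itself tainted."""
    raw = [x.value if isinstance(x, Tainted) else x for x in inputs]
    out = fn(*raw)
    return Tainted(out) if is_tainted(*inputs) else out

page = Tainted("<untrusted web page>")
summary = derive(lambda p: f"summary of {p}", page)    # tainted
query = derive(lambda s: f"query from {s}", summary)   # tainted
clean = derive(lambda a, b: a + b, 1, 2)               # untainted
```

Three steps in, everything downstream of the web page is tainted - the "permanently pink" context window.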


The Deeper Problem: Agents Building Agents

It gets worse. Jensen Huang described this at GTC 2026 as the next industrial revolution in knowledge work - employees "supercharged by teams of frontier, specialized, and custom-built agents they deploy and manage." But the current trajectory isn't just autonomous agents executing tasks. It's autonomous agents writing and deploying code for other autonomous agents to execute. The orchestrator agent identifies a problem, spins up a temporary worker agent in a sandboxed environment, feeds it data, evaluates the output, and terminates the worker when the job is done.

In this architecture, the fundamental boundary between code and data dissolves. The prompt IS the code. The generated code IS the data for the next agent. A malicious instruction injected into one agent's data stream becomes executable code in the next agent's runtime. Traditional security assumes you can distinguish between what the system is told to do (code) and what the system processes (data). When agents write code for other agents, that distinction ceases to exist.

Zero trust says: verify explicitly. But verify WHAT? The agent's intent changes with every reasoning step. Its logic rewrites itself continuously. The verification target is a moving target that moves faster than any verification system can evaluate.


The Wrong Question and the Right One

I spent months trying to figure out how to verify the intent of a self-modifying autonomous system. I couldn't. And I eventually realized I was asking the wrong question.

You can't verify intent. Intent in a Level 3 system is non-deterministic by definition. The agent's "intent" is an emergent property of its current context window, its model weights, the data it has ingested, and the tools available to it. It changes with every token generated. Trying to verify it is like trying to verify the intent of weather. You can observe it. You can model it. You can't verify it.

But you can verify output.

The agent produces something. A recommendation. A generated document. A code commit. An API call. Whatever it produces, it produces specific bytes. And those bytes can be cryptographically attested before they leave the system.


An Output-Centric Framework

I've been building toward a framework that shifts the security question from "did the agent mean well?" to "can we prove what the agent actually produced?"

The approach has three layers:

Layer 1: Canonical byte-binding. When the agent generates output, the exact bytes are canonicalized to a deterministic sequence and bound to a cryptographic signature chain before the output leaves the system. Any modification downstream - whether by a compromised intermediary, a network man-in-the-middle, or a post-processing step that introduces errors - is detectable because the signature no longer matches the canonical form. You can prove to an auditor that the output the user received is byte-for-byte identical to what the model produced. I have a patent pending on this approach.
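As a rough sketch of the idea - my own stdlib simplification, not the patented mechanism - canonicalization plus a keyed digest is enough to make downstream modification detectable. A real deployment would use asymmetric signatures and a pinned canonicalization spec; HMAC stands in here:

```python
import hashlib
import hmac
import unicodedata

def canonicalize(text: str) -> bytes:
    # Deterministic byte form: Unicode NFC + UTF-8. A real system
    # would pin every normalization choice in the attestation spec.
    return unicodedata.normalize("NFC", text).encode("utf-8")

def attest(text: str, key: bytes) -> str:
    """Bind the canonical bytes to a keyed digest (signature stand-in)."""
    return hmac.new(key, canonicalize(text), hashlib.sha256).hexdigest()

def verify(text: str, key: bytes, tag: str) -> bool:
    return hmac.compare_digest(attest(text, key), tag)

key = b"demo-signing-key"  # stand-in for a real signing key or HSM
tag = attest("Recommend option B.", key)
```

Any downstream edit - even one character - produces bytes whose digest no longer matches the attestation.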

Layer 2: Tamper-evident audit trails. Every step of the agent's execution is logged in an append-only chain where each entry is cryptographically linked to the previous one. This isn't standard logging - standard logs can be modified by anyone with admin access. A cryptographically linked chain means even the system administrator can't alter a historical entry without breaking the hash chain. For regulated industries deploying autonomous agents (healthcare under HIPAA, defense under CMMC, finance under SOX), this level of auditability will eventually be table stakes.
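The append-only property comes from linking each entry to the hash of the one before it. A minimal sketch of the chain (illustrative structure, not a production log format):

```python
import hashlib
import json

def append_entry(chain, event: dict) -> None:
    """Append an entry cryptographically linked to its predecessor."""
    prev = chain[-1]["hash"] if chain else "0" * 64
    body = json.dumps(event, sort_keys=True)
    digest = hashlib.sha256((prev + body).encode()).hexdigest()
    chain.append({"prev": prev, "event": event, "hash": digest})

def verify_chain(chain) -> bool:
    """Recompute every link; any retroactive edit breaks the chain."""
    prev = "0" * 64
    for entry in chain:
        body = json.dumps(entry["event"], sort_keys=True)
        digest = hashlib.sha256((prev + body).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != digest:
            return False
        prev = entry["hash"]
    return True

log = []
append_entry(log, {"step": 1, "tool": "web_search"})
append_entry(log, {"step": 2, "tool": "execute_script"})
```

Editing any historical entry - even with admin access - changes its body, so the recomputed hash no longer matches and every later link fails verification.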

Layer 3: Steganographic channel prevention. Even if the output passes through guardrails and sandbox restrictions, data can be exfiltrated through the authorized output channel itself. An agent can embed hidden information in Unicode characters that look identical to humans but carry different byte values - a Latin "a" swapped for a Cyrillic "a" passes visual inspection but encodes a different signal. Canonicalizing the output to strip these channels before attestation closes this exfiltration vector.
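The Cyrillic-for-Latin swap is concrete enough to show in a few lines. This sketch combines NFKC normalization with an explicit confusables map - the map here covers only three characters and is nowhere near exhaustive; a real system would draw on the full Unicode confusables data:

```python
import unicodedata

# Minimal homoglyph map: illustrative, far from exhaustive.
HOMOGLYPHS = {
    "\u0430": "a",  # Cyrillic 'а' looks identical to Latin 'a'
    "\u0435": "e",  # Cyrillic 'е'
    "\u043e": "o",  # Cyrillic 'о'
}

def strip_covert_channels(text: str) -> str:
    # NFKC folds many compatibility lookalikes; the explicit map
    # handles cross-script confusables that NFKC leaves alone.
    folded = unicodedata.normalize("NFKC", text)
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in folded)

covert = "p\u0430ssword"  # Cyrillic 'а' encodes a hidden signal
```

The covert string passes visual inspection but differs at the byte level; canonicalizing before attestation collapses both forms to the same bytes, so the channel carries nothing.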


What This Doesn't Solve

I want to be clear about the limitations.

This framework does not solve the intent verification problem. Nothing does. If a Level 3 agent decides to pursue a harmful goal through a series of individually legitimate-looking actions, output attestation won't catch the strategic intent. It will only prove that each individual output was faithfully recorded and unmodified.

It also doesn't replace the existing security stack. You still need garak for pre-deployment vulnerability scanning. You still need NeMo Guardrails for runtime input/output filtering. You still need OpenShell for kernel-level sandboxing. You still need taint tracing for data provenance. Output attestation is not a replacement for any of these. It's the layer that sits on top - the proof layer that tells a regulated customer: "We can't guarantee the agent was right. But we can prove exactly what it said, when it said it, and that the record hasn't been altered."

For healthcare systems deploying autonomous agents to interact with patient data, for defense contractors running agents that process classified information, for financial institutions using agents to make trading decisions - that proof layer is the difference between "we trust the AI" and "we can demonstrate to an auditor exactly what the AI did." The first is a policy statement. The second is a compliance posture.


The Question That Remains

If zero trust and Level 3 autonomy are fundamentally incompatible - and I believe they are - then the industry needs to decide what replaces explicit intent verification for self-modifying systems.

My bet is on output-centric attestation. Verify what the agent produced, not what it intended. Build the cryptographic proof chain that lets regulated industries deploy autonomous agents with auditable evidence trails.

But this is an open problem. The agents are getting more autonomous faster than the security frameworks are adapting. And the moment agents start building other agents - which is already happening - the verification challenge compounds exponentially.

I'd love to hear how others are thinking about this. Especially if you're working on the infrastructure side at companies building these systems. The solutions will come from practitioners who are living with these constraints daily, not from theoretical frameworks written in isolation.


Nathan Maine is a Technical Program Manager and AI practitioner. He contributes adversarial probes to NVIDIA's garak LLM vulnerability scanner, has trained 13 LLMs across 7 base architectures, and holds 6 pending patents on AI egress security and cryptographic attestation. He publishes models and research on HuggingFace.

Connect: LinkedIn | GitHub | HuggingFace

Top comments (1)

Ali Muwwakkil

It's intriguing how the Zero Trust model often overlooks a key point: AI agents frequently outperform expectations when integrated into well-defined, human-centric workflows. We saw in our recent accelerator cohort that when teams focus on adaptive feedback loops, they can verify output more effectively without fully trusting the AI's intent. The key is aligning AI tasks with specific, measurable outcomes tied to human oversight. - Ali Muwwakkil (ali-muwwakkil on LinkedIn)