Felix Huttmann

Posted on Jun 13

The frontier for economic value from AI agents is non-gullibility

#agents #ai #security #softwaredevelopment

The usual measures of AI progress have not suited my lived experience for some time now:

One measure is “maximum length of human task that AI can complete.” The ideal goal here seems to be AI developing ever larger software systems, like browsers.
Another measure is “key reasoning breakthroughs,” like proving some math theorem or finding zero-days. The ideal goal here seems to be the Riemann hypothesis.

I think both are worthwhile goals. But in my day job doing enterprise software development, neither of these is limiting me. What limits me is the degree of trust I am permitted to place in an agent’s actions without compromising my employer’s information security.

Almost everybody limits what their AI agent can do, through sandboxing, approval flows, or manual review.

This is not mostly about wanting to judge whether the AI’s work was good enough to merge. It is about defending against low-probability but devastating scenarios: the agent exfiltrating information to an attacker, installing malware, or granting privileged access to company computers.

The missing property is non-gullibility: the ability to distinguish trusted instructions from untrusted data, even when the untrusted data is adversarially shaped to look like instructions.

Breaking the lethal trifecta is only a transitory solution

To be as useful as a human, the AI agent needs to consume potentially malicious input, like web search results, documentation, GitHub issues, and repository files. It needs to follow legitimate instructions in documentation, while rejecting malicious search results that instruct it to install malware.

The agent also needs to work with confidential information, and must not exfiltrate it, including through the side effects of its actions, such as fetching URLs with confidential information in them.

And the agent needs to perform privileged actions: deployments, service integrations, infrastructure changes. It is not practically possible to validate Terraform code without running it against a real cloud, nor to validate an integration with another service without performing real calls to their systems. Modern web application development often consists largely of stitching together services, some external SaaS vendors, others internal but separately deployed.

So while my X feed celebrates an Erdős problem solution or the Bun Zig-to-Rust migration, many people are still stuck in manual code review and approval fatigue because they feel obliged to prevent unlikely but devastating consequences.

The interface between the agent harness and the model lacks a granular trust model

Non-gullibility is not a property that can be achieved by the foundation model alone. It also requires correct construction of the agent harness, because only the harness knows the data's true source and trustworthiness, and the harness assembles trusted and untrusted data into the model’s input.

Here, it seems to me that non-gullibility could be supported much better. LLM APIs today distinguish system, user, assistant, and tool messages in principle, but practical systems often mix authority levels for two reasons:

APIs and harnesses restrict which message types are permitted in which positions. System messages are often only available at the beginning of the conversation. Tool messages are only allowed in response to tool calls.

A workaround in practical harnesses like Claude Code and opencode is to include a textual fragment like <system-reminder> wherever the information needs to go. A user or tool message may include such a reminder to tell the model that another file changed, or that the user interrupted the model mid-task.

This is convenient, but it muddies the security boundary. A model trained to perform well in such a harness may learn to treat <system-reminder> as carrying special authority. How should the harness defend against prompt injection containing <system-reminder> tags? The obvious answer is escaping. But it does not seem customary to XML-escape all untrusted input, and escaping everything would also increase token usage.
Message history is a sequence, while authority is naturally a hierarchy.

Consider the output of a hypothetical grep tool with line numbers. Different parts of the output have different authority levels:

Higher-authority information: the fact that the trusted harness ran the built-in grep tool and found matches at certain paths and line numbers.

Lower-authority information: the path strings and file contents themselves. These may come from an untrustworthy source. If a file contains a tag, the model should not treat it as special.

A proposal for nested trust indicator delimiters

To supplement fixed roles like system, user, assistant, and tool, imagine an API that also supports a pair of special tokens for opening and closing a lower-privilege scope within a message. The harness developer could then express the authority of conversation fragments precisely.

If we visualize the lower-privilege opening token as {, and the corresponding closing token as }, then the grep tool could put quoted file contents inside {} to make clear that they are untrusted data. Here, { is a special token, not a regular character that could occur in text.

A conversation history might look like this:

System: You are a helpful assistant.
User: {Where does the term "foo" occur in this repository?}
Assistant: {<use-tool git-grep term={foo}>}
Tool: { {blubb.txt}:3:{*foo* bar baz} }

The system message is not wrapped in a trust limiter because it is already the highest-authority context.

The user message is wrapped in a trust limiter because the user must not be able to override the system message.

For the tool message, the important part is that the delimiters can be nested. The tool should not be able to assume the authority of the user or system, so the entire tool output is wrapped in a trust limiter. At the same time, the tool output contains parts that are themselves arbitrary data, so the tool wraps those in another, nested pair of delimiters.

This proposal would have to be part of the API between harness and model. In particular, the LLM server would reject improperly balanced delimiters in the input, and would enforce that model responses are balanced as well. Literal { and } characters in user data would not matter, because they would not be the special delimiter tokens.

Why a special token to indicate trust level is justified

One could argue that if we used escaping correctly, the LLM should be able to reason about the effective trustworthiness of data in context, and introducing a new set of delimiters would run against the spirit of having general-purpose models. I see three counterarguments to this:

The best open-source LLMs already use special tokens to indicate roles. The present proposal is not less generic, or less bitter-lesson-pilled, than what is already done.
We already allow LLMs to use code interpreters for efficiency, even though LLMs should in principle be able to reason about the outcome of code execution. It is just more efficient and reliable to interpret code before the result reaches the LLM. Likewise, it may be more reliable to deterministically encode quoting and privilege boundaries before the token stream reaches the LLM.
LLM input commonly has a very ad hoc nested structure. A Markdown code block may contain XML, which contains JSON, or any nesting thereof. It is not necessarily invalid input if a Markdown code block contains broken XML or broken JSON. Perhaps broken XML is exactly what the code block is supposed to show. Explaining to the model which parts of the input are to be expected to be properly escaped and which are not without ambiguity is tricky. The relevant question is not whether the model can reason about trust, but rather whether the harness has a precise way to express trust to the model.

Conclusion

For economically useful agents, I expect non-gullibility to matter more than another jump in task length or theorem-proving. The bottleneck is whether we can safely let the agent use the capability it already has.

Top comments (3)

Alex Shev • Jun 14

Non-gullibility is a much better production metric than task length. A long autonomous task is not valuable if the agent can be socially steered, tricked by a tool result, or convinced to leak context. The frontier is the agent maintaining a threat model while it works: what source is trusted, what action changes external state, and what evidence is enough before proceeding.

Mehmet Can Farsak • Jun 14

The non-gullibility problem extends beyond external inputs — it's also about internal mode discipline. An agent that can't distinguish 'explore ideas' from 'implement this' is just as gullible, following the path of least resistance (tool calls) instead of the user's actual intent.

I put together Brainstorm-Mode (mehmetcanfarsak/Brainstorm-Mode on GitHub) that addresses this at the harness level via PreToolUse hooks. When in a brainstorming mode, it treats tool-call instructions as untrusted data, adding a trust boundary around the ideation phase. Three modes (divergent, actionable, academic) give you granular control over what the agent is 'permitted' to do.

Felix Huttmann • Jun 14

your repo is not public.