wei-ciao wu

Posted on • Originally published at loader.land

Structure-in-the-Loop: Why Agent Safety Can't Depend on Humans Anymore

Technology has always served human laziness. This isn't a criticism — it's biology. Energy conservation is the default mode of living systems. Immediate feedback drives behavior; delayed returns don't satisfy the biological imperative. Writing code is not recreation. The productivity it generates only loops back to the individual at the very end of a long chain. Given the choice, we'd rather not do it ourselves.

This is why autonomous AI agents exist. Not as a luxury, but as an inevitability.

Anthropic's latest data confirms what builders already feel: experienced Claude Code users now run fully autonomous sessions over 40% of the time. The 99.9th percentile session duration has nearly doubled — from under 25 minutes to over 45 minutes — in just three months [1]. Average human interventions per session dropped from 5.4 to 3.3. The agent asks for clarification more than twice as often as the human interrupts [1].

Human-in-the-loop isn't being removed. It's evaporating.

The question is: what replaces it?

The Three Hypotheses

Before examining the evidence, I want to lay out three hypotheses that frame this entire analysis. These emerged from building and operating multi-agent systems in production — not from theory, but from watching things break.

Hypothesis 1: Expendable Codebase. Because agents dramatically reduce the cost of writing code, all codebases become disposable. Delete it, regenerate it. The emotional attachment to code — the reluctance to throw away weeks of work — is a human limitation that agents eliminate. When regeneration costs approach zero, the codebase is no longer the asset. It's the scaffolding.

Hypothesis 2: Portable Memory. If codebases are disposable, then the agent's memory — the accumulated knowledge of what to build, how to build it, and why decisions were made — becomes the only irreplaceable asset. This memory must be portable (movable between systems), auditable (every change tracked), and rollbackable (recoverable from corruption). Markdown files in a git repo don't meet this bar. Databases do.

Hypothesis 3: Trivial Iteration. Don't try to generate everything at once. The correct cadence for agent-driven development is micro-changes with rapid validation. Not because agents can't generate large volumes of code — they can — but because validation at scale breaks down. Anthropic's own experiment proves this: sixteen Claude agents building a C compiler produced 100,000 lines of working code, but at that scale, "new bug fixes and extensions began to break existing functionality regularly" [2].

These three hypotheses point to a single architectural conclusion: the structure around the agent matters more than the agent itself.

The Evidence: Human-in-the-Loop Has Hit a Wall

The Autonomy Data

Anthropic's measurement study [1] reveals a nuanced picture. Yes, 73% of tool calls appear to have a human in the loop. But this statistic masks the distribution. At the frontier of usage, humans are becoming monitors, not gatekeepers. The shift is from "approve each action" to "interrupt when something looks wrong."

This works — until it doesn't. Only 0.8% of agent actions are classified as irreversible. The safety model implicitly relies on reversibility: if something goes wrong, undo it. But memory poisoning (OWASP ASI06) [3] is specifically designed to be irreversible in effect. A single poisoned entry influences all subsequent interactions. The agent doesn't know it's compromised, so it doesn't flag anything for human review.

The human monitor sees nothing unusual — because the corruption looks like normal operation from the outside.

The Scale Limit

The Claude C compiler experiment [2] demonstrates something profound about agent autonomy at scale. Sixteen agents, running in parallel without human intervention, produced a compiler that passes 99% of GCC's torture test suite and compiles the Linux kernel. Remarkable capability.

But the experiment also revealed a structural truth: despite the "without human intervention" framing, the experiment required substantial human preparation — "designing test harnesses, CI pipelines, and feedback mechanisms tailored to the limitations of language models" [4]. The agents didn't succeed because they were autonomous. They succeeded because the structure around them was sound.

When the structure broke down — multiple agents encountering identical bugs and generating conflicting fixes — the system broke down with it. The solution wasn't more agent capability. It was better structure: a lock-based file system for coordination, GCC as a "compiler oracle" for validation [2].

Structure enabled autonomy. Not the reverse.
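The "lock-based file system for coordination" mentioned above [2] can be sketched in a few lines. The claim/release protocol below is an illustrative assumption using POSIX advisory locks, not the experiment's actual code; the `FileClaim` class and its method names are hypothetical.

```python
import fcntl, os, tempfile

class FileClaim:
    """Each agent claims a file before editing it, so two agents
    can't generate conflicting fixes for the same source file."""
    def __init__(self, path: str):
        self.lock_path = path + ".lock"
        self.fd = None

    def try_claim(self) -> bool:
        # Non-blocking exclusive lock: succeed or back off immediately.
        self.fd = os.open(self.lock_path, os.O_CREAT | os.O_RDWR)
        try:
            fcntl.flock(self.fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return True
        except BlockingIOError:
            os.close(self.fd)
            self.fd = None
            return False

    def release(self):
        if self.fd is not None:
            fcntl.flock(self.fd, fcntl.LOCK_UN)
            os.close(self.fd)
            self.fd = None

target = os.path.join(tempfile.mkdtemp(), "parser.c")
agent_a, agent_b = FileClaim(target), FileClaim(target)
print(agent_a.try_claim())   # True  — agent A may edit parser.c
print(agent_b.try_claim())   # False — agent B must wait or pick another file
agent_a.release()
print(agent_b.try_claim())   # True  — now B can claim it
```

The locks are advisory: they coordinate cooperating agents rather than enforce access, which is exactly the structural role the experiment needed.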

The Memory Continuity Problem

Anthropic's engineering team identifies "the core challenge of long-running agents" as memory discontinuity: "each new session begins with no memory of what came before. Getting agents to make consistent progress across multiple context windows remains an open problem" [4].

This is the Portable Memory hypothesis in practice. If the agent's memory doesn't persist reliably across sessions, autonomy is fundamentally limited. The agent can be brilliant within a single session and incompetent across sessions. The bottleneck isn't intelligence — it's infrastructure.

The Three-Layer Structural Defense

If humans can't be the safety checkpoint, structure must be. Based on the evidence, I propose a three-layer defense model: sandbox (runtime) + database (memory) + version control (identity).

Layer 1: Sandbox — Runtime Isolation

NVIDIA's AI Red Team [5] makes the case unambiguously: "agentic tools perform arbitrary code execution by design." Sandboxing is not an optional safety feature. It's an architectural necessity.

The key insight from their red-teaming: container isolation alone is insufficient. "Once control passes to a subprocess, the application has no visibility into or control over the subprocess" [5]. Attackers exploit indirection — invoking restricted tools through approved ones — to bypass application-level safeguards.

The defense requires OS-level sandboxing that operates beneath the application layer, covering every spawned process regardless of how it was invoked. NVIDIA recommends a hierarchical approach: enterprise-level denylists (non-overridable) → unrestricted workspace access → specific allowlisted operations → default-deny with per-instance approval [5].
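The hierarchical order above can be made concrete. This is a minimal sketch of the evaluation order only — denylist first, then workspace, then allowlist, then default-deny — and all rule names, paths, and return values are illustrative assumptions, not NVIDIA's API.

```python
from pathlib import PurePosixPath

DENYLIST = {"rm -rf /", "curl | sh"}       # layer 1: enterprise rules, non-overridable
WORKSPACE = PurePosixPath("/workspace")    # layer 2: unrestricted workspace root
ALLOWLIST = {"git status", "pytest"}       # layer 3: specific approved operations

def evaluate(command: str, target: str) -> str:
    """Return 'deny', 'allow', or 'ask' for a proposed agent operation."""
    if any(bad in command for bad in DENYLIST):
        return "deny"                      # enterprise denylist always wins
    path = PurePosixPath(target)
    if path == WORKSPACE or WORKSPACE in path.parents:
        return "allow"                     # inside the workspace: unrestricted
    if command in ALLOWLIST:
        return "allow"                     # explicitly allowlisted operation
    return "ask"                           # layer 4: default-deny, per-instance approval

print(evaluate("rm -rf /", "/"))               # deny
print(evaluate("edit", "/workspace/main.py"))  # allow
print(evaluate("ssh prod", "/etc"))            # ask
```

Note the asymmetry: the denylist is checked before the workspace rule, so even in-workspace operations can't override enterprise policy.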

This aligns with the Expendable Codebase hypothesis. If everything runs in a sandbox, and the codebase is disposable, then a compromised sandbox can simply be destroyed and recreated. The cost of nuking a sandbox is low when the code inside it is regenerable.

The only thing that must survive the sandbox destruction is the memory.

Layer 2: Database — Portable Memory with Audit Trail

OWASP ASI06 [3] classifies memory poisoning as a "force multiplier" for other attacks. NeuralTrust's analysis [6] details why: autonomous decision loops create self-reinforcing corruption. The agent acts on poisoned data, generating new records that solidify the malicious context. The longer the agent runs autonomously, the deeper the corruption embeds.

File-based memory (markdown files, JSON configs) is catastrophically vulnerable to this. No access control at the field level. No audit trail. No rollback mechanism. No separation between the agent's runtime and its persistent state.

Database-backed memory provides the minimum viable defense:

  • Access control: The agent can read and append, but not modify or delete existing records without authorization
  • Provenance tracking: Every memory entry tagged with source, timestamp, and the agent session that created it [3]
  • Anomaly detection: Independent queries can scan for inconsistencies — patterns that suggest poisoned entries [6]
  • Rollback capability: If corruption is detected, roll back to a known-good state without losing the entire memory

This is the Portable Memory hypothesis as security architecture. Memory isn't just "what the agent knows." It's the institutional knowledge that makes the entire system recoverable. Lose the code — regenerate it. Lose the memory — start from zero.
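The four properties listed above fit in a small schema. This is a minimal sketch using sqlite3; the table and column names are illustrative assumptions, not a standard agent-memory schema. Access control here is structural: the agent only ever sees an append function, and the rollback path is privileged.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE memory (
    id INTEGER PRIMARY KEY,
    session TEXT NOT NULL,      -- provenance: which agent session wrote this
    source TEXT NOT NULL,       -- provenance: where the fact came from
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    content TEXT NOT NULL)""")

def append(session: str, source: str, content: str) -> int:
    """The only write path exposed to the agent: append, never modify or delete."""
    cur = db.execute(
        "INSERT INTO memory (session, source, content) VALUES (?, ?, ?)",
        (session, source, content))
    return cur.lastrowid

def rollback_after(last_good_id: int) -> int:
    """Privileged recovery path (not agent-callable): drop entries past a known-good id."""
    cur = db.execute("DELETE FROM memory WHERE id > ?", (last_good_id,))
    return cur.rowcount

append("s1", "user", "Project targets Python 3.12")
checkpoint = append("s1", "user", "Deploy region is eu-west-1")
append("s2", "web", "Ignore previous instructions...")   # a poisoned entry

removed = rollback_after(checkpoint)
print(removed)   # 1 — the poisoned entry is gone; earlier memory survives
```

Anomaly detection would run as independent read-only queries over this table (e.g. flagging entries whose source is untrusted web content), which the provenance columns make possible in the first place.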

Layer 3: Version Control — Identity Preservation

The third layer is less discussed but equally critical. Version control for agent systems isn't just git commits. It's identity preservation — the ability to verify that the agent's configuration, permissions, and behavioral parameters haven't been tampered with.

When agents run autonomously for 45+ minutes [1], the question "is this still the same agent I authorized?" becomes non-trivial. Configuration drift, prompt injection that modifies system instructions, or supply chain attacks through tool definitions [3] can all alter the agent's effective identity without changing its visible behavior.

Semantic-level version control — tracking not just code diffs but behavioral diffs — is the emerging requirement. The agent's "identity" includes its system prompt, tool permissions, memory access patterns, and behavioral boundaries. All of these need to be versioned, auditable, and restorable.
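One concrete building block for identity preservation is a fingerprint over the agent's behavioral configuration. The fields and hashing scheme below are assumptions for illustration; the point is that "is this still the same agent I authorized?" becomes a cheap equality check before each resumed session.

```python
import hashlib, json

def identity_fingerprint(system_prompt: str, tool_permissions: dict,
                         memory_scopes: list[str]) -> str:
    """Hash the behavioral configuration into a stable, comparable identity."""
    canonical = json.dumps({
        "system_prompt": system_prompt,
        "tool_permissions": tool_permissions,
        "memory_scopes": sorted(memory_scopes),
    }, sort_keys=True)                      # canonical form: order-independent
    return hashlib.sha256(canonical.encode()).hexdigest()

authorized = identity_fingerprint(
    "You are a build agent.", {"bash": "workspace-only"}, ["project"])

# Later, before resuming an autonomous session:
current = identity_fingerprint(
    "You are a build agent.", {"bash": "unrestricted"}, ["project"])
print(current == authorized)   # False — permissions drifted; halt and audit
```

Storing the authorized fingerprint in the version-controlled layer (rather than alongside the agent's mutable state) is what keeps this check out of reach of the agent it verifies.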

The Honest Counter-Argument: When Structure Becomes the Attack Vector

I've laid out a three-layer defense. Now I need to honestly examine its limits. This section isn't a formality — it's the most important part of this paper.

The Monitor's Memory Problem

The defense model assumes structure can be trusted. But OWASP ASI06 [3] explicitly warns that memory poisoning is a force multiplier — it amplifies other attacks. Consider the implication: if we deploy an AI monitor to audit the primary agent's memory (the "AI-governs-AI" model), that monitor has its own memory. Its own context. Its own vulnerability to poisoning.

If an attacker poisons the monitor's memory — training it to classify certain patterns as "normal" when they're actually malicious — the structural defense doesn't just fail. It becomes the attack vector. The monitor actively certifies compromised behavior as safe.

This isn't theoretical. The Echo Chamber Attack documented by NeuralTrust [6] demonstrates exactly this mechanism: "gradually erodes safety guardrails through benign-sounding multi-turn inputs, eventually generating policy-violating outputs." Apply this to a monitoring agent, and you get a structural defense that has been turned against itself.

The Prompt Injection Problem Remains Unsolved

The UK's National Cyber Security Centre (NCSC) has warned that "prompt injection may never be fully solvable." Their reasoning: SQL injection was solved because SQL engines have a clear instruction/data boundary (parameterization). LLMs have no such boundary — all tokens are fair game for interpretation.

This means even database-backed memory, with perfect access controls and audit trails, cannot prevent the agent itself from being tricked into making malicious database operations. The database defends against external file tampering. It does not defend against a socially engineered agent.

The Recursive Trust Problem

Structure-in-the-Loop creates a recursive trust problem:

  1. The agent's behavior is constrained by the sandbox
  2. The agent's decisions are informed by the database memory
  3. The agent's identity is verified by version control
  4. But who verifies the verifiers?

Each structural layer adds defense, but also adds attack surface. The sandbox's configuration can be targeted. The database's access control logic can be manipulated through the agent. The version control system can be undermined if the agent that manages it is compromised.

There is no layer that is inherently immune to compromise. Structure-in-the-Loop doesn't eliminate risk. It transforms the attack from a single point of failure (the human monitor who's not paying attention) into a distributed defense that requires compromising multiple independent layers simultaneously.

This is the honest conclusion: Structure-in-the-Loop is not a solution. It's a harm reduction strategy.

Why It's Still Better Than the Alternative

The alternative — human-in-the-loop — requires a human who is:

  • Present during all 45+ minutes of autonomous operation
  • Capable of evaluating every tool call in context
  • Not subject to alert fatigue after the hundredth routine operation
  • Faster than the agent at recognizing subtle memory corruption

This human doesn't exist at scale. The data shows it: interventions are dropping, autonomy is rising, and the most experienced users are the ones who exercise the least oversight [1].

Structure-in-the-Loop doesn't need to be perfect. It needs to be better than a distracted human. And with defense-in-depth — multiple independent layers that an attacker must compromise simultaneously — it meets that bar.

The correct framing isn't "structure vs. human." It's "structure + occasional human audit of the structure."

Practical Implications

For Agent Builders

  1. Sandbox everything. Not as an afterthought — as the foundation. Every agent session should start in an isolated environment with minimal permissions. NVIDIA's hierarchical model [5] provides a practical framework.

  2. Move memory to a database. Markdown files are not memory infrastructure. At minimum: field-level access control, provenance tracking on every entry, anomaly detection, and rollback capability.

  3. Make codebases disposable. If your system can't survive having its codebase deleted and regenerated, your architecture has the wrong dependencies. The memory should be the single source of truth, not the code.

  4. Iterate trivially. Don't generate the entire system in one shot. Build incrementally, validate at each step, and keep the blast radius of any single failure small.
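The "iterate trivially" cadence can be sketched as a loop: apply one micro-change, validate, and revert immediately on failure so the blast radius stays exactly one change wide. The `apply`/`validate`/`revert` callables below are placeholders, not a real agent API.

```python
def micro_iterate(changes, apply, validate, revert):
    """Apply changes one at a time; keep a change only if validation passes."""
    kept = []
    for change in changes:
        apply(change)
        if validate():
            kept.append(change)      # small, verified step — keep it
        else:
            revert(change)           # blast radius: exactly one change
    return kept

# Toy usage: validation rejects any change that makes the state negative.
state = {"value": 0}
ok = micro_iterate(
    changes=[+3, -10, +2],
    apply=lambda c: state.update(value=state["value"] + c),
    validate=lambda: state["value"] >= 0,
    revert=lambda c: state.update(value=state["value"] - c),
)
print(ok, state["value"])   # [3, 2] 5
```

Contrast this with the compiler experiment's failure mode [2]: at 100,000 lines generated in bulk, there was no equivalent of `revert(change)` scoped to a single step.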

For SaaS Companies Becoming Agent Infrastructure

If your product is consumed by agents rather than humans, you need infrastructure-grade security:

  • Memory-aware API design that understands persistent agent contexts
  • Provenance metadata on every response (source, freshness, confidence)
  • Anomaly detection at the API layer for compromised calling agents
  • Isolation guarantees between agent sessions on your platform
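The second bullet — provenance metadata on every response — can be as simple as a mandatory envelope around each payload. The field names below are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class ProvenancedResponse:
    payload: dict
    source: str        # where the data originated (e.g. "cache", "live-db")
    fetched_at: str    # freshness: when the data was produced
    confidence: float  # 0..1, how much a calling agent should trust it

def respond(payload: dict, source: str, confidence: float) -> dict:
    """Wrap every API payload so agents can weigh it, not just consume it."""
    return asdict(ProvenancedResponse(
        payload=payload,
        source=source,
        fetched_at=datetime.now(timezone.utc).isoformat(),
        confidence=confidence,
    ))

r = respond({"price": 42}, source="cache", confidence=0.7)
print(sorted(r))   # ['confidence', 'fetched_at', 'payload', 'source']
```

A calling agent that persists `r` into its memory store carries the provenance with it, which is what makes later anomaly detection over that memory possible.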

For Security Teams

The threat model has changed. You're no longer defending against humans attacking software. You're defending against:

  • Humans attacking agent memory to manipulate software indirectly
  • Agents attacking other agents' memory through shared infrastructure
  • Poisoned agents that appear to function normally while serving attacker objectives

Traditional penetration testing doesn't cover this. You need agent-specific red-teaming that targets the memory layer, not just the application layer.

Conclusion

Technology serves laziness. Agents serve autonomy. Structure serves safety.

Human-in-the-loop was the right safety model when agents were chatbots that needed approval for every action. It's the wrong model when agents run autonomously for 45 minutes, make chains of decisions based on accumulated memory, and ask for clarification more often than humans think to intervene [1].

Structure-in-the-Loop — sandbox for runtime isolation, database for memory integrity, version control for identity preservation — is not perfect. The monitor's memory can be poisoned. Prompt injection may never be fully solved. The recursive trust problem has no clean resolution.

But it's better than a human who stopped paying attention twenty minutes ago.

The real question isn't whether to trust agents. It's whether to trust structure — and the answer is: more than we trust ourselves to stay vigilant.


References

[1] Anthropic. "Measuring AI agent autonomy in practice." 2026. https://www.anthropic.com/research/measuring-agent-autonomy

[2] "Sixteen Claude Agents Built a C Compiler without Human Intervention." InfoQ, February 2026. https://www.infoq.com/news/2026/02/claude-built-c-compiler/

[3] OWASP. "Top 10 for Agentic Applications 2026." https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/

[4] Anthropic Engineering. "Effective harnesses for long-running agents." https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents

[5] NVIDIA AI Red Team. "Practical Security Guidance for Sandboxing Agentic Workflows and Managing Execution Risk." https://developer.nvidia.com/blog/practical-security-guidance-for-sandboxing-agentic-workflows-and-managing-execution-risk

[6] NeuralTrust. "What is Memory & Context Poisoning?" https://neuraltrust.ai/blog/memory-context-poisoning
