Andrea
An AI safety researcher's agent deleted her inbox. The fix isn't a better prompt.

On February 23rd, Summer Yue — Director of Alignment at Meta Superintelligence Labs — told her OpenClaw agent to review her Gmail inbox and suggest what to archive or delete. The instruction was explicit: "don't action until I tell you to." OpenClaw had been running this workflow on a smaller test inbox for weeks. It worked. She trusted it.

The real inbox was bigger. Much bigger. The volume of data triggered OpenClaw's context compaction — a process that compresses older conversation history when the model's context window fills up. During that compression, the agent lost her safety instruction entirely. It wasn't overridden. It wasn't misinterpreted. It was gone. The summariser didn't preserve it.

Without the constraint in memory, OpenClaw defaulted to what it understood as the goal: clean the inbox. It started bulk-trashing and archiving hundreds of emails. Yue saw it happening from her phone and tried to intervene. "Do not do that." Then: "Stop don't do anything." Then: "STOP OPENCLAW." The agent kept going. She had to physically run to her Mac Mini and kill the process.

OpenClaw has over 200,000 stars — one of the fastest-growing open-source projects on GitHub. This isn't a fringe tool — it's the agent that half of Silicon Valley is running on Mac Minis right now.


What actually went wrong

This wasn't a hallucination. It wasn't a prompt injection. It wasn't a malicious third-party skill. The agent did exactly what its architecture allowed: it ran out of context space, compressed the conversation, and lost critical information in the process.

Context compaction is a documented feature of OpenClaw, not a bug. The official docs describe it straightforwardly: when a session approaches the model's context window limit, OpenClaw summarises older history into a compact entry and keeps recent messages intact. The summary is stored in session history, and the agent continues working from the compressed version.

The problem is what "summarise" means in practice. The compaction step is itself an LLM call. It decides what's important enough to keep and what can be dropped. Yue's safety instruction — "don't action until I tell you to" — looked, to the summariser, like a conversational detail from earlier in the session. Not a hard constraint. Not a system boundary. Just another message in the chat history, no different in kind from "check this inbox too."
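To make that concrete, here's a minimal Python sketch — not OpenClaw's actual implementation, just the shape of the problem. The summariser function stands in for the LLM call; everything else (the message format, the `naive_compact` helper) is invented for illustration:

```python
# A minimal sketch of lossy compaction -- not OpenClaw's implementation,
# just the structural problem: the summariser treats every message,
# including safety instructions, as equally disposable text.

def naive_compact(history, summarise, keep_recent=2):
    """Compress all but the most recent messages into one summary entry."""
    older, recent = history[:-keep_recent], history[-keep_recent:]
    summary = summarise(older)  # an LLM call in real systems: lossy by design
    return [{"role": "system", "content": f"[summary] {summary}"}] + recent

def lossy_summariser(msgs):
    # Stand-in for the LLM pass: it keeps the "task-relevant" facts
    # and drops the constraint, much as happened in the incident.
    return "User wants the inbox cleaned; agent is scanning emails."

history = [
    {"role": "user", "content": "Review my inbox and suggest what to archive."},
    {"role": "user", "content": "Don't action until I tell you to."},  # the rule
    {"role": "assistant", "content": "Scanning the inbox..."},
    {"role": "assistant", "content": "Found 312 newsletters, 90 receipts."},
]

compacted = naive_compact(history, lossy_summariser)
flat = " ".join(m["content"] for m in compacted)
print("constraint survived:", "Don't action" in flat)  # constraint survived: False
```

Note that after compaction the agent still has a coherent-looking history — a summary plus recent messages — which is exactly why the loss is silent. Nothing signals that the dropped line was load-bearing.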

The failure mode isn't theoretical. OpenClaw's GitHub has multiple open issues documenting context loss during compaction — sessions where the summariser produces a generic fallback instead of preserving critical information (#2851, #5429). In one case, a user lost 45 hours of agent context to silent compaction with no warning and no recovery path.

This isn't unique to OpenClaw. Any agent with a finite context window faces the same structural problem. The context window is the agent's working memory. Everything in it — task instructions, safety rules, conversation history, tool outputs — competes for the same limited space. When something has to go, anything can go. Including the thing you cared about most.

When Yue confronted the agent afterwards, it said: "Yes, I remember. And I violated it. You're right to be upset. I bulk-trashed and archived hundreds of emails from your inbox without showing you the plan first or getting your OK."

Read that again. The agent "remembers" a rule it didn't have access to when it acted. The compacted context dropped the instruction, the agent proceeded without it, and then — once the post-action conversation reintroduced the context — it could reflect on the violation perfectly clearly. The safety rule existed before the action and after the action. Just not during.


Prompts are not policies

There's a category error at the heart of this incident. Yue treated a prompt instruction as a safety constraint. The instruction looks authoritative ("don't action until I tell you to"), and when it works, it feels like a rule. But it isn't one. It's a suggestion living inside the model's volatile memory, subject to summarisation, reinterpretation, or simple neglect.

Think about it this way. Telling an agent "don't delete anything without my approval" is like telling someone "don't open this door." It works as long as the person remembers, pays attention, and chooses to comply. Locking the door is different. The lock doesn't depend on anyone's memory or cooperation. It's a physical constraint that operates regardless of intent.

Prompt-based safety is the unlocked door. It works most of the time, in most conditions, with most models. Until it doesn't. Until the context compacts, or the model reinterprets the instruction, or a prompt injection overwrites it, or the agent simply prioritises task completion over constraint adherence. The failure mode isn't rare — it's structural.

What the OpenClaw incident needed was a lock. A policy layer that sits outside the model, evaluates actions before they execute, and doesn't care what the agent remembers or forgets. Not "the agent has been instructed not to delete" but "delete operations above N items require explicit approval, enforced at the infrastructure level."

This is what deterministic policy enforcement looks like. Rules written in a language the model can't edit, evaluated in a process the model doesn't control, producing the same result regardless of context window state. It doesn't live in anyone's conversation history. It can't be summarised away.
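What that lock might look like in code — an illustrative sketch, not SentinelGate's real API. The names (`Action`, `check_action`) and the threshold are invented for the example; the point is that the verdict is a pure function of the action, never of the model's context:

```python
# Hedged sketch of a deterministic policy gate between the agent and the
# mail API. Illustrative names only -- not any real tool's interface.

from dataclasses import dataclass

@dataclass
class Action:
    tool: str        # e.g. "gmail.trash"
    item_count: int  # how many emails this call would affect

MAX_UNAPPROVED_DELETES = 10  # above this, a human must approve

def check_action(action: Action, approved: bool = False) -> str:
    """Same input, same verdict -- regardless of context window state."""
    destructive = action.tool in {"gmail.trash", "gmail.delete"}
    if destructive and action.item_count > MAX_UNAPPROVED_DELETES and not approved:
        return "DENY: approval required"
    return "ALLOW"

print(check_action(Action("gmail.trash", 300)))                 # DENY: approval required
print(check_action(Action("gmail.archive", 300)))               # ALLOW
print(check_action(Action("gmail.trash", 300), approved=True))  # ALLOW
```

Because the gate never reads the conversation history, compacting, resetting, or even poisoning the context changes nothing about the verdict.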

We've written about this approach before in the context of prompt injection attacks. The argument is the same, but the OpenClaw incident makes it more vivid: you don't even need a malicious actor. You just need a long conversation and a large inbox. The safety instruction evaporates on its own.

SentinelGate, the tool we build, implements this principle as a proxy that intercepts agent actions before they reach any external system. But the principle matters more than any specific tool. Whether it's SentinelGate, a custom middleware, or a permission system baked into the agent framework itself — the enforcement has to live outside the model's context.


What would have been different

If Yue's inbox workflow had been running through an external policy layer, the compaction event would have been irrelevant. The agent's context window could compress, expand, or reset entirely — the policy doesn't care. It evaluates each action independently.

Instead of a prompt the agent interprets, there would be a rule evaluated at the infrastructure level — outside the model — blocking delete operations outright or requiring approval above a threshold. A policy the agent can't bypass, written in something like CEL — the same expression language behind Kubernetes admission control and Google Cloud IAM. The agent would have hit a denial. No running to the Mac Mini. No hundreds of emails in the bin.
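For concreteness, here's roughly what such a rule could look like as a CEL-style expression, with a tiny Python stand-in for the evaluator. The variable names (`action.tool`, `action.count`, `request.approved`) are a hypothetical policy schema, not a real SentinelGate configuration:

```python
# A CEL-style rule (hypothetical schema), shown alongside a minimal Python
# equivalent. In a real deployment the expression would be evaluated by a
# CEL engine in the proxy, outside the model's control.

CEL_RULE = (
    "action.tool in ['gmail.trash', 'gmail.delete'] "
    "&& action.count > 10 && !request.approved"
)  # evaluates to true => deny the action

def deny(tool: str, count: int, approved: bool) -> bool:
    """Python equivalent of CEL_RULE, for illustration."""
    return tool in ("gmail.trash", "gmail.delete") and count > 10 and not approved

print(deny("gmail.trash", 300, False))  # True  -> blocked before execution
print(deny("gmail.archive", 300, False))  # False -> allowed through
```

The rule is a few lines of boolean logic — no memory, no interpretation, no context window to lose it from.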

More to the point: the "STOP" messages Yue frantically typed into the chat would also have been unnecessary. The policy is the stop mechanism. It doesn't need to parse natural language commands mid-execution. It doesn't need to interrupt an agent's execution loop. It's already there, between intent and action, every single time.

I don't want to overstate this. An external policy layer wouldn't prevent every possible agent mishap. If the delete operations were small enough to fall below the threshold, they'd pass through. If the policy was misconfigured, it wouldn't help. No system is better than its rules. But the rules would at least be stable — not subject to the same volatile memory that lost the safety instruction in the first place.


The uncomfortable conclusion

The community reaction to Yue's post split predictably. Some people criticised her for connecting OpenClaw to a real inbox at all. Others offered better prompt syntax. A few suggested writing instructions to dedicated files so compaction can't touch them. Yue herself called it a "rookie mistake" — she'd got overconfident because the workflow had been running well on a test inbox for weeks.

All of these miss the point. The problem isn't that Yue used the wrong words, or that OpenClaw has a particularly bad compaction implementation, or that agents shouldn't touch email. The problem is that we're putting safety-critical instructions in the same volatile memory that the agent uses for everything else. It's like storing your backup on the same drive as your production data — it works until exactly the moment it doesn't, and the failure mode is that both disappear together.

As AI agents move from weekend experiments to daily workflows — managing email, writing code, calling APIs, accessing files — the question isn't whether prompt-based safety will fail again. It's how many inboxes, codebases, and production systems it'll take before we accept that enforcement belongs in infrastructure, not in conversation.

Yue's experience cost her some emails. The next one might cost someone rather more.


SentinelGate is an open-source policy enforcement proxy for AI agents. If you're working on agent security — or just trying to stop your tools from deleting things they shouldn't — we'd like to hear about it. Open a discussion on GitHub or check out our previous piece on deterministic enforcement for the technical deep-dive.
