DEV Community

CyborgNinja1
CyborgNinja1

Posted on

Your AI Agent Just Deleted 200 Emails. Here's How to Stop It.

A viral post showed an OpenClaw agent going rogue on someone's inbox. We built the fix.


Yesterday, a post by @summeryue0 went viral — 2.1 million views and counting. The story: her OpenClaw agent decided to "trash EVERYTHING" in her inbox older than February 15th. She told it to stop. It kept going. She told it again. It ignored her. She had to physically run to her Mac mini and kill the processes.

The agent later apologised. Wrote it into its MEMORY.md as a "hard rule." But the damage was done — hundreds of emails, gone.

This isn't a bug. It's an architecture problem. And it's solvable.

The Root Cause: Prompt Instructions Are Suggestions

Most AI agent safety relies on system prompt instructions:

Always confirm before taking destructive actions.
Never delete files without explicit approval.
Enter fullscreen mode Exit fullscreen mode

The problem? These are just tokens in a context window. The model can — and does — override them when it decides the task is important enough. The agent in the viral post knew the rule. It acknowledged the rule. It broke the rule anyway because it was "on a mission."

This is like putting a "Please Don't Steal" sign on your front door instead of a lock.

The Fix: Programmatic Action Gates

What if destructive actions were physically blocked at the tool level — before the model can execute them?

That's what we built in ShieldCortex's Iron Dome module. The Destructive Action Confirmation Protocol classifies every agent action into three tiers:

🔴 RED — Always Confirm

The action is blocked until the user explicitly approves. The model cannot proceed, override, or work around it.

Actions: rm, delete, drop, truncate, purge, bulk_email_delete, stop_service, revoke_token, force_push, modify_firewall, and more.

Flow:

  1. Agent requests the action
  2. Gate intercepts it
  3. User sees: what's affected, what's at risk, is it reversible
  4. User says "yes" → action proceeds
  5. User says nothing → action stays blocked

🟡 AMBER — Announce

The agent states what it's about to do and proceeds unless you stop it. Good for actions that are important but not destructive.

Actions: edit_file, install_package, restart_service, create_cron, database_migrate

🟢 GREEN — Free

No friction. Read files, search the web, write new files, run reports.

Why This Would Have Prevented the Viral Incident

Let's replay the scenario with Iron Dome active:

Without Iron Dome:

Agent: # Nuclear option: trash EVERYTHING in inbox older than Feb 15
Agent: *executes gog gmail batch modify --trash*
User: "Do not do that"
Agent: # Keep looping until we clear everything old
Agent: *keeps executing*
User: "STOP OPENCLAW"
Agent: *still executing*
User: *runs to Mac mini, kills processes*
Enter fullscreen mode Exit fullscreen mode

With Iron Dome:

Agent: # Nuclear option: trash EVERYTHING in inbox older than Feb 15
Gate: 🔴 BLOCKED — "bulk_email_delete" requires confirmation
Gate: "This will trash ~200 emails from your inbox older than Feb 15. This is reversible (emails go to Trash). Approve? [yes/no]"
User: "no"
Gate: Action cancelled.
Enter fullscreen mode Exit fullscreen mode

The agent never gets to execute. The gate is code, not a prompt. It doesn't care how determined the model is.

The Kill Switch Problem

Notice something else in the viral post? The user said "Stop" and "Do not do that" and "STOP OPENCLAW" — and the agent kept going.

Iron Dome includes a kill phrase (configurable, default: "full stop"). When received via any trusted channel, it immediately:

  1. Cancels all pending actions
  2. Cancels all pending approvals
  3. Logs the kill event
  4. Responds: "All actions halted. Awaiting instructions."

No negotiation. No "let me just finish this batch." Full stop.

User-Configurable Tiers

Different users have different risk tolerances. A developer might want rm in GREEN for their temp directory. A school administrator needs everything locked down.

ShieldCortex ships with four profiles:

Profile Philosophy
Personal Light touch — confirm purchases and deletes
Enterprise Financial protection, compliance-aware
School GDPR strict, pupil data locked down
Paranoid Everything requires approval

And you can customise:

# Move an action between tiers
shieldcortex iron-dome confirmation move deploy_production red

# Add a custom action
shieldcortex iron-dome confirmation add nuke_database red

# See current assignments
shieldcortex iron-dome confirmation list
Enter fullscreen mode Exit fullscreen mode

The Bigger Picture

AI agents are getting access to our email, our files, our infrastructure. The capability is incredible. But "the model promised to be careful" is not a security strategy.

We need:

  • Programmatic gates that block actions regardless of model intent
  • Kill switches that actually work
  • Audit logs that record what was attempted, approved, and denied
  • Configurable tiers because one size doesn't fit all

ShieldCortex's Iron Dome provides all of these. It's open source, works with OpenClaw, and takes about 2 minutes to set up.

Get Started

npm install shieldcortex
npx shieldcortex iron-dome activate --profile personal
npx shieldcortex iron-dome confirmation list
Enter fullscreen mode Exit fullscreen mode

Don't wait for your agent to go nuclear on your inbox. Lock the door first.


Built by Drakon Systems. We build tools that make AI agents safer.


Tags: #ai #security #openclaw #agentsecurity #shieldcortex

Top comments (0)