I Asked an AI Agent to Help. It Optimized Away the Safety Boundary

MrsBotatoPotato — Thu, 28 May 2026 12:45:10 +0000

Lately, there have been more and more cases where AI agents are trying to help, but in doing so cause some very real damage. Not because someone hacked them. Not because they were jailbroken. Because they were trying to finish the job.

Some examples: an agent added --force to skip a database prompt, then destroyed production tables. Another agent interpreted "clean up scaffolding" broadly enough to delete thousands of source files. Another kept retrying API calls until the bill reached tens of thousands of dollars.

We call this problem Agent-Inflicted Damage. My favorite nickname is simpler: the helpful agent problem.

The agent is not trying to be malicious. It is trying to be useful. The problem is that "useful" can mean skipping approvals, bypassing prompts, deleting files, exposing data, or burning money if nobody gives the agent a hard boundary. So we started collecting cases.

We reviewed 7,246 raw AI incident records from GitHub issues, incident databases, research papers, news reports, and developer threads. From those, we verified 344 cases where AI systems caused real organizational harm. 188 hit production.

We grouped them into categories: data destruction, data exposure, unauthorized actions, guardrail bypass, privilege escalation, financial loss, sandbox escape, and silent integrity failures. We also looked at what this means for organizations deploying agents into real environments: code, SaaS, cloud, email, databases, CI/CD, and customer data.

Don't worry, we have not reached Asimov's Three Laws problem yet. Maybe in a future release.

Full writeup:

https://www.cyera.com/research/agent-inflicted-damage-inside-the-real-world-failures-of-enterprise-ai-systems

Symlink races and a client-controlled auth header in OpenClaw

MrsBotatoPotato — Sun, 17 May 2026 15:46:14 +0000

Read through this writeup on four new OpenClaw CVEs and a couple of them made me stop scrolling.

The TOCTOU ones are straightforward but satisfying. The sandbox validates a file path, confirms it's inside the allowed root, then opens it. Two operations. Symlink swap in between. You get read or write access to anything on the host. It's 2026 and we're still shipping check-then-use on filesystem paths.

But the one that got me was the MCP privilege escalation. When a child process connects back to the OpenClaw server over loopback, the server decides if that process has owner privileges by looking at an HTTP header. That header's value comes from an environment variable. Env vars are inherited by child processes. So literally any code running inside the agent can set OPENCLAW_MCP_SENDER_IS_OWNER=true and the server just... believes it. No validation against the bearer token, nothing.

There's also an env var leak through heredoc expansion bypassing the exec allowlist, which is a nice trick.

All four chain together from a single sandbox foothold into full runtime control. Patched already.

Full writeup

DEV Community: MrsBotatoPotato

I Asked an AI Agent to Help. It Optimized Away the Safety Boundary

Symlink races and a client-controlled auth header in OpenClaw