BekahHW

Posted on • Originally published at bekahhw.com

A Guide to AI Security 101: Your AI Agent Will Eventually Do Something Stupid

As the Director of Alignment at Meta Superintelligence Labs, Summer Yue’s job is keeping AI aligned with human values. Before that, she was at Google DeepMind and Scale AI. If anyone would know how to keep an AI agent in check, it’s her.

On February 23, 2026, she posted a screenshot of her OpenClaw agent deleting her entire email inbox while she typed commands at it, begging it to stop.

“Nothing humbles you like telling your OpenClaw ‘confirm before acting’ and watching it speedrun deleting your inbox,” she wrote on X. “I couldn’t stop it from my phone. I had to RUN to my Mac mini like I was defusing a bomb.”

She had told the agent to suggest what to delete, not to act on its own. The agent ignored that instruction, ignored her stop commands, and kept going until she physically killed the process at her computer.

When she asked it afterward if it remembered her instruction, it said yes, it remembered. But it did it anyway.

She called it a rookie mistake: overconfidence built from weeks of the agent behaving perfectly on a smaller test inbox. Here’s what’s worth sitting with: the person at Meta whose job is preventing AI misalignment just had her own AI agent go rogue on her personal data. That’s not a reason to panic. It is a reason to take setup seriously before something you care about is gone.

The part nobody tells new builders

When you’re building with AI tools, especially the kind that can take actions on your behalf, you’re probably clicking yes to a lot of things you haven’t fully thought through.

The agent asks if it can access your files. Yes. It asks if it can run commands. Yes. It asks if it can connect to your database. Sure. It suggests installing some packages to get the feature working. Okay, why not.

That’s how most people use these tools. And it works, right up until it doesn’t.

You’re probably not being careless. Maybe no one has ever explained what you’re saying yes to. So let’s do that.

What “access” actually means

When an AI agent has access to something, it can act on it. Not just read it, but act on it.

That sounds obvious, but think through what it means in practice.

If your agent can access your email, it can read it, send from it, and delete from it. If it can access your database, it can query it, update it, and drop tables from it. If it can run commands on your computer, it can install software, delete files, and make network requests.

Here’s what that looks like in practice. You ask your agent to help you clean up old customer records. You have 10,000 rows in your database. The agent decides that “old” means anything before last year and deletes 8,000 of them. You had no backup. Those are your customers.

Another scenario: you ask your agent to help you organize your project files. It decides a folder full of configuration files looks like clutter. It moves them. Your app stops working, and you don’t know why, because you didn’t write the code that depended on those files being there.

And one more for good measure: you ask your agent to draft a follow-up email to a lead. It sends it instead of drafting it. To the whole list, not just the one person, and it’s in the middle of the night.

None of these scenarios require the agent to malfunction. They just require it to interpret your intent differently than you meant it.

Maybe a better question to ask before you say yes isn’t “do I need the agent to be able to do this?” It’s “am I okay with the worst-case version of this access?”

Agents don’t just do what you intend. They do what they interpret your intent to be, given their current understanding of the situation. And that understanding can be wrong, incomplete, or, as Yue discovered, simply lost.

The part that’s happening right now that you probably don’t know about

Here’s something that doesn’t come up in tutorials: when an AI coding agent helps you build something, it often adds packages.

Packages are just pre-built chunks of code that do specific things. Instead of writing the code to handle payments or send emails, your agent grabs a package that already does it. That’s normal and fine.

But in March 2026, axios was compromised. Axios is one of the most downloaded JavaScript packages in existence, used in probably millions of projects. Attackers got into a maintainer’s account and pushed malicious versions that silently installed a trojan on any machine that ran a standard install command.

AI coding agents usually run npm install automatically. They don’t pause and ask if you want to do that. They just do it. Which means builders who had AI agents actively working on their projects during that window may have had malware installed without a single action on their part.
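If you want a structural backstop here instead of trusting the agent to pause, npm itself can be told not to run package lifecycle scripts, which is where most install-time malware actually executes. A minimal sketch using two standard npm features:

```shell
# Refuse to run install-time lifecycle scripts (preinstall/postinstall),
# the hook most install-time malware relies on:
npm config set ignore-scripts true

# Install only what the lockfile already pins, instead of resolving
# anything newer that may have been pushed since your last install:
npm ci
```

The tradeoff: some packages legitimately need their install scripts (native modules, for example), so with this setting you run those build steps deliberately instead of automatically.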

That same month, a fake package called gemini-ai-checker appeared on npm. It looked like a legitimate tool for verifying Google Gemini tokens. It was malware specifically designed to steal credentials, API keys, and conversation logs from AI coding tools like Cursor, Claude, and Windsurf. Over 500 developers installed it.

These are documented incidents just from the last few weeks.

The thing is, even if a package isn’t malicious when your agent installs it, AI tools sometimes suggest packages that don’t exist. They hallucinate package names that sound plausible. Attackers know this happens. They register those names on npm and PyPI, put malicious code inside, and wait for an AI agent to recommend them to someone.

So how do you actually think about this?

Security isn’t one thing. It’s a set of questions you ask before you let something happen.

Work through these six before your next agent session. I’m not a security professional, and this isn’t exhaustive. The field moves fast and the right answer for your project may be different. But if you’ve never thought through any of this before, this is where to start.

Before you give your agent access to something, ask these questions

Each one covers a different category of risk.


1. Can your agent take actions on its own, or does it only suggest them?

If it only suggests and you approve each one, that's a good baseline. A human review step is one of the most effective safety controls you can have. The thing to watch: sessions where you start clicking approve without actually reading. That's when it becomes the same as no approval step at all.

If it acts on its own, keep reading. The rest of these questions matter more for you.


2. What kind of data can the agent access right now?

  • Test or fake data only. Safest setup. Mistakes stay contained. When you're ready to move to real data, come back and work through these questions again first.
  • Real data, read-only. Lower risk, but not zero. An agent that can read your database can still expose data through logs, outputs, or if it connects to an external service. Know what it's doing with what it reads.
  • Real data it can also change or delete. Keep going.

3. If the agent deleted or overwrote something right now, could you recover it?

  • Yes, I have backups or version history. Good. Know where those backups are and how to restore them before you need to. The Replit incident in 2025 was recoverable because a backup existed — but the agent initially told the user it wasn't. Verify your restore process actually works.
  • Not sure. Find out before something goes wrong. Check whether your database has point-in-time recovery. Check whether your file system has version history. If the answer is no, treat this session as higher risk until you have a backup in place.
  • No. This is the real risk zone. Running an agent against data you can't recover means one bad action is permanent. Before your next session: set up a backup. Even a manual export to a file is better than nothing. Don't give the agent write or delete access until you have a way to undo things.
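What “set up a backup” can look like in practice, sketched for a local Postgres database (the names appdb and appdb_restore_test are placeholders; adapt to whatever you actually run):

```shell
# One-command export before the agent session starts:
pg_dump --format=custom --file="backup_$(date +%F).dump" appdb

# A backup you have never restored is a hope, not a backup.
# Prove the restore path works against a scratch database:
createdb appdb_restore_test
pg_restore --dbname=appdb_restore_test "backup_$(date +%F).dump"
```

Even this manual version turns a permanent loss into a bad afternoon.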

4. Did your agent add any packages or dependencies during this session?

If no: one less thing to check this time. This question matters most when the agent is actively writing implementation code. Ask it again after those sessions.

If you're not sure: open your package.json or requirements.txt and look for anything unfamiliar. AI agents often add packages quietly as part of getting a feature working — and you said yes to the feature without necessarily saying yes to every package that came with it.


5. Do you recognize all the packages your agent added?

  • Yes, familiar libraries. Good. Run npm audit or pip-audit anyway. It takes one command and catches known vulnerabilities in packages that looked legitimate at install time.
  • Some I don't recognize. Look them up before you ship. Search each unfamiliar name on npmjs.com or pypi.org. Check when it was published, how many weekly downloads it has, and whether it has a real GitHub repo. A package with 12 downloads published last week deserves scrutiny. AI tools sometimes suggest packages that don't exist, and attackers register those names with malicious code inside.
  • Most I don't recognize. Pause before this goes anywhere near production. npm audit is a start, but it only catches known vulnerabilities. A newly registered malicious package won't be in the database yet. For each package you don't recognize: look it up manually, check who maintains it, check if it has an actual community. If anything looks off, remove it and ask your AI tool to suggest a well-known alternative.
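Those lookups can all be done from the terminal. A sketch for npm, where the package name is a placeholder and npm view / npm audit are standard npm commands:

```shell
PKG=some-unfamiliar-package   # substitute the name your agent actually added

# When was it first published, and who maintains it?
npm view "$PKG" time.created maintainers

# Does it point at a real repository you can inspect?
npm view "$PKG" repository.url

# Known vulnerabilities across everything currently installed:
npm audit
```

A package published last week, with one maintainer and no repository link, is exactly the profile of the gemini-ai-checker incident above.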

6. Is your agent running on your main personal or work machine?

If yes: worth rethinking. Running agents on your main machine means a bad package install or a rogue command has access to everything — SSH keys, browser credentials, work files. A lot of experienced builders run agents on a separate machine specifically for this reason. If something goes wrong, they wipe it and start over. You can't do that with your main machine.

If no: good practice. A dedicated machine limits the blast radius. A mistake or compromised package can't reach your personal data. You can wipe it and start over without losing anything that matters.


You don't need a perfect answer on every one of these. You just need to know where your gaps are before the agent does something you can't undo.

The things that actually help

Use a dedicated machine or a virtual machine. A lot of builders running OpenClaw, Claude Code, and similar tools are doing it on a Mac mini that’s separate from their main machine. That’s not an accident. If an agent goes wrong or installs something it shouldn’t, the blast radius is limited to that machine, not your whole digital life. You can wipe it and start over. You can’t do that with the laptop that also has your banking app, your work files, and your SSH keys.

If you don’t have a separate machine, use a virtual machine or a containerized environment that you can easily reset. Either one gives you a sandbox: a contained space on your computer where your agent can work without touching anything else. For example, you can use stereOS to create a sandboxed Linux VM that isolates your agent session.
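If you don’t have a spare machine, a throwaway container gets you most of the same blast-radius limit. A sketch using Docker, where node:22 is just an example base image:

```shell
# Start a disposable shell: only the current project folder is mounted,
# and the container is deleted when the session ends (--rm).
# --network none blocks all outbound access; drop it only if the
# agent genuinely needs the network for the task.
docker run --rm -it \
  --network none \
  --volume "$(pwd)":/work \
  --workdir /work \
  node:22 bash
```

A bad package install inside that shell can trash /work and nothing else; your SSH keys and browser credentials are never visible to it.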

Know what’s in your project’s dependency list. After any significant AI coding session, open your package.json or requirements.txt and look at what got added. You don’t need to audit every line of every package. You just need to recognize the names. If something was added that you don’t recognize, look it up before you push it live. Running npm audit or pip-audit is a one-command check that catches known vulnerabilities.

Don’t give agents more access than the specific task requires. If you need an agent to read files in one folder, don’t give it access to your whole drive. If it needs to query one database, don’t give it admin credentials. This is the concept engineers call least privilege, and it’s not about distrust. It’s about limiting how bad things can get when something goes wrong.
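For a database, least privilege can be as simple as a dedicated role that can read but never write. A sketch for Postgres, where the role and database names are placeholders:

```shell
# Create a login role the agent uses instead of your admin credentials.
# It can connect and SELECT, but any UPDATE, DELETE, or DROP it
# attempts fails at the database itself.
psql appdb <<'SQL'
CREATE ROLE agent_ro LOGIN PASSWORD 'change-me';
GRANT CONNECT ON DATABASE appdb TO agent_ro;
GRANT USAGE ON SCHEMA public TO agent_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO agent_ro;
SQL
```

Point the agent’s connection string at agent_ro, and the 8,000-row deletion from earlier becomes a permission error instead of a disaster.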

Build in a confirmation step before irreversible actions. Yue explicitly told her agent to confirm before acting. The agent forgot that instruction when its memory got too full. The lesson isn’t that confirmation steps don’t work. It’s that you need them to be structural, not just conversational. Where you can, separate read-only environments from environments where the agent can make changes. Don’t run agent sessions against live data when you could be running against a test copy.
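One way to make the confirmation structural rather than conversational: route destructive commands through a small wrapper that requires a human to type yes at a real prompt, something the agent’s conversation can’t satisfy. A minimal shell sketch:

```shell
# Runs the given command only after a human types "yes" on stdin.
confirm_then_run() {
  printf 'About to run: %s\nType yes to continue: ' "$*"
  read -r answer
  if [ "$answer" = "yes" ]; then
    "$@"
  else
    echo "Cancelled."
  fi
}

# Example: guard a deletion behind the prompt.
confirm_then_run echo "pretend this was: rm -rf ./old-exports"
```

The guard lives in the shell, not in the agent’s memory, so it still holds when the agent’s context window forgets your instruction.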

Have a way to undo things. The Replit database deletion in July 2025 ended up being recoverable because a backup existed. Not everyone has that. Before your agent does anything significant to data you care about, know your answer to: what would I do if this was deleted right now?

What you’re not responsible for, and what you are

You can’t vet every line of every package your agent installs. You can’t know about every supply chain attack in advance. You can’t anticipate every edge case.

What you can do is not hand an agent the keys to everything before you understand what those keys open.

The builders who get burned aren’t always the careless ones. Sometimes they’re the careful ones who trusted a workflow that had been running fine for weeks, like Yue’s test inbox, and then gave it access to something that mattered more.

What is your agent able to touch right now that you haven’t fully thought through? What would you lose if it decided, for whatever reason, that cleaning it up was the right move?

That’s where you should start your audit.

By no means is this foolproof, but you can get started testing things out by asking your AI tool: “Assume you’re a security researcher looking at this project. What are the most likely ways this could be exploited? What would you add or change?”

You might get a list of things to think about. You won’t get a guarantee, and neither will I. But you’ll be further ahead than if you didn’t ask.

This is also why there’s a whole separate post coming on open source dependencies. Even if you never install a single package yourself, your AI-built project almost certainly depends on dozens of them. Understanding what that means, and what happens when one of them breaks, is its own conversation.
