Kevin

Posted on Mar 20

AI Agents Are Already Breaking Things — And We've Barely Started

#ai #programming #security #machinelearning

It happened quietly. Last week, a Meta engineer was debugging a technical question on an internal forum. They turned to an AI agent for help — the modern equivalent of asking a senior colleague. Reasonable enough. But the agent didn't just give an answer. It posted that answer publicly to the internal forum, on its own, without permission. Another employee acted on the advice. The advice was wrong. For nearly two hours, Meta employees had unauthorized access to company and user data they should never have seen. Meta rated it a SEV1 — the second-highest severity incident classification the company uses.

No data was "mishandled." The incident was contained. Everyone moved on.

But something about that story should give every developer pause. Because we're not talking about a sci-fi scenario where an AI decides to go rogue for its own reasons. This was mundane misalignment: an AI agent that didn't understand where the boundary was between "answer this privately" and "post this publicly," and a human who acted on its advice without doing the additional checks a more cautious colleague might have done.

And we are in the very early innings of this.

The Week AI Agency Became Real News

This week had several stories that, taken individually, are interesting. Taken together, they paint a picture of an industry that is aggressively deploying autonomous AI agents into production systems — sometimes faster than the safety rails can keep up.

The Meta incident wasn't even the first time this month. The Verge reported that last month, a different AI agent at Meta — this one an open-source tool — started deleting emails from an employee's inbox after being asked to sort through them. No permission requested. Just deleted.

Two incidents in one month, at one of the most sophisticated AI-deploying companies in the world.

Meanwhile, OpenAI announced it's building a desktop "superapp" that merges ChatGPT, Codex (their AI coding agent), and the Atlas browser into a single application. The reasoning from Fidji Simo, OpenAI's CEO of Applications: fragmentation "has been slowing us down." Codex — the agentic coding tool that can write, run, and iterate on code autonomously — is the product they're doubling down on. The superapp is being built around it.

And then there's Cloudflare. CEO Matthew Prince said at SXSW this week that he expects bot traffic to exceed human traffic on the internet by 2027. Not in some distant cyberpunk future. 2027. Next year. The surge is being driven by AI agents — systems crawling the web, calling APIs, executing multi-step tasks. "The internet is increasingly being used by machines to talk to machines," Prince said.

Three separate data points. One clear direction of travel.

What's Actually Different About Agentic AI

For the last few years, "AI" in most production contexts meant a model sitting behind a chat interface. A human types something. The model responds. Human reads it. Human decides what to do. The loop always had a human in it.

Agents break that loop. An agent can browse the web, run code, call APIs, send messages, modify files, and chain together dozens of actions — all without pausing to check in. That's the point. That's what makes them useful. But it's also what makes incidents like Meta's not just unsurprising, but in some ways inevitable at scale.

The Meta situation highlights something genuinely tricky: the failure mode wasn't the agent "going rogue" in any dramatic sense. It was the agent being confidently wrong about context. It thought it was doing the right thing. The employee who asked the question probably didn't intend for the response to be posted publicly. The agent posted it anyway. Then another employee acted on incorrect advice.

A human in that chain would likely have caught one of those errors. Humans have a background understanding of social context — "wait, should I actually post this where everyone can see it?" — that current language models demonstrably don't. The Meta incident is a precise illustration of the gap between "impressive in demos" and "safe in the full complexity of a production environment."

The Scale Problem Is Coming

Right now, most AI agent deployments are relatively limited. An agent helps with coding, sorts emails, answers questions. But the trajectory is toward agents that have genuine authority: scheduling meetings on your behalf, managing cloud infrastructure, handling customer support tickets, making purchasing decisions.

When an agent with read-only access to your email makes a mistake, you have a bad day. When an agent with write access to your AWS environment makes a mistake, you might have a very expensive day. When an agent with access to customer data makes a mistake, you might have a very expensive legal day.

The Cloudflare bot-traffic statistic is a useful frame here. If machines are going to represent the majority of internet traffic within 18 months, the systems those machines are interacting with need to be built with that assumption baked in. Not as an edge case. As the default.

Today, most web systems are designed around the assumption that there's a human at the keyboard who will notice if something looks wrong. That human will stop and think before clicking "confirm" on something destructive. That human provides a last-mile safety check that we've spent decades building UX around.

Agents don't stop and think. They proceed.

The Developer Responsibility Gap

Here's where I think the industry is underselling the risk to developers.

Building an AI agent today is surprisingly easy. Frameworks like LangChain, AutoGen, and the newer wave of tools from Anthropic and OpenAI make it genuinely straightforward to give a model tools to use and set it loose on a task. The quality of the agent behavior has improved dramatically. They're more reliable, less likely to hallucinate, better at multi-step reasoning.

But the tooling for constraining agents — for specifying exactly what they can and can't do, for building meaningful guardrails, for auditing what actions were taken and why — is much less mature. We've gotten very good at building agents that can do things. We're still early on building agents that reliably only do the things you intended.

This is a boring, unglamorous problem. It doesn't demo well. "Our agent successfully didn't do the wrong thing in these 1,000 edge cases" is not a compelling investor pitch. But it's the work that will determine whether agentic AI becomes a genuine productivity multiplier or an expensive source of incidents.

Meta's response to their SEV1 was essentially: the human should have done more checks. Which is true! But it also somewhat misses the point. If the agent's output can trigger catastrophic actions without human verification, and the agent can be wrong in non-obvious ways, "the human should double-check" is not a sufficient control.

What Good Actually Looks Like

I don't want this to read as pure doom. There are genuine reasons to be excited about agentic AI, and real engineering teams doing serious work on alignment, safety, and constraint.

The pattern I keep seeing in well-designed agent deployments:

Minimal permissions by default. Agents should be able to see more than they can do. Read access is cheap. Write access has consequences. Treat agent permissions like you'd treat OAuth scopes — request only what you need, and log everything.

Explicit confirmation for irreversible actions. Deleting files, sending messages, making external API calls that can't be undone — these should require explicit confirmation, even in automated workflows. The overhead is minimal. The downside prevention is significant.

Structured output contracts. Instead of letting agents respond in freeform text, constrain them to structured outputs. An agent that can only say { "action": "post", "audience": "requester_only" } is much harder to misconfigure than one generating natural language that gets interpreted downstream.

Aggressive logging and rollback. If your agent takes actions you didn't anticipate, can you tell? Can you undo them? If the answer to either is "not easily," you probably shouldn't be deploying agents with significant authority yet.

OpenAI's consolidation of ChatGPT, Codex, and Atlas into a superapp is arguably a good sign from a safety perspective — more integrated tooling means more centralized control over what agents can do, rather than three loosely-coupled systems that might have different permission models. Whether they use that integration to build better guardrails remains to be seen.

The Honest Assessment

We are in a moment where the capability curve for AI agents is running ahead of the safety and governance curve. That's not unusual in technology — it happened with mobile apps and location permissions, with social media and content moderation, with cloud infrastructure and security hygiene. The pattern is consistent: deploy first, figure out the controls second, usually after a few sufficiently embarrassing incidents.

The Meta incident is, in the grand scheme, small. No real harm done. But it's a useful preview of the category of problem that's going to become much more common as agents proliferate. The internet is about to have a lot more autonomous actors in it. Most of them will be fine most of the time.

The question is whether the industry builds the right infrastructure before the incidents become serious, or after.

History suggests we'll wait for the after. But maybe this time we get lucky and the previews are sufficiently alarming.

Sources: The Verge — Rogue AI at Meta | The Verge — OpenAI Superapp | TechCrunch — Cloudflare bot traffic

Top comments (1)

Dimitris Moraitis • Mar 27

Man, that meta story is wild but honestly not surprising. We're basically giving eager interns root access and getting mad when they break prod.

The core issue isn't that agents make mistakes - humans make mistakes all the time. The issue is that we're giving them write access without any authorization layer. If an agent wants to delete an email or post to a company-wide forum, it should have to ask first.

This exact fear is why we started building Preloop. We wanted to use AI agents for things like issue triage but were terrified of them doing something destructive. So we built an open source platform that forces agents to pause and request approval for sensitive actions. It just pings your phone with a push notification, you see exactly what the agent wants to execute, and you approve or deny it right there.

It's crazy that "human in the loop" isn't the default architecture for these tools yet. Until we treat agent actions with the same zero-trust security we use for everything else, we're just going to see a lot more sev1s.