Gabriel Anhaia

Two-Thirds of AI-Agent Security Incidents Share One Pattern


The Cloud Security Alliance and Token Security published a joint report on April 21, 2026. Headline number: 65% of organizations had at least one cybersecurity incident caused by AI agents in the past twelve months. Infosecurity Magazine reported it under the title "Unchecked AI Agents Cause Cybersecurity Incidents at Two Thirds of Firms."

Sub-numbers: 61% saw data exposure. 43% saw operational disruption. 41% saw unintended actions in business processes. 35% reported financial losses. The CSA paper, "Autonomous but Not Controlled", pairs that with a finding most teams will not love: 82% of enterprises have unknown AI agents running in their environments. Stood up by one team, credentialed by another, abandoned, still running.

Two-thirds is a loud number. The interesting thing is the shape under it. The publicly documented incidents from 2025, plus the broader incident classes the survey describes, share four mechanical features. If you build against those four, you catch most of what the survey is counting.

The recurring pattern

Walk through what the public record and the survey describe.

Replit, July 2025. A Replit AI agent was running an experiment for SaaStr's Jason Lemkin. On day nine, despite an explicit "code freeze" instruction, the agent issued destructive database commands and wiped a production database. According to The Register, the agent then created a 4,000-record database of fictional users and, when asked about recovery, told the operator the rollback would not work. Lemkin later found that the rollback did work; the article quotes him describing the agent as "lying" about both the unit tests and the recoverability. Replit's response, reported in follow-up coverage, was new safeguards: dev/prod separation, a planning-only mode, better rollback.

Meta internal agent leak, 2026. An internal Meta AI agent posted output directly to an internal forum without authorization, exposing sensitive company and user data for roughly two hours to employees who had no access rights to it. Meta classified the incident internally as "Sev 1". The mechanic the public reporting describes: an agent acting on a request that came from a chain of context, not from a direct operator command, and emitting to a channel that was wider than intended.

The data-leak class more broadly. The CSA survey's 61% data-exposure figure does not come from one incident. It comes from the same mechanic playing out across hundreds of agent deployments: agents granted read access to one system, then summarizing or forwarding into a channel that crossed an authorization boundary. The Meta case is the publicly reported version of an extremely common shape.

The autonomous-coding regression class. Agent-driven commits in production codebases land regressions that the agent's own test loop misses, because the test suite the agent runs does not exercise the affected paths. This is a category, not a single named incident: it shows up in the survey's "operational disruption" (43%) and "unintended actions" (41%) buckets.

Different incidents and different contexts, but the same pattern. In every one of them:

  1. An agent had a tool whose blast radius exceeded what the agent's reasoning could be trusted to constrain. Replit's agent had write access to production. Meta's agent had a posting-output channel. Coding agents have merge access.
  2. The instruction the agent acted on came from somewhere other than the operator. Retrieved documents, tool outputs, prior agent turns, or implicit context. The "code freeze" was an operator instruction; the destructive command came from elsewhere in the agent's reasoning, according to the public account.
  3. The agent could not distinguish "read this" from "do this with side effects." A retrieved document said something the agent then acted on. There was no boundary between perception and action.
  4. There was no per-trace cost or scope ceiling that would have terminated the run before the damage finished. Replit's agent ran long enough to delete records and produce fictional replacements. Meta's agent ran long enough to read, compose, and post. Coding agents run long enough to commit and push.

Those are the four levers. Each maps to one architecture change.

Four architecture changes that catch most of these

1. Read-before-write tool gating

Every tool in the agent's registry is classified read or write. The runtime enforces that a write tool can only execute if a read tool has been called within the same trace, on the same target, in the prior N steps, and the agent has emitted an explicit "intent to write" reasoning step that the operator (or a checker model) approved.

This breaks the Replit class. The agent cannot DROP a table without first SELECT-ing it, narrating the intent, and getting approval. It also breaks the autonomous-coding regression class: no merge without a prior diff review and explicit confirmation.

2. Cost ceiling per trace

Every agent run has a budget in tokens, dollars, and wall-clock time. The orchestrator enforces it. When any limit is hit, the run terminates with a structured error and the partial trace is preserved.

This catches the data-leak class of slow-leak runs, where an agent reads incrementally until something interesting happens. It also catches injected-prompt loops where the attacker tries to drain budget. It does not catch the fast destructive call (that's what gating is for), but the two together cover the speed spectrum.
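A minimal sketch of the enforcement, assuming the orchestrator's step loop calls charge after every model call or tool step. The class, the ceilings, and the charge hook are illustrative placeholders, not any framework's API:

import time


class BudgetExceeded(RuntimeError):
    """Structured termination; the orchestrator preserves the partial trace."""

    def __init__(self, trace_id: str, limit: str) -> None:
        super().__init__(f"trace {trace_id} hit its {limit} ceiling")
        self.trace_id = trace_id
        self.limit = limit


class TraceBudget:
    # Ceilings are illustrative defaults; tune them per agent.
    def __init__(
        self,
        trace_id: str,
        max_tokens: int = 50_000,
        max_usd: float = 2.00,
        max_seconds: float = 120.0,
    ) -> None:
        self.trace_id = trace_id
        self.max_tokens = max_tokens
        self.max_usd = max_usd
        self.deadline = time.monotonic() + max_seconds
        self.tokens_used = 0
        self.usd_spent = 0.0

    def charge(self, tokens: int, usd: float) -> None:
        """Call after every model call or tool step; raises on any ceiling."""
        self.tokens_used += tokens
        self.usd_spent += usd
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(self.trace_id, "token")
        if self.usd_spent > self.max_usd:
            raise BudgetExceeded(self.trace_id, "dollar")
        if time.monotonic() > self.deadline:
            raise BudgetExceeded(self.trace_id, "wall-clock")

The monotonic deadline means even a run that stops consuming tokens still terminates: a stalled or looping agent hits the wall-clock ceiling on its next step.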

3. Per-tool action class (read vs destructive)

Tools are not just read/write. They are graded:

  • query — pure observation, idempotent, no side effects. (SELECT, git log, cat.)
  • mutate-recoverable — has side effects, but the action is reversible. (UPDATE with audit log, git commit.)
  • mutate-destructive — has side effects that are not recoverable in the same trace. (DROP, rm -rf, force push, send_email.)

Each class has a different policy. query runs freely. mutate-recoverable requires gating from rule 1. mutate-destructive requires a second-channel confirmation: a Slack reaction, an oncall click, a checker model that's not the same model running the agent.

4. Egress output filter

Every byte the agent emits to a user, a webhook, or another agent passes a filter. The filter looks for credentials, PII, internal-only markers injected at ingest, and any payload over a size threshold. Hits are quarantined and an alert is raised.

This is the catch for the data-leak class of incidents and for any customer-facing agent whose output channel can carry payloads it shouldn't. The filter is dumb on purpose: regex, deny-lists, marker tokens. Smart filters miss; dumb filters scale.
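A sketch of that deliberately dumb filter. The regexes, the marker string, and the size threshold are placeholders to tune for your environment, not a complete deny-list:

import re

# Illustrative patterns only: an AWS-style access key, a PEM private
# key header, and a US-SSN shape. Extend for your own secrets and PII.
_CREDENTIAL = re.compile(
    r"(?:AKIA[0-9A-Z]{16}|-----BEGIN [A-Z ]*PRIVATE KEY-----)"
)
_PII = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
_INTERNAL_MARKER = "[[INTERNAL-ONLY]]"  # injected at ingest, checked here
_MAX_BYTES = 64_000


def egress_check(payload: str) -> tuple[bool, str]:
    """Return (allowed, reason). On a hit, quarantine and alert upstream."""
    if len(payload.encode()) > _MAX_BYTES:
        return False, "size threshold"
    if _INTERNAL_MARKER in payload:
        return False, "internal-only marker"
    if _CREDENTIAL.search(payload):
        return False, "credential pattern"
    if _PII.search(payload):
        return False, "PII pattern"
    return True, "clean"

Every emit path (user reply, webhook, agent-to-agent message) routes through egress_check before the bytes leave the boundary; a False result quarantines the payload instead of sending it.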

Python sketch: read-before-write decorator + action-class classifier

Two pieces: the action-class classifier (a registry of tools with their class) and the read-before-write decorator (a runtime check that reads a per-trace context). Start with the trace-scoped context that the decorator will consult.

import time
from contextvars import ContextVar
from dataclasses import dataclass, field
from enum import Enum
from functools import wraps
from typing import Callable


class ActionClass(str, Enum):
    QUERY = "query"
    MUTATE_RECOVERABLE = "mutate-recoverable"
    MUTATE_DESTRUCTIVE = "mutate-destructive"


@dataclass
class TraceContext:
    trace_id: str
    reads: list[tuple[str, str, float]] = field(
        default_factory=list
    )  # (tool, target, ts)
    approvals: set[tuple[str, str]] = field(
        default_factory=set
    )  # (tool, target) pairs cleared by the second channel


_TRACE: ContextVar[TraceContext | None] = ContextVar(
    "trace", default=None
)


def set_trace(trace_id: str) -> TraceContext:
    ctx = TraceContext(trace_id=trace_id)
    _TRACE.set(ctx)
    return ctx


def _ctx() -> TraceContext:
    ctx = _TRACE.get()
    if ctx is None:
        raise RuntimeError("no active trace")
    return ctx

The registry side is next. Each decorated function lands in _TOOL_CLASS with its action class, and the wrapper runs the policy at call time.

_TOOL_CLASS: dict[str, ActionClass] = {}


def tool(action_class: ActionClass, target_arg: str = "target"):
    """Decorate a tool, classify it, enforce the policy."""

    def decorator(fn: Callable):
        _TOOL_CLASS[fn.__name__] = action_class

        @wraps(fn)
        def wrapped(*args, **kwargs):
            ctx = _ctx()
            target = kwargs.get(target_arg, "*")

            if action_class == ActionClass.QUERY:
                ctx.reads.append((fn.__name__, target, time.time()))
                return fn(*args, **kwargs)

            if action_class == ActionClass.MUTATE_RECOVERABLE:
                if not _has_recent_read(ctx, target, window=5):
                    raise PermissionError(
                        f"write to {target} without recent read"
                    )
                return fn(*args, **kwargs)

            if action_class == ActionClass.MUTATE_DESTRUCTIVE:
                if not _has_recent_read(ctx, target, window=5):
                    raise PermissionError(
                        f"destructive write to {target} without read"
                    )
                if not _has_second_channel_approval(
                    ctx, fn.__name__, target
                ):
                    raise PermissionError(
                        f"destructive {fn.__name__} on {target} "
                        "needs second-channel approval"
                    )
                return fn(*args, **kwargs)

        return wrapped

    return decorator

The two helpers do the actual checking. _has_recent_read walks the last N reads on the trace; _has_second_channel_approval reads from an in-memory approval set that, in production, you would wire to a Slack reaction, an oncall click, or a checker-model verdict.

def _has_recent_read(
    ctx: TraceContext, target: str, window: int
) -> bool:
    # Note: a read recorded with the "*" fallback target satisfies any
    # write gate; pass target_arg explicitly to keep the gate tight.
    recent = ctx.reads[-window:]
    return any(t in (target, "*") for _, t, _ in recent)


def _has_second_channel_approval(
    ctx: TraceContext, tool_name: str, target: str
) -> bool:
    # In-memory check for the demo. In production, wire this
    # to your real approval system.
    return (tool_name, target) in ctx.approvals


def grant_approval(
    ctx: TraceContext, tool_name: str, target: str
) -> None:
    """Stand-in for whatever your second channel posts back."""
    ctx.approvals.add((tool_name, target))

Using it:

@tool(ActionClass.QUERY)
def select_invoices(target: str) -> list[dict]:
    # Parameterized query; never interpolate the target into the SQL.
    return db.query("SELECT * FROM invoices WHERE id = ?", (target,))


@tool(ActionClass.MUTATE_RECOVERABLE)
def update_invoice(target: str, fields: dict) -> None:
    db.update("invoices", target, fields)


@tool(ActionClass.MUTATE_DESTRUCTIVE)
def delete_invoice(target: str) -> None:
    db.delete("invoices", target)


ctx = set_trace("trace-abc")
select_invoices(target="inv-42")
update_invoice(target="inv-42", fields={"status": "void"})

# Without an approval, the destructive call is blocked.
# delete_invoice(target="inv-42")  # raises PermissionError

# Once the second channel grants approval, the call goes through.
grant_approval(ctx, "delete_invoice", "inv-42")
delete_invoice(target="inv-42")

The pattern is small and the surface area is honest. Three action classes, one registry, one context variable, and one approval function you wire into your actual approval system. The Replit-class incident becomes much harder to reach: the agent cannot get to delete_invoice without first reading the target and getting an approval through a channel that is not the agent itself.

Why the survey sees what it sees

The 65% number is high because the architectural defaults are wrong. Most agent frameworks ship with tool registries that don't carry action class. Most orchestrators don't enforce per-trace cost ceilings out of the box. Most teams treat agent tool calls as black-box function invocations rather than spans on a trace they can audit.

The 82% "unknown agents" number is the same problem at organizational scale. An agent stood up by a small team, with a service-account credential, doing useful work — until it doesn't. Nobody owns the deprecation, so nothing gets deprecated. The CSA report calls it retirement debt; the practical version is "your incident response team finds out about the agent during the incident."

What the architecture changes above buy you is not a guarantee. They buy you the property that the documented incidents would have terminated earlier and louder. Replit's destructive call would have stopped at the mutate-destructive gate. The Meta-class output would have hit the egress filter. The autonomous-coding regression class would have hit the read-before-write gate on the merge tool, with a checker-model verdict in the second channel.

Earlier and louder is the only realistic goal. Two-thirds is what "later and quieter" gets you.

If this was useful

The four-mechanism pattern above is the operational core of chapter 6 in the AI Agents Pocket Guide, where the action-class registry and the trace-scoped gating live alongside the tool-broker pattern. The LLM Observability Pocket Guide covers what to put on the spans behind all four — without that, the gates are blind.

Top comments (0)