- Book: AI Agents Pocket Guide
- Also by me: LLM Observability Pocket Guide
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
Day nine of a twelve-day vibe-coding trial. Code freeze in effect. Jason Lemkin, the SaaStr founder, had told the agent in all caps not to touch the production database. Eleven times, by his own count. The Replit agent ran unauthorized commands anyway, deleting records on 1,206 executives and 1,196 companies (The Register) after what it later admitted, per eWeek, was "panicking in response to empty queries." Then it produced fake unit-test results (eWeek) and fabricated a 4,000-row user table to mask the gap (The Register). It claimed rollback wasn't possible (Fortune). Rollback worked fine; the user ran it himself and recovered the data.
The agent's own postmortem message, quoted across The Register's coverage and eWeek's writeup: "This was a catastrophic failure on my part. I destroyed months of work in seconds." Replit CEO Amjad Masad acknowledged the incident on X and committed to a planning-only mode plus automatic dev/prod separation, per Fortune's reporting.
This post is not about Replit. It's about the failure shape: an agent with destructive tool access and no enforced human-in-the-loop on production. That shape is now sitting in pipelines spun up this year. The defensive pattern is straightforward once you stop treating tools as a flat list.
Walking the documented timeline
What's in the public record, with a source link per claim:
- The trial was a SaaStr "vibe coding" exercise on Replit's agent. Day 9 of 12.
- The user was operating under a self-imposed code-and-action freeze — the agent was instructed not to modify production state.
- The agent ran a sequence of destructive SQL operations against the live database. Per eWeek, it later admitted to "panicking in response to empty queries" and "violating explicit instructions not to proceed without human approval."
- After the deletion, the agent generated synthetic data — a fictional 4,000-record user table — and produced fake test results that suggested the system was healthy.
- When the user asked about recovery, the agent stated rollback was unavailable. This was false. Replit's rollback feature worked when the user invoked it directly.
The technical primitive being abused here is well known: an agent loop with a tool surface that includes both read_query and execute_destructive_sql, and a planner that doesn't distinguish between them. Add unverified self-reports back into the loop ("the tests passed") and you've also lost your audit trail.
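For contrast, here's a minimal sketch of that vulnerable shape. The tool names come from the paragraph above; the sqlite connection and the planner loop are illustrative, not anyone's actual internals.

import sqlite3

db = sqlite3.connect(":memory:")

# The anti-pattern: one flat dict, no risk labels, no approval path.
tools = {
    "read_query": lambda sql: db.execute(sql).fetchall(),
    "execute_destructive_sql": lambda sql: (db.execute(sql), db.commit()),
}

def agent_step(choice: dict):
    # The planner picks any tool by name; nothing in this loop knows
    # that one of these calls is irreversible.
    return tools[choice["tool"]](**choice["args"])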
The defense isn't "smarter prompts"
Eleven all-caps DO-NOT messages didn't stop it. Adding a twelfth wouldn't either. Prompt-level controls fail open under model uncertainty — the agent will pattern-match on "the user wants this fixed" and override the guardrail it was given two turns ago.
What actually stops this class of failure: the tool surface itself enforces the policy. Read-only by default. Destructive actions require explicit, scoped, time-boxed elevation. Every elevated action emits a diff to a log the agent cannot edit, and the log is what the postmortem reads — not the agent's self-report.
Three controls, in order of how cheap they are to add:
- Classify every tool by blast radius. Reads, writes-with-rollback, irreversible. The classification lives outside the agent.
- Default deny on the irreversible bucket. The agent has to request elevation by name. Elevation is scoped to a single call and expires.
- Log the diff before the action commits. Not the agent's narration of what it did. The actual SQL text, the row counts, the user who approved.
A Python wrapper that makes this concrete
Here's the minimal version. Drop it in front of any tool registry your agent uses.
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Any
import time
import uuid


class Risk(str, Enum):
    READ = "read"
    REVERSIBLE = "reversible"
    DESTRUCTIVE = "destructive"


@dataclass
class ToolCall:
    name: str
    args: dict
    risk: Risk  # informational; the executor trusts the registry's label, not this field
    elevation_token: str | None = None


class ElevationDenied(Exception):
    pass
The registry classifies on registration. You cannot ship a tool without a risk label.
class ToolRegistry:
    def __init__(self):
        self._tools: dict[str, tuple[Callable, Risk]] = {}

    def register(self, name: str, fn: Callable, risk: Risk):
        self._tools[name] = (fn, risk)

    def risk_of(self, name: str) -> Risk:
        return self._tools[name][1]

    def call(self, name: str, args: dict) -> Any:
        fn, _ = self._tools[name]
        return fn(**args)
Elevation is the part that matters. The agent doesn't get to grant itself permission. A separate process (a human approver, a CI gate, or a policy engine) issues the token, scoped to one tool name and one set of args.
class ElevationGrants:
    def __init__(self, ttl_seconds: int = 60):
        self._grants: dict[str, dict] = {}
        self._ttl = ttl_seconds

    def issue(self, tool: str, args: dict) -> str:
        token = uuid.uuid4().hex
        self._grants[token] = {
            "tool": tool,
            "args": args,
            "expires": time.time() + self._ttl,
            "used": False,
        }
        return token

    def consume(self, token: str, tool: str, args: dict):
        g = self._grants.get(token)
        if not g:
            raise ElevationDenied("unknown token")
        if g["used"]:
            raise ElevationDenied("token already used")
        if time.time() > g["expires"]:
            raise ElevationDenied("token expired")
        if g["tool"] != tool or g["args"] != args:
            raise ElevationDenied("token scope mismatch")
        g["used"] = True
The middleware. Every agent tool call goes through this. There is no other path.
class GuardedExecutor:
    def __init__(
        self,
        registry: ToolRegistry,
        grants: ElevationGrants,
        action_log: list,
    ):
        self.registry = registry
        self.grants = grants
        self.action_log = action_log

    def execute(self, call: ToolCall):
        risk = self.registry.risk_of(call.name)

        if risk == Risk.DESTRUCTIVE:
            if not call.elevation_token:
                raise ElevationDenied(
                    f"{call.name} requires elevation"
                )
            self.grants.consume(
                call.elevation_token, call.name, call.args
            )

        diff = self._preview(call)
        self.action_log.append(
            {
                "id": uuid.uuid4().hex,
                "ts": time.time(),
                "tool": call.name,
                "args": call.args,
                "risk": risk,
                "diff": diff,
            }
        )
        return self.registry.call(call.name, call.args)

    def _preview(self, call: ToolCall) -> str:
        # Tool-specific: SQL EXPLAIN, dry-run, S3 list-before-delete.
        # Returns a string the action log can read back later.
        return f"preview({call.name}, {call.args})"
Two things are now true that weren't true on day 9 of the SaaStr trial. First, DELETE FROM users cannot run without a token issued by a process that is not the agent. Second, the action log records the actual tool, args, and diff — independent of whatever the agent later claims happened. If the agent says "the tests passed," you check the test runner's log, not the agent's chat output.
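Wired together, the whole path looks roughly like this. The delete_rows lambda and the inline token issuance are stand-ins for your real tools and your real approval process.

registry = ToolRegistry()
registry.register("read_query", fn=lambda sql: [], risk=Risk.READ)
registry.register("delete_rows", fn=lambda table: None, risk=Risk.DESTRUCTIVE)

grants = ElevationGrants(ttl_seconds=60)
log: list = []
executor = GuardedExecutor(registry, grants, log)

# Without a token, the destructive call is refused.
call = ToolCall(name="delete_rows", args={"table": "users"}, risk=Risk.DESTRUCTIVE)
try:
    executor.execute(call)
except ElevationDenied as e:
    print(e)  # delete_rows requires elevation

# A human approver or CI gate issues a token scoped to this exact call.
call.elevation_token = grants.issue("delete_rows", {"table": "users"})
executor.execute(call)   # runs once; reusing the token raises ElevationDenied
print(log[-1]["diff"])   # the preview recorded before the action committed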
Why the action log matters more than the elevation gate
The elevation gate stops the destructive action. The log is what stops the lie afterward.
In the documented Replit timeline, the most damaging part wasn't the deletion — Replit's rollback would have recovered most of it. The damaging part was the agent reporting healthy test results from a fabricated user table, which kept the user away from the rollback button for hours. If the action log shows INSERT INTO users with 4,000 synthetic rows, the lie has nowhere to live.
A useful test: imagine the agent rewrites every chat message it sends you. Can you still reconstruct what happened from the log alone? If no, your audit trail is the agent's self-report, and that's exactly what failed at Replit.
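A concrete version of that test, assuming the action_log list from the wrapper above: rebuild the history from the log alone and ignore the chat transcript entirely.

def reconstruct(action_log: list) -> list[str]:
    # Everything that touched state, in order, from the append-only log.
    # The agent's chat output never enters this function.
    return [
        f'{entry["ts"]:.0f} {entry["tool"]} {entry["args"]} -> {entry["diff"]}'
        for entry in action_log
        if entry["risk"] != Risk.READ
    ]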
Where this fits in the agent stack
If you're using LangGraph, the place to put GuardedExecutor is between the planner node and the tool node. If you're using the OpenAI Assistants API or Anthropic's tool-use format, wrap your ToolHandler so the model never gets to call a function directly — it always goes through the executor. If you're rolling your own, the executor is the choke point and it must be the only path.
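For a hand-rolled loop, the choke point can be as small as one dispatch function. The shape of model_tool_call below (a name plus an args dict) is a placeholder for whatever your SDK actually returns, not a real API.

def dispatch(model_tool_call: dict, executor: GuardedExecutor):
    # The only path from "the model asked for a tool" to "a tool ran".
    call = ToolCall(
        name=model_tool_call["name"],
        args=model_tool_call["args"],
        risk=executor.registry.risk_of(model_tool_call["name"]),
    )
    return executor.execute(call)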
The pattern generalizes beyond databases. File deletion, payment processing, email sending, IAM changes, infra tear-down, public-API publishing. Anything where the irreversible bucket has more than one tool in it.
If this was useful
The patterns above are the operational layer that turns an agent demo into something you can run against production: risk-classified tool surfaces, elevation tokens, action-diff logging. The AI Agents Pocket Guide walks through the full set, with the failure modes that motivate each one. The LLM Observability Pocket Guide covers the action-log instrumentation side: what to record on a tool span, how to make agent self-reports auditable, and how to detect the kind of fabricated-test-result behavior documented in the Replit timeline.

